Capstone Project Overview

Meredith Newhouse
4 min read · May 5, 2021

In this post I give a high-level overview of my final data science capstone project for Flatiron School. In a following post I will dive deeper into the code and methods I used to complete the project.

For my final project, I wanted to analyze something relevant and impactful in the world today. This quickly led me to the COVID-19 pandemic. I began this course in the summer of 2020 and spent most of the pandemic learning data science and analysis, so COVID-19 felt like the perfect subject. It is a topic with ample data as well as real-world relevance. My question became: have mandates been effective in mitigating the virus, and if so, which ones?

I found that the CDC has APIs for its data, and I also found a plethora of data on Data.gov and USAfacts.org. Through this research I was able to find, for every county in every state, the number of cases by date throughout the pandemic, as well as state mandates by county and date for restaurants, gatherings, and masks. I aggregated all of this into one data table per state. Each table was organized by county and date, showing which mandates were in place and the corresponding COVID cases on those dates. I also had the population of each county, so I could create a new column for the case rate by dividing the cases by the population.

Sample of the Colorado data table (the actual table has 20,928 rows)
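
As a rough illustration of the case rate column, here is a minimal pandas sketch. The column names (`county`, `date`, `cases`, `population`) and all of the numbers are made up for the example and are not taken from my actual table.

```python
import pandas as pd

# Hypothetical rows: one row per county per date (all values are invented).
df = pd.DataFrame({
    "county": ["Adams", "Adams", "Boulder", "Boulder"],
    "date": pd.to_datetime(
        ["2020-07-01", "2020-07-02", "2020-07-01", "2020-07-02"]
    ),
    "cases": [5000, 5100, 1500, 1520],
    "population": [500000, 500000, 300000, 300000],
})

# Case rate = cases divided by the county population.
df["case_rate"] = df["cases"] / df["population"]

print(df[["county", "date", "case_rate"]])
```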

After a lot, and I mean a lot, of data cleaning and manipulation, I was able to begin the modeling process. I decided to build a model, first for Colorado, that could predict the case rate based on the mandates that were in place. Since I was predicting a numeric value, this required a regression model. After some modeling and the use of GridSearchCV, I created my final Gradient Boosting Regression model.
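
To give a feel for this step, here is a minimal scikit-learn sketch of tuning a `GradientBoostingRegressor` with `GridSearchCV`. The data is synthetic, and the mandate values, the parameter grid, and the scoring choice are all assumptions made for the example, not the settings from my project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the real table: categorical mandate columns
# and a made-up case rate. None of these values come from the project.
mandates = pd.DataFrame({
    "restaurants": rng.choice(["closed", "limited", "open"], size=n),
    "gatherings": rng.choice(["banned", "limited", "none"], size=n),
    "masks": rng.choice(["required", "not_required"], size=n),
})
case_rate = (
    0.01
    + 0.01 * (mandates["restaurants"] == "open")
    + 0.005 * (mandates["masks"] == "not_required")
    + rng.normal(0, 0.002, size=n)
)

# One-hot encode the categorical mandate values.
X = pd.get_dummies(mandates)
y = case_rate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid (an assumption, not the grid I actually searched).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training split by default.
best_model = search.best_estimator_
print(search.best_params_)
```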

Because so many factors affect the case rate, and because Colorado often put a mandate in place and kept it the whole time (so there was not much variation in the data), the best model I could create had the following errors:

  • Mean Absolute Error: 0.008024926274175495
  • Mean Squared Error: 0.0003489350913987028
  • Root Mean Squared Error: 0.01867980437260259
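
For reference, these three metrics can be computed with scikit-learn and NumPy as in the sketch below. The `y_true` and `y_pred` arrays are invented values, used only to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented true and predicted case rates, only to show the calculation.
y_true = np.array([0.010, 0.020, 0.015, 0.030])
y_pred = np.array([0.012, 0.018, 0.020, 0.025])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error squared
rmse = np.sqrt(mse)                         # square root of the MSE

print(f"MAE:  {mae}")
print(f"MSE:  {mse}")
print(f"RMSE: {rmse}")
```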

Those numbers look small, but the average case rate in Colorado was 0.01687 with a standard deviation of 0.0277. So the errors weren’t terrible, but they weren’t great either.

However, I wondered how my model would do with a state that used mandates very differently than Colorado, so next I ran a model for Florida. Florida was more likely to have no mandates at all, or to implement mandates and then remove them. Florida even had another value under restaurant orders, ‘Authorized to fully reopen’, which Colorado never had. Not surprisingly, the model did better with Florida’s data. Florida’s average case rate was 0.02791 and the errors of the model were:

  • Mean Absolute Error: 0.011297040577091784
  • Mean Squared Error: 0.0003035919001125868
  • Root Mean Squared Error: 0.017423888776980492
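
One way to compare the two states on the same footing is to express each Mean Absolute Error as a fraction of that state’s average case rate. This is only a rough comparison, and the ratio is my own choice of yardstick, but using the numbers reported above it works out to about 48% for Colorado and about 40% for Florida.

```python
# Reported values from the two models (MAE and average case rate).
results = {
    "Colorado": {"mae": 0.008024926274175495, "mean_case_rate": 0.01687},
    "Florida": {"mae": 0.011297040577091784, "mean_case_rate": 0.02791},
}

# MAE as a fraction of the average case rate, so the states are comparable.
relative_mae = {
    state: vals["mae"] / vals["mean_case_rate"]
    for state, vals in results.items()
}

for state, ratio in relative_mae.items():
    print(f"{state}: MAE is {ratio:.1%} of the average case rate")
```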

When I looked at feature importance, the most important feature for my Florida model was the restaurant orders, most likely due to that extra category.
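
As a sketch of how feature importance can be read off a fitted gradient boosting model, the example below sums the importances of the one-hot columns back to their original mandate column. The data is synthetic and deliberately built so that the restaurant column drives the target; the category names and the `=` separator are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 400

# Synthetic mandate data (invented values, not the Florida table).
mandates = pd.DataFrame({
    "restaurants": rng.choice(["closed", "limited", "reopened"], size=n),
    "gatherings": rng.choice(["banned", "none"], size=n),
    "masks": rng.choice(["required", "not_required"], size=n),
})

# Made-up target driven mostly by the restaurant column.
y = (
    0.01
    + 0.02 * (mandates["restaurants"] == "reopened")
    + rng.normal(0, 0.002, size=n)
)

# One-hot encode; columns look like "restaurants=reopened".
X = pd.get_dummies(mandates, prefix_sep="=")
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importance per one-hot column, then summed per original mandate column.
importances = pd.Series(model.feature_importances_, index=X.columns)
by_feature = (
    importances.groupby(lambda col: col.split("=")[0])
    .sum()
    .sort_values(ascending=False)
)
print(by_feature)
```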

The modeling and some post-modeling EDA suggested that mandates may have had an impact on the case rate.

Florida’s case rate and restaurant orders
Colorado’s case rate and restaurant orders

I could see that overall Colorado had a lower case rate and implemented more mandates than Florida. I could also see that Florida’s restaurant mandates, and lack of mandates, really helped the model predict the case rate. This suggests that having restaurants fully reopen, along with no mask mandates and fewer gathering bans, may be associated with a higher COVID case rate.

In the US, where travel between states is frequent, isolating individual states and the impact of their mandates is difficult. However, the data from two states that implemented mandates very differently, Florida and Colorado, suggests that mandates may have helped mitigate the impact of the virus. From the model, I could see that the restaurant orders, and the removal of restrictions, in Florida had an impact on the case rate predictions made by the model.

In conclusion, I believe the use of mandates is important. Governments should continue to implement them based on current risks during the COVID-19 pandemic, and should use them in future pandemics.
