In 2019, the world was exposed to SARS-CoV-2, a virus that devastated the world in all aspects (socially, politically, and economically) with nearly 6 million deaths to date. To understand the disease, one must understand the morphology of SARS-CoV-2, specifically its spike glycoprotein. The spike protein binds to the host cell membrane via its N-terminal region and fuses with it, allowing the virus to enter the cell, where it is replicated. There have been other outbreaks of viruses with spike proteins, such as SARS and MERS. Many biological factors have led to this global pandemic, including the high transmissibility of the virus as well as deadly complications such as MIS-C (Multisystem Inflammatory Syndrome in Children) and multi-system organ failure induced by the cytokine storm. The Omicron variant has proven highly mutable, with a total of 50 point mutations (36 of them on the spike protein), making it one of the most transmissible variants to date.
Due to the effect SARS-CoV-2 has had on the world, we took an interest in understanding the effects of the virus on different communities. The goal was to create a predictive model from a variety of factors (vaccination status, testing, comorbidities, etc.) based on this dataset from Our World in Data: Github, OWID. After initial extraction, the data was transformed using Pandas, loaded into a database using Postgres, and finally visualized using Tableau. A random forest machine learning model was used to start the analysis and then to estimate the impact of COVID on communities. The purpose of the research is to see whether any factors could contribute to large outbreaks.
The Google Slides can be found here.
- Python (sklearn, matplotlib)
- Jupyter Notebook
- PostgreSQL / SQLAlchemy
- Tableau
Our raw dataset contained 151,776 rows x 67 columns. Initial inspection showed that the 'date' column was not the correct datatype and had to be converted using pd.to_datetime. Null values and sparsely populated features had to be dropped to ensure the model's efficiency and accuracy. Ultimately, the cleaned dataset contained 40,877 rows x 26 columns. Below are the detailed steps taken in cleaning the dataset.
- Removed all iso codes starting with OWID, as those are aggregated rows rather than per-country counts
- Dropped all columns starting with "excess", as they are mostly null
- Dropped all columns with fewer than 10k populated rows
- Dropped cumulative columns, as they are running totals rather than daily counts
- Dropped all columns with less than half their rows populated
- Dropped all smoothed columns
- Dropped all rows containing NA values
- Dropped the iso code column, as it is redundant with Location
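The steps above can be sketched in Pandas roughly as follows. This is an illustrative sketch, not the project's exact notebook code; column names follow the public OWID schema, and the thresholds mirror the bullet points.

```python
import pandas as pd

def clean_owid(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a raw OWID frame (sketch)."""
    df = df.copy()
    # Parse the date column into a proper datetime dtype.
    df["date"] = pd.to_datetime(df["date"])
    # Aggregate rows (e.g. "OWID_WRL") are pre-summed, not per-country counts.
    df = df[~df["iso_code"].str.startswith("OWID")]
    # Excess-mortality columns are mostly null; drop them wholesale.
    df = df.drop(columns=[c for c in df.columns if c.startswith("excess")])
    # Keep only columns with at least half their rows populated.
    df = df.dropna(axis=1, thresh=len(df) // 2)
    # Cumulative ("total_...") and smoothed series duplicate the daily counts.
    df = df.drop(columns=[c for c in df.columns
                          if c.startswith("total") or "smoothed" in c])
    # Drop remaining null rows and the iso code (redundant with location).
    return df.dropna().drop(columns=["iso_code"])
```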
- Removed the new_deaths column from the DataFrame to serve as the target, setting the remaining columns as X
- Used LabelEncoder on the Date and Location columns
- Used OneHotEncoder on the continent column
- Merged the encoded continent columns back into the features DataFrame and dropped the original continent column
- Used StandardScaler on all columns
RandomForestClassifier
For the data analysis, the initial plan was to use RandomForestClassifier to determine a country's predicted state of COVID risk. The model forecasts a future month's number of COVID deaths based on a number of features. We used sklearn's model_selection module, imported the train_test_split function, and used the default testing/training split (75/25). Only the last 3 months of data were analyzed when considering vaccination/booster efforts. The features are as follows:
| Features | | |
|---|---|---|
| Country | Date | New Cases |
| New Cases | Reproduction Rate | Stringency Index |
| Population | Population Density | Median Age |
| Number of Age 65+ | Number of Age 70+ | GDP per Capita |
| Extreme Poverty | Cardiovascular Health | Diabetes Prevalence |
| Female Smokers | Male Smokers | Hospital Beds per Thousand |
| Life Expectancy | Human Development Index | |
Based on these features, the model classifies a country's forecasted monthly COVID deaths as Bad, Worse, or Worst. This involves bucketing the projected outcomes (e.g. <10,000 deaths - Bad, 10,000 to 50,000 deaths - Worse, >50,000 deaths - Worst).
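Bucketing like this can be done with pandas' `pd.cut`; the thresholds below follow the example in the text and are illustrative only.

```python
import pandas as pd

# Hypothetical monthly death forecasts to classify.
forecasts = pd.Series([2_500, 12_000, 75_000, 48_000])

# Bucket into the three risk classes described above:
# under 10,000 -> Bad, 10,000-50,000 -> Worse, above 50,000 -> Worst.
risk = pd.cut(
    forecasts,
    bins=[0, 10_000, 50_000, float("inf")],
    labels=["Bad", "Worse", "Worst"],
)
print(list(risk))  # ['Bad', 'Worse', 'Worst', 'Worse']
```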
RandomForestRegression
In the middle of the analysis, to better represent the time series data, we opted for regression instead, using RandomForestRegressor. This keeps our findings cleaner, with a definite projected number per month in which we can pinpoint environmental seasonalities (vaccinations starting, vaccine availability, boosters, re-openings, etc.). We used the same features as above, but over all available dates instead of only the last 3 months. We also experimented with New Cases as the target variable: if the number of COVID deaths plateaus because of vaccination efforts, looking at New Cases instead will keep our model relevant.
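A minimal sketch of the regression setup, using synthetic stand-in data in place of the project's feature matrix (hyperparameters are illustrative, not the tuned values).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in data; in the project, X holds the features listed above and
# y is new_deaths (or new_cases in the alternate experiment).
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Default 75/25 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# score() returns R2 for regressors, matching the metrics reported below.
print(f"Testing R2: {model.score(X_test, y_test):.2%}")
```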
Boosters & Bonus
We also experimented with GradientBoosting, DecisionTreeRegressor, KNeighborsRegressor, LassoCV, SGDRegressor, BayesianRidge, SVR, and XGBRegressor regression on our DataFrame, then added AdaBoost regression on top of the three best-performing models.
- Overall, many of the models achieved an R2 score of 94% or higher. We tested 10 different regression models and used loops to tune their parameters in order to achieve the best possible results. The results were as follows:
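The comparison loop might look roughly like this; only a few of the ten models are shown, running on synthetic stand-in data rather than the project's features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A subset of the models compared; the project also ran SGDRegressor,
# BayesianRidge, SVR, XGBRegressor, and others.
models = {
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "KNeighbors": KNeighborsRegressor(),
    "LassoCV": LassoCV(),
}

# Fit each model and record training/testing R2 scores.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: Training R2 {train_r2:.2%}, Testing R2 {test_r2:.2%}")
```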
| Model | Training R2 | Testing R2 |
|---|---|---|
| Random Forest Regression | 99.07% | **94.16%** |
| Gradient Boosting Regression | 97.99% | 94.06% |
| Decision Tree Regression | 99.51% | 91.13% |
| Extreme Gradient Boosting Regression | 97.41% | **94.56%** |
| Histogram-based Gradient Boosting Regression Tree | 95.01% | 94.32% |
| Regression based on k-nearest neighbors | 100% | 90.12% |
| Lasso linear model with iterative fitting along a regularization path | 18.31% | 16.40% |
| Stochastic Gradient Descent Regression | 69.50% | 67.32% |
| Bayesian Ridge Regression | 69.59% | 67.26% |
| Support Vector Machine | 19.48% | 18.93% |
AdaBoost trials with the top three models:

| Trial | Models | Testing R2 |
|---|---|---|
| 1 | RandomForestRegressor, AdaBoost | 93.58% |
| 2 | XGBRegressor, AdaBoost | 94.13% |
| 3 | HistGradientBoostingRegressor, AdaBoost | 93.35% |
Trials combining multiple regressors:

| Trial | Models | Training R2 | Testing R2 |
|---|---|---|---|
| 1 | KNeighborsRegressor, DecisionTreeRegressor, HistGradientBoostingRegressor, LassoCV, XGBRegressor, RandomForestRegressor | 97.03% | **94.52%** |
| 2 | KNeighborsRegressor, RandomForestRegressor, HistGradientBoostingRegressor, LassoCV, DecisionTreeRegressor, XGBRegressor | 95.89% | 93.02% |
| 3 | KNeighborsRegressor, RandomForestRegressor, LassoCV, DecisionTreeRegressor, HistGradientBoostingRegressor, XGBRegressor | 95.88% | 93.02% |
| 4 | XGBRegressor, KNeighborsRegressor, RandomForestRegressor, LassoCV, DecisionTreeRegressor, HistGradientBoostingRegressor, XGBRegressor | 96.24% | 93.98% |
| 5 | RandomForestRegressor, HistGradientBoostingRegressor, XGBRegressor | 95.86% | 93.40% |
| 6 | HistGradientBoostingRegressor, XGBRegressor | 93.52% | 93.71% |
| 7 | RandomForestRegressor, XGBRegressor | 97.31% | 92.84% |
| 8 | XGBRegressor, RandomForestRegressor | 93.35% | 91.33% |
| 9 | RandomForestRegressor, KNeighborsRegressor, XGBRegressor | 97.63% | 92.07% |
| 10 | RandomForestRegressor, LassoCV, XGBRegressor | 96.49% | 92.27% |
Although the results were close, Trial 1's testing R2 of 94.52% fell just short of the Extreme Gradient Boosting Regressor's individual result.
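The multi-model trials above are consistent with sklearn's `StackingRegressor`; here is a sketch under that assumption (the choice of base estimators, the RidgeCV final estimator, and the synthetic data are all illustrative guesses, not the project's configuration).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two base estimators whose out-of-fold predictions feed a final estimator.
stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=RidgeCV(),
)
stack.fit(X_train, y_train)
print(f"Testing R2: {stack.score(X_test, y_test):.2%}")
```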
Stacking w/ XGBRegressor: when stacked with another regressor, XGBRegressor's results worsened from its individual testing R2 of 94.56%.
| Trial | Models | Training R2 | Testing R2 |
|---|---|---|---|
| 1 | RandomForestRegressor, XGBRegressor | 97.31% | 92.84% |
| 2 | KNeighborsRegressor, XGBRegressor | 97.63% | 88.75% |
| 3 | LassoCV, XGBRegressor | 49.24% | 48.95% |
| 4 | DecisionTreeRegressor, XGBRegressor | 97.21% | 90.40% |
| 5 | HistGradientBoostingRegressor, XGBRegressor | 93.52% | 93.71% |
| 6 | SVR, XGBRegressor | 41.74% | 38.93% |
| 7 | BayesianRidge, XGBRegressor | 72.95% | 73.39% |
| 8 | SGDRegressor, XGBRegressor | 72.54% | 72.59% |
| 9 | GradientBoostingRegressor, XGBRegressor | 96.11% | 92.63% |
Stacking w/ RandomForestRegressor: when stacked with another regressor, RandomForestRegressor's results worsened from its individual testing R2 of 94.16%.
| Trial | Models | Training R2 | Testing R2 |
|---|---|---|---|
| 1 | XGBRegressor, RandomForestRegressor | 93.35% | 91.33% |
| 2 | KNeighborsRegressor, RandomForestRegressor | 97.63% | 88.75% |
| 3 | LassoCV, RandomForestRegressor | 49.29% | 48.94% |
| 4 | DecisionTreeRegressor, RandomForestRegressor | 93.32% | 88.35% |
| 5 | HistGradientBoostingRegressor, RandomForestRegressor | 91.52% | 89.87% |
| 6 | SVR, RandomForestRegressor | 1.64% | -0.91% |
| 7 | BayesianRidge, RandomForestRegressor | 67.43% | 63.53% |
| 8 | SGDRegressor, RandomForestRegressor | 62.14% | 61.93% |
| 9 | GradientBoostingRegressor, RandomForestRegressor | 93.54% | 89.69% |