Analyzing the epidemiological outbreak of COVID‐19
Published on Feb 7, 2024 9:30 PM
COVID-19's impact has been severe from both a public-health perspective and an economic one. Plenty has been written about it, especially statistical reports on its exponential growth and the importance of “flattening the curve”.
In this Playground I want to help raise awareness of the issues associated with the spread of COVID-19 by analyzing the situation with Python and Data Science.
In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
# external library
import plotly.express as px
%matplotlib inline
Reading the dataset¶
We will load the Johns Hopkins University | Covid-19 Confirmed Cases Globally (daily)
dataset, which contains daily data about COVID-19 confirmed cases globally.
Let's load the data and quickly analyze its columns and values:
In [5]:
!ls -l /data/time-series-covid19-confirmed_global-csv
total 1780 -rw-rw-r-- 1 nobody nogroup 1819904 Feb 2 11:08 time_series_covid19_confirmed_global.csv
In [6]:
covid_confirmed = pd.read_csv('/data/time-series-covid19-confirmed_global-csv/time_series_covid19_confirmed_global.csv')
print(covid_confirmed.shape)
covid_confirmed.head()
(289, 1147)
Out[6]:
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | NaN | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
5 rows × 1147 columns
We are using a DataFrame
to store our data. A pandas DataFrame
is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
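As a minimal sketch of those properties (the labels and values here are invented for illustration):

```python
import pandas as pd

# A tiny DataFrame: labeled rows (the index) and labeled columns,
# with potentially different dtypes per column
df = pd.DataFrame(
    {"country": ["Spain", "Italy"], "cases": [100, 250]},
    index=["row_a", "row_b"],
)
print(df.shape)                   # (2, 2)
print(df.loc["row_b", "cases"])   # 250
```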
So far we have our dataset loaded, let's analyze it!
Cleaning our data¶
Another important step before diving into the analysis is cleaning the data.
As this dataset is already quite clean, we'll just replace Mainland China with China, and fill some missing values.
In [7]:
# assigning the result avoids inplace=True on a column selection, which is deprecated in recent pandas
covid_confirmed['Country/Region'] = covid_confirmed['Country/Region'].replace('Mainland China', 'China')
In [8]:
covid_confirmed[['Province/State']] = covid_confirmed[['Province/State']].fillna('')
covid_confirmed.fillna(0, inplace=True)
Let's check for null/empty values before continuing:
In [9]:
covid_confirmed.isna().sum().sum()
Out[9]:
0
Analysis (worldwide impact) and Data Wrangling¶
With the data loaded, we will start by aggregating all the cases so we can quickly see what's going on in the world.
In [10]:
covid_confirmed_count = covid_confirmed.iloc[:, 4:].sum().max()
covid_confirmed_count
Out[10]:
676570149
Let's make a convenient plot showing how these cases increased day by day.
As we want to analyze daily worldwide aggregated values, let's drop the unused columns (Province/State, Country/Region, Lat, Long) and sum the remaining date columns:
In [11]:
covid_worldwide_confirmed = covid_confirmed.iloc[:, 4:].sum(axis=0)
covid_worldwide_confirmed.index = pd.to_datetime(covid_worldwide_confirmed.index, format='mixed')
covid_worldwide_confirmed /= 1_000_000
covid_worldwide_confirmed.tail()
Out[11]:
2023-03-05 676.024901 2023-03-06 676.082941 2023-03-07 676.213378 2023-03-08 676.392824 2023-03-09 676.570149 dtype: float64
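Note that this series is cumulative, so daily new cases can be recovered with diff(); a small sketch with made-up numbers (not values from the dataset):

```python
import pandas as pd

# Toy cumulative confirmed-case counts indexed by date
cumulative = pd.Series(
    [10, 15, 22, 30],
    index=pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
)
# diff() turns cumulative totals into per-day new cases;
# the first value has no predecessor, so we fill it with the initial total
daily_new = cumulative.diff().fillna(cumulative.iloc[0])
print(daily_new.tolist())  # [10.0, 5.0, 7.0, 8.0]
```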
In [12]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_worldwide_confirmed.index, y=covid_worldwide_confirmed, sort=False, linewidth=2)
ax.lines[0].set_linestyle("--")
plt.suptitle("COVID-19 worldwide cases over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Number of cases [million]')
ax.legend(['Confirmed'])
plt.show()
Let's plot the same data on a logarithmic scale, which makes the early exponential growth easier to see:
In [13]:
fig, ax = plt.subplots(figsize=(16, 6))
ax.set(yscale="log")
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda y, _: '{:g}'.format(y)))
sns.lineplot(x=covid_worldwide_confirmed.index, y=covid_worldwide_confirmed, sort=False, linewidth=2)
ax.lines[0].set_linestyle("--")
plt.suptitle("COVID-19 worldwide cases over time", fontsize=16, fontweight='bold', color='white')
plt.title("(logarithmic scale)", color='white')
plt.xticks(rotation=45)
plt.ylabel('Number of cases [million]')
ax.legend(['Confirmed'])
plt.show()
Visualizing worldwide COVID-19 cases in a map¶
Now we'll group rows that share the same Country/Region value, so all the values from each country collapse into a single aggregated row.
To do that we'll use the sum() method over each country's rows.
In [108]:
covid_confirmed_agg = covid_confirmed.groupby('Country/Region').sum().reset_index()
covid_confirmed_agg
Out[108]:
Country/Region | Province/State | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 33.939110 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 | |
1 | Albania | 41.153300 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 | |
2 | Algeria | 28.033900 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 | |
3 | Andorra | 42.506300 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 | |
4 | Angola | -11.202700 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196 | West Bank and Gaza | 31.952200 | 35.233200 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | |
197 | Winter Olympics 2022 | 39.904200 | 116.407400 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | |
198 | Yemen | 15.552727 | 48.516388 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | |
199 | Zambia | -13.133897 | 27.849332 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 343012 | 343012 | 343079 | 343079 | 343079 | 343135 | 343135 | 343135 | 343135 | 343135 | |
200 | Zimbabwe | -19.015438 | 29.154857 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263921 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264276 | 264276 |
201 rows × 1147 columns
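A tiny example of this groupby-and-sum pattern (countries and counts invented for illustration):

```python
import pandas as pd

# Two provinces of the same country collapse into one row per country
df = pd.DataFrame({
    "Country/Region": ["China", "China", "Italy"],
    "cases": [5, 7, 3],
})
agg = df.groupby("Country/Region", as_index=False)["cases"].sum()
print(agg)
#   Country/Region  cases
# 0          China     12
# 1          Italy      3
```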
As there can be several Provinces/States within the same country, we'll compute the mean latitude and longitude for each country and then drop the Province/State column.
In [109]:
covid_confirmed_agg.loc[:, ['Lat', 'Long']] = covid_confirmed[['Country/Region', 'Lat', 'Long']].groupby('Country/Region').mean().reset_index().loc[:, ['Lat', 'Long']]
In [110]:
covid_confirmed_agg.drop('Province/State', axis=1, inplace=True)
Our data is now ready, but in the wrong format: we need to transform it from wide to long format. To do that we'll use the melt() pandas method.
In [111]:
print(covid_confirmed_agg.shape)
covid_confirmed_agg.head()
(201, 1146)
Out[111]:
Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
5 rows × 1146 columns
Before continuing, let's save the country coordinates to use later:
In [112]:
country_coords = covid_confirmed_agg[['Country/Region', 'Lat', 'Long']].drop_duplicates()
country_coords.head()
Out[112]:
Country/Region | Lat | Long | |
---|---|---|---|
0 | Afghanistan | 33.93911 | 67.709953 |
1 | Albania | 41.15330 | 20.168300 |
2 | Algeria | 28.03390 | 1.659600 |
3 | Andorra | 42.50630 | 1.521800 |
4 | Angola | -11.20270 | 17.873900 |
Use the melt method to create the proper data structure:
In [113]:
covid_confirmed_agg_long = pd.melt(covid_confirmed_agg,
id_vars=covid_confirmed_agg.iloc[:, :3],
var_name='date',
value_vars=covid_confirmed_agg.iloc[:, 3:],
value_name='date_confirmed_cases')
covid_confirmed_agg_long.drop(['Lat', 'Long'], axis=1, inplace=True)
covid_confirmed_agg_long.head()
Out[113]:
Country/Region | date | date_confirmed_cases | |
---|---|---|---|
0 | Afghanistan | 1/22/20 | 0 |
1 | Albania | 1/22/20 | 0 |
2 | Algeria | 1/22/20 | 0 |
3 | Andorra | 1/22/20 | 0 |
4 | Angola | 1/22/20 | 0 |
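The wide-to-long reshape that melt() performs is easier to see on a tiny frame (values invented; same column names as above):

```python
import pandas as pd

wide = pd.DataFrame({
    "Country/Region": ["Spain", "Italy"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
})
# Each date column becomes rows: one (country, date, value) triple per cell
long_df = pd.melt(wide, id_vars=["Country/Region"],
                  var_name="date", value_name="date_confirmed_cases")
print(long_df.shape)  # (4, 3)
```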
And resample the data into yearly buckets:
In [114]:
covid_confirmed_agg_long.set_index('date', inplace=True)
covid_confirmed_agg_long.index = pd.to_datetime(covid_confirmed_agg_long.index, format="mixed")
covid_confirmed_agg_long.head()
Out[114]:
Country/Region | date_confirmed_cases | |
---|---|---|
date | ||
2020-01-22 | Afghanistan | 0 |
2020-01-22 | Albania | 0 |
2020-01-22 | Algeria | 0 |
2020-01-22 | Andorra | 0 |
2020-01-22 | Angola | 0 |
In [115]:
covid_confirmed_agg_long_y = covid_confirmed_agg_long.groupby('Country/Region').resample('Y').sum().drop('Country/Region', axis=1).reset_index()
covid_confirmed_agg_long_y.head()
Out[115]:
Country/Region | date | date_confirmed_cases | |
---|---|---|---|
0 | Afghanistan | 2020-12-31 | 8501751 |
1 | Afghanistan | 2021-12-31 | 39518380 |
2 | Afghanistan | 2022-12-31 | 67783564 |
3 | Afghanistan | 2023-12-31 | 14184774 |
4 | Albania | 2020-12-31 | 3727544 |
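The same yearly bucketing can be seen on a toy daily series (values invented; note that resample('Y') labels each bucket with the calendar year end, which is why the dates above all fall on December 31):

```python
import pandas as pd

daily = pd.Series(
    [1, 2, 3, 4],
    index=pd.to_datetime(["2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02"]),
)
# Sum all observations that fall within each calendar year
yearly = daily.resample("Y").sum()
print(yearly)
# 2020-12-31    3
# 2021-12-31    7
```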
Now add the Lat and Long values to each country/region:
In [116]:
covid_confirmed_agg_long_y = covid_confirmed_agg_long_y.merge(
country_coords,
how='inner',
left_on='Country/Region',
right_on='Country/Region'
)
covid_confirmed_agg_long_y.head()
Out[116]:
Country/Region | date | date_confirmed_cases | Lat | Long | |
---|---|---|---|---|---|
0 | Afghanistan | 2020-12-31 | 8501751 | 33.93911 | 67.709953 |
1 | Afghanistan | 2021-12-31 | 39518380 | 33.93911 | 67.709953 |
2 | Afghanistan | 2022-12-31 | 67783564 | 33.93911 | 67.709953 |
3 | Afghanistan | 2023-12-31 | 14184774 | 33.93911 | 67.709953 |
4 | Albania | 2020-12-31 | 3727544 | 41.15330 | 20.168300 |
Finally, plot a map showing that data:
In [117]:
# we need a string value, that's why we are parsing the date as string
covid_confirmed_agg_long_y['date'] = covid_confirmed_agg_long_y['date'].astype(str)
fig = px.scatter_geo(covid_confirmed_agg_long_y,
lat="Lat", lon="Long", color="Country/Region",
hover_name="Country/Region", size="date_confirmed_cases",
size_max=50, animation_frame="date",
template='plotly_dark', projection="natural earth",
title="COVID-19 worldwide confirmed cases over time")
#fig.show()
Which are the top-10 countries with the most confirmed cases?¶
In [118]:
totals_country = covid_confirmed_agg_long.groupby('Country/Region').sum().reset_index()
totals_country.head()
Out[118]:
Country/Region | date_confirmed_cases | |
---|---|---|
0 | Afghanistan | 129988469 |
1 | Albania | 185562654 |
2 | Algeria | 182741650 |
3 | Andorra | 24547525 |
4 | Angola | 60025203 |
In [119]:
top_10_confirmed = totals_country.sort_values(by='date_confirmed_cases', ascending=False).head(10)
top_10_confirmed
Out[119]:
Country/Region | date_confirmed_cases | |
---|---|---|
186 | US | 53813184406 |
80 | India | 29131119694 |
24 | Brazil | 21182690594 |
63 | France | 16105911886 |
67 | Germany | 13686043720 |
190 | United Kingdom | 12118271679 |
147 | Russia | 10578569842 |
86 | Italy | 10083161678 |
184 | Turkey | 8840742699 |
94 | Korea, South | 8467888968 |
In [120]:
plt.barh(top_10_confirmed.sort_values(by='date_confirmed_cases', ascending=True)['Country/Region'],
top_10_confirmed.sort_values(by='date_confirmed_cases', ascending=True)['date_confirmed_cases'])
plt.xlabel('Confirmed cases')
plt.ylabel('Country')
plt.title('Top-10 countries with more confirmed cases')
plt.show()
Country analysis over time¶
Another useful graphic is the evolution of confirmed cases per country over time.
We can reuse the already calculated covid_confirmed_agg_long dataframe, which contains daily cumulative values per country.
Using that dataframe we will filter by the country we want to analyze:
In [126]:
covid_US = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'US']
covid_US = covid_US.drop(covid_US.tail(1).index)
covid_US
Out[126]:
Country/Region | date_confirmed_cases | |
---|---|---|
date | ||
2020-01-22 | US | 1 |
2020-01-23 | US | 1 |
2020-01-24 | US | 2 |
2020-01-25 | US | 2 |
2020-01-26 | US | 5 |
... | ... | ... |
2023-03-04 | US | 103650837 |
2023-03-05 | US | 103646975 |
2023-03-06 | US | 103655539 |
2023-03-07 | US | 103690910 |
2023-03-08 | US | 103755771 |
1142 rows × 2 columns
In [127]:
covid_China = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'China']
covid_China = covid_China.drop(covid_China.tail(1).index)
covid_Italy = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Italy']
covid_Italy = covid_Italy.drop(covid_Italy.tail(1).index)
covid_Germany = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Germany']
covid_Germany = covid_Germany.drop(covid_Germany.tail(1).index)
covid_Spain = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Spain']
covid_Spain = covid_Spain.drop(covid_Spain.tail(1).index)
covid_Argentina = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Argentina']
covid_Argentina = covid_Argentina.drop(covid_Argentina.tail(1).index)
In [158]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_US.index, y=covid_US['date_confirmed_cases'], sort=False, linewidth=2, label="US")
plt.suptitle("COVID-19 cases per country over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Confirmed cases [million]')
plt.legend()
plt.show()
In [155]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_US.index, y=covid_US['date_confirmed_cases'], sort=False, linewidth=2, label="US")
sns.lineplot(x=covid_China.index, y=covid_China['date_confirmed_cases'], sort=False, linewidth=2, label="China")
sns.lineplot(x=covid_Italy.index, y=covid_Italy['date_confirmed_cases'], sort=False, linewidth=2, label="Italy")
sns.lineplot(x=covid_Germany.index, y=covid_Germany['date_confirmed_cases'], sort=False, linewidth=2, label="Germany")
sns.lineplot(x=covid_Spain.index, y=covid_Spain['date_confirmed_cases'], sort=False, linewidth=2, label="Spain")
sns.lineplot(x=covid_Argentina.index, y=covid_Argentina['date_confirmed_cases'], sort=False, linewidth=2, label="Argentina")
plt.suptitle("COVID-19 cases per country over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Confirmed cases [million]')
plt.legend()
plt.show()
What's next?¶
You can now Copy this Playground and try to forecast the next 2 weeks of COVID-19 cases using regression techniques.
A basic approach could be a linear regression:
$$ y = a \cdot x + b $$
To create your Machine Learning model you can use the scikit-learn library.
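A minimal sketch of that idea with scikit-learn, assuming a simple day-number feature and toy cumulative counts (real case curves are far from linear, so treat this only as a starting point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy cumulative case counts for 10 consecutive days (invented values)
y = np.array([10, 14, 19, 23, 29, 33, 38, 44, 48, 53], dtype=float)
X = np.arange(len(y)).reshape(-1, 1)  # day number as the single feature

model = LinearRegression().fit(X, y)  # fits y = a*x + b

# Forecast the next 14 days by extrapolating the day number
future = np.arange(len(y), len(y) + 14).reshape(-1, 1)
forecast = model.predict(future)
print(forecast.round(1))
```

From here you could swap in polynomial features or a proper time-series model to capture the non-linear growth.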