Analyzing the epidemiological outbreak of COVID‐19
Published on Feb 7, 2024 9:30 PM
COVID-19's impact has been severe from both a public-health perspective and an economic one. Plenty has been written about it, especially statistical reports on its exponential growth and the importance of “flattening the curve”.
In this Playground I want to help raise awareness of the issues associated with the spread of COVID-19 by analyzing the situation with Python and Data Science.
In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
# external library
import plotly.express as px
%matplotlib inline
Reading the dataset¶
We will load the Johns Hopkins University | Covid-19 Confirmed Cases Globally (daily)
dataset, which contains daily data about COVID-19 confirmed cases globally.
Let's load the data and quickly analyze its columns and values:
In [5]:
!ls -l /data/time-series-covid19-confirmed_global-csv
total 1780 -rw-rw-r-- 1 nobody nogroup 1819904 Feb 2 11:08 time_series_covid19_confirmed_global.csv
In [6]:
covid_confirmed = pd.read_csv('/data/time-series-covid19-confirmed_global-csv/time_series_covid19_confirmed_global.csv')
print(covid_confirmed.shape)
covid_confirmed.head()
(289, 1147)
Out[6]:
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | NaN | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
5 rows × 1147 columns
We are using a DataFrame
to store our data. A pandas DataFrame
is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
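As a minimal sketch of those properties (the labels and values here are invented for illustration):

```python
import pandas as pd

# A tiny DataFrame: labeled rows (the index) and labeled columns,
# with potentially different dtypes per column
df = pd.DataFrame(
    {"country": ["Spain", "Italy"], "cases": [100, 250]},
    index=["row_a", "row_b"],
)
print(df.shape)                   # (2, 2)
print(df.loc["row_b", "cases"])   # 250
```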
So far we have our dataset loaded, let's analyze it!
Cleaning our data¶
Another important step before diving into the analysis is cleaning the data.
As this dataset is already quite clean, we'll just replace Mainland China with China, and fill some missing values.
In [7]:
# assigning the result avoids inplace=True on a column selection, which is deprecated in recent pandas
covid_confirmed['Country/Region'] = covid_confirmed['Country/Region'].replace('Mainland China', 'China')
In [8]:
covid_confirmed[['Province/State']] = covid_confirmed[['Province/State']].fillna('')
covid_confirmed.fillna(0, inplace=True)
Let's check for null/empty values before continuing:
In [9]:
covid_confirmed.isna().sum().sum()
Out[9]:
0
Analysis (worldwide impact) and Data Wrangling¶
With the data loaded, we will start by aggregating all the cases so we can quickly see what's going on in the world.
In [10]:
covid_confirmed_count = covid_confirmed.iloc[:, 4:].sum().max()
covid_confirmed_count
Out[10]:
676570149
Let's make a convenient plot showing how these cases increased day by day.
As we want to analyze daily worldwide aggregated values, let's drop the unused columns (Province/State, Country/Region, Lat, Long) and sum the remaining date columns:
In [11]:
covid_worldwide_confirmed = covid_confirmed.iloc[:, 4:].sum(axis=0)
covid_worldwide_confirmed.index = pd.to_datetime(covid_worldwide_confirmed.index, format='mixed')
covid_worldwide_confirmed /= 1_000_000
covid_worldwide_confirmed.tail()
Out[11]:
2023-03-05 676.024901 2023-03-06 676.082941 2023-03-07 676.213378 2023-03-08 676.392824 2023-03-09 676.570149 dtype: float64
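Note that this series is cumulative, so daily new cases can be recovered with diff(); a small sketch with made-up numbers (not values from the dataset):

```python
import pandas as pd

# Toy cumulative confirmed-case counts indexed by date
cumulative = pd.Series(
    [10, 15, 22, 30],
    index=pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
)
# diff() turns cumulative totals into per-day new cases;
# the first value has no predecessor, so we fill it with the initial total
daily_new = cumulative.diff().fillna(cumulative.iloc[0])
print(daily_new.tolist())  # [10.0, 5.0, 7.0, 8.0]
```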
In [12]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_worldwide_confirmed.index, y=covid_worldwide_confirmed, sort=False, linewidth=2)
ax.lines[0].set_linestyle("--")
plt.suptitle("COVID-19 worldwide cases over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Number of cases [million]')
ax.legend(['Confirmed'])
plt.show()
Let's plot the same data on a logarithmic scale, which makes the early exponential growth easier to see:
In [13]:
fig, ax = plt.subplots(figsize=(16, 6))
ax.set(yscale="log")
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda y, _: '{:g}'.format(y)))
sns.lineplot(x=covid_worldwide_confirmed.index, y=covid_worldwide_confirmed, sort=False, linewidth=2)
ax.lines[0].set_linestyle("--")
plt.suptitle("COVID-19 worldwide cases over time", fontsize=16, fontweight='bold', color='white')
plt.title("(logarithmic scale)", color='white')
plt.xticks(rotation=45)
plt.ylabel('Number of cases [million]')
ax.legend(['Confirmed'])
plt.show()
Visualizing worldwide COVID-19 cases in a map¶
Now we'll group rows that share the same Country/Region value, so all the values from each country collapse into a single aggregated row.
To do that we'll use the sum() method over each country's rows.
In [108]:
covid_confirmed_agg = covid_confirmed.groupby('Country/Region').sum().reset_index()
covid_confirmed_agg
Out[108]:
Country/Region | Province/State | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 33.939110 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 | |
1 | Albania | 41.153300 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 | |
2 | Algeria | 28.033900 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 | |
3 | Andorra | 42.506300 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 | |
4 | Angola | -11.202700 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196 | West Bank and Gaza | 31.952200 | 35.233200 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | 703228 | |
197 | Winter Olympics 2022 | 39.904200 | 116.407400 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | 535 | |
198 | Yemen | 15.552727 | 48.516388 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | 11945 | |
199 | Zambia | -13.133897 | 27.849332 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 343012 | 343012 | 343079 | 343079 | 343079 | 343135 | 343135 | 343135 | 343135 | 343135 | |
200 | Zimbabwe | -19.015438 | 29.154857 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 263921 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264127 | 264276 | 264276 |
201 rows × 1147 columns
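A tiny example of this groupby-and-sum pattern (countries and counts invented for illustration):

```python
import pandas as pd

# Two provinces of the same country collapse into one row per country
df = pd.DataFrame({
    "Country/Region": ["China", "China", "Italy"],
    "cases": [5, 7, 3],
})
agg = df.groupby("Country/Region", as_index=False)["cases"].sum()
print(agg)
#   Country/Region  cases
# 0          China     12
# 1          Italy      3
```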
As there can be several Provinces/States within the same country, we'll compute the mean latitude and longitude for each country and then drop the Province/State column.
In [109]:
covid_confirmed_agg.loc[:, ['Lat', 'Long']] = covid_confirmed[['Country/Region', 'Lat', 'Long']].groupby('Country/Region').mean().reset_index().loc[:, ['Lat', 'Long']]
In [110]:
covid_confirmed_agg.drop('Province/State', axis=1, inplace=True)
Our data is now ready, but in the wrong format: we need to transform it from wide to long format. To do that we'll use the melt() pandas method.
In [111]:
print(covid_confirmed_agg.shape)
covid_confirmed_agg.head()
(201, 1146)
Out[111]:
Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
1 | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
2 | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
3 | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
4 | Angola | -11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |
5 rows × 1146 columns
Before continuing, let's save the country coordinates to use later:
In [112]:
country_coords = covid_confirmed_agg[['Country/Region', 'Lat', 'Long']].drop_duplicates()
country_coords.head()
Out[112]:
Country/Region | Lat | Long | |
---|---|---|---|
0 | Afghanistan | 33.93911 | 67.709953 |
1 | Albania | 41.15330 | 20.168300 |
2 | Algeria | 28.03390 | 1.659600 |
3 | Andorra | 42.50630 | 1.521800 |
4 | Angola | -11.20270 | 17.873900 |
Use the melt method to create the proper data structure:
In [113]:
covid_confirmed_agg_long = pd.melt(covid_confirmed_agg,
id_vars=covid_confirmed_agg.iloc[:, :3],
var_name='date',
value_vars=covid_confirmed_agg.iloc[:, 3:],
value_name='date_confirmed_cases')
covid_confirmed_agg_long.drop(['Lat', 'Long'], axis=1, inplace=True)
covid_confirmed_agg_long.head()
Out[113]:
Country/Region | date | date_confirmed_cases | |
---|---|---|---|
0 | Afghanistan | 1/22/20 | 0 |
1 | Albania | 1/22/20 | 0 |
2 | Algeria | 1/22/20 | 0 |
3 | Andorra | 1/22/20 | 0 |
4 | Angola | 1/22/20 | 0 |
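The wide-to-long reshape that melt() performs is easier to see on a tiny frame (values invented; same column names as above):

```python
import pandas as pd

wide = pd.DataFrame({
    "Country/Region": ["Spain", "Italy"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
})
# Each date column becomes rows: one (country, date, value) triple per cell
long_df = pd.melt(wide, id_vars=["Country/Region"],
                  var_name="date", value_name="date_confirmed_cases")
print(long_df.shape)  # (4, 3)
```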
And resample the data into yearly buckets:
In [114]:
covid_confirmed_agg_long.set_index('date', inplace=True)
covid_confirmed_agg_long.index = pd.to_datetime(covid_confirmed_agg_long.index, format="mixed")
covid_confirmed_agg_long.head()
Out[114]:
Country/Region | date_confirmed_cases | |
---|---|---|
date | ||
2020-01-22 | Afghanistan | 0 |
2020-01-22 | Albania | 0 |
2020-01-22 | Algeria | 0 |
2020-01-22 | Andorra | 0 |
2020-01-22 | Angola | 0 |
In [115]:
covid_confirmed_agg_long_y = covid_confirmed_agg_long.groupby('Country/Region').resample('Y').sum().drop('Country/Region', axis=1).reset_index()
covid_confirmed_agg_long_y.head()
Out[115]:
Country/Region | date | date_confirmed_cases | |
---|---|---|---|
0 | Afghanistan | 2020-12-31 | 8501751 |
1 | Afghanistan | 2021-12-31 | 39518380 |
2 | Afghanistan | 2022-12-31 | 67783564 |
3 | Afghanistan | 2023-12-31 | 14184774 |
4 | Albania | 2020-12-31 | 3727544 |
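The same yearly bucketing can be seen on a toy daily series (values invented; note that resample('Y') labels each bucket with the calendar year end, which is why the dates above all fall on December 31):

```python
import pandas as pd

daily = pd.Series(
    [1, 2, 3, 4],
    index=pd.to_datetime(["2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02"]),
)
# Sum all observations that fall within each calendar year
yearly = daily.resample("Y").sum()
print(yearly)
# 2020-12-31    3
# 2021-12-31    7
```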
Now add the Lat and Long values to each country/region:
In [116]:
covid_confirmed_agg_long_y = covid_confirmed_agg_long_y.merge(
country_coords,
how='inner',
left_on='Country/Region',
right_on='Country/Region'
)
covid_confirmed_agg_long_y.head()
Out[116]:
Country/Region | date | date_confirmed_cases | Lat | Long | |
---|---|---|---|---|---|
0 | Afghanistan | 2020-12-31 | 8501751 | 33.93911 | 67.709953 |
1 | Afghanistan | 2021-12-31 | 39518380 | 33.93911 | 67.709953 |
2 | Afghanistan | 2022-12-31 | 67783564 | 33.93911 | 67.709953 |
3 | Afghanistan | 2023-12-31 | 14184774 | 33.93911 | 67.709953 |
4 | Albania | 2020-12-31 | 3727544 | 41.15330 | 20.168300 |
Finally, plot a map showing that data:
In [117]:
# we need a string value, that's why we are parsing the date as string
covid_confirmed_agg_long_y['date'] = covid_confirmed_agg_long_y['date'].astype(str)
fig = px.scatter_geo(covid_confirmed_agg_long_y,
lat="Lat", lon="Long", color="Country/Region",
hover_name="Country/Region", size="date_confirmed_cases",
size_max=50, animation_frame="date",
template='plotly_dark', projection="natural earth",
title="COVID-19 worldwide confirmed cases over time")
#fig.show()
Which are the top-10 countries with the most confirmed cases?¶
In [118]:
totals_country = covid_confirmed_agg_long.groupby('Country/Region').sum().reset_index()
totals_country.head()
Out[118]:
Country/Region | date_confirmed_cases | |
---|---|---|
0 | Afghanistan | 129988469 |
1 | Albania | 185562654 |
2 | Algeria | 182741650 |
3 | Andorra | 24547525 |
4 | Angola | 60025203 |
In [119]:
top_10_confirmed = totals_country.sort_values(by='date_confirmed_cases', ascending=False).head(10)
top_10_confirmed
Out[119]:
Country/Region | date_confirmed_cases | |
---|---|---|
186 | US | 53813184406 |
80 | India | 29131119694 |
24 | Brazil | 21182690594 |
63 | France | 16105911886 |
67 | Germany | 13686043720 |
190 | United Kingdom | 12118271679 |
147 | Russia | 10578569842 |
86 | Italy | 10083161678 |
184 | Turkey | 8840742699 |
94 | Korea, South | 8467888968 |
In [120]:
plt.barh(top_10_confirmed.sort_values(by='date_confirmed_cases', ascending=True)['Country/Region'],
top_10_confirmed.sort_values(by='date_confirmed_cases', ascending=True)['date_confirmed_cases'])
plt.xlabel('Confirmed cases')
plt.ylabel('Country')
plt.title('Top-10 countries with more confirmed cases')
plt.show()
Country analysis over time¶
Another useful graphic is the evolution of confirmed cases per country over time.
We can reuse the already calculated covid_confirmed_agg_long dataframe, which contains daily cumulative values per country.
Using that dataframe we will filter by the country we want to analyze:
In [126]:
covid_US = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'US']
covid_US = covid_US.drop(covid_US.tail(1).index)
covid_US
Out[126]:
Country/Region | date_confirmed_cases | |
---|---|---|
date | ||
2020-01-22 | US | 1 |
2020-01-23 | US | 1 |
2020-01-24 | US | 2 |
2020-01-25 | US | 2 |
2020-01-26 | US | 5 |
... | ... | ... |
2023-03-04 | US | 103650837 |
2023-03-05 | US | 103646975 |
2023-03-06 | US | 103655539 |
2023-03-07 | US | 103690910 |
2023-03-08 | US | 103755771 |
1142 rows × 2 columns
In [127]:
covid_China = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'China']
covid_China = covid_China.drop(covid_China.tail(1).index)
covid_Italy = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Italy']
covid_Italy = covid_Italy.drop(covid_Italy.tail(1).index)
covid_Germany = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Germany']
covid_Germany = covid_Germany.drop(covid_Germany.tail(1).index)
covid_Spain = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Spain']
covid_Spain = covid_Spain.drop(covid_Spain.tail(1).index)
covid_Argentina = covid_confirmed_agg_long[covid_confirmed_agg_long['Country/Region'] == 'Argentina']
covid_Argentina = covid_Argentina.drop(covid_Argentina.tail(1).index)
In [158]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_US.index, y=covid_US['date_confirmed_cases'], sort=False, linewidth=2, label="US")
plt.suptitle("COVID-19 cases per country over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Confirmed cases [million]')
plt.legend()
plt.show()
In [155]:
fig, ax = plt.subplots(figsize=(16, 6))
sns.lineplot(x=covid_US.index, y=covid_US['date_confirmed_cases'], sort=False, linewidth=2, label="US")
sns.lineplot(x=covid_China.index, y=covid_China['date_confirmed_cases'], sort=False, linewidth=2, label="China")
sns.lineplot(x=covid_Italy.index, y=covid_Italy['date_confirmed_cases'], sort=False, linewidth=2, label="Italy")
sns.lineplot(x=covid_Germany.index, y=covid_Germany['date_confirmed_cases'], sort=False, linewidth=2, label="Germany")
sns.lineplot(x=covid_Spain.index, y=covid_Spain['date_confirmed_cases'], sort=False, linewidth=2, label="Spain")
sns.lineplot(x=covid_Argentina.index, y=covid_Argentina['date_confirmed_cases'], sort=False, linewidth=2, label="Argentina")
plt.suptitle("COVID-19 cases per country over time", fontsize=16, fontweight='bold', color='white')
plt.xticks(rotation=45)
plt.ylabel('Confirmed cases [million]')
plt.legend()
plt.show()
What's next?¶
You can now Copy this Playground and try to forecast the next 2 weeks of COVID-19 cases using regression techniques.
A basic approach could be a linear regression:
$$ y = a \cdot x + b $$
To create your Machine Learning model you can use the scikit-learn library.
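A minimal sketch of that idea with scikit-learn, assuming a simple day-number feature and toy cumulative counts (real case curves are far from linear, so treat this only as a starting point):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy cumulative case counts for 10 consecutive days (invented values)
y = np.array([10, 14, 19, 23, 29, 33, 38, 44, 48, 53], dtype=float)
X = np.arange(len(y)).reshape(-1, 1)  # day number as the single feature

model = LinearRegression().fit(X, y)  # fits y = a*x + b

# Forecast the next 14 days by extrapolating the day number
future = np.arange(len(y), len(y) + 14).reshape(-1, 1)
forecast = model.predict(future)
print(forecast.round(1))
```

From here you could swap in polynomial features or a proper time-series model to capture the non-linear growth.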