Statement of Completion#4e91be43
Data Cleaning with Pandas
easy
Handling Null Values: Athletes Events Data
Resolution
Activities
In [31]:
import numpy as np
import pandas as pd
In [32]:
import warnings
warnings.filterwarnings('ignore')
In [33]:
df = pd.read_csv('athlete_events_final.csv')
In [34]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 271116 entries, 0 to 271115 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 271116 non-null int64 1 Name 271116 non-null object 2 Sex 271116 non-null object 3 Age 261642 non-null float64 4 Height 210945 non-null float64 5 Weight 208241 non-null float64 6 Team 271116 non-null object 7 NOC 271111 non-null object 8 Games 271116 non-null object 9 Year 271116 non-null int64 10 Season 271116 non-null object 11 City 271116 non-null object 12 Sport 271116 non-null object 13 Event 271116 non-null object 14 Medal 39783 non-null object dtypes: float64(3), int64(2), object(10) memory usage: 31.0+ MB
In [35]:
df.describe()
Out[35]:
ID | Age | Height | Weight | Year | |
---|---|---|---|---|---|
count | 271116.000000 | 261642.000000 | 210945.000000 | 208241.000000 | 271116.000000 |
mean | 68248.954396 | 25.556898 | 175.338970 | 70.702393 | 1978.378480 |
std | 39022.286345 | 6.393561 | 10.518462 | 14.348020 | 29.877632 |
min | 1.000000 | 10.000000 | 127.000000 | 25.000000 | 1896.000000 |
25% | 34643.000000 | 21.000000 | 168.000000 | 60.000000 | 1960.000000 |
50% | 68205.000000 | 24.000000 | 175.000000 | 70.000000 | 1988.000000 |
75% | 102097.250000 | 28.000000 | 183.000000 | 79.000000 | 2002.000000 |
max | 135571.000000 | 97.000000 | 226.000000 | 214.000000 | 2016.000000 |
1. What is the code that we use to get the count of null values column-wise?¶
In [36]:
df.isnull().sum()
Out[36]:
ID 0 Name 0 Sex 0 Age 9474 Height 60171 Weight 62875 Team 0 NOC 5 Games 0 Year 0 Season 0 City 0 Sport 0 Event 0 Medal 231333 dtype: int64
2. Find and Store Null Values in Each Column¶
In [37]:
null_count = df.isnull().sum() # Enter your code here
3. Which of the following columns doesn't contain any missing values?¶
In [ ]:
4. What is the column that has the second highest number of null values in the DataFrame df
?¶
In [38]:
print(null_count)
ID 0 Name 0 Sex 0 Age 9474 Height 60171 Weight 62875 Team 0 NOC 5 Games 0 Year 0 Season 0 City 0 Sport 0 Event 0 Medal 231333 dtype: int64
5. Calculate Percentage of Null Values in Each Column¶
In [39]:
null_percentage_df = (df.isnull().sum()/len(df)) * 100 # Enter your code here
6. What is the column that has the highest percentage of missing values in the DataFramedf
?¶
In [ ]:
7. Which function is used to drop rows that contain missing values?¶
In [ ]:
8. Remove Columns with More Than 50%
Missing Values¶
In [40]:
threshold = len(df) * 0.5
df = df.dropna(axis=1, thresh=threshold)
9. Drop All Rows with Null Values¶
In [41]:
df_no_null = df.dropna() # Enter your code here
df_no_null = df_no_null.reset_index(drop=True)
10. Which of the follwoing imputation method is most suitable for filling missing data in continuous numerical columns?¶
In [ ]:
11. Which of the following is the best imputation to use for normally distributed data without outliers?¶
In [ ]:
12. Which method is used to fill missing values with the mean of a column?¶
In [ ]:
13. What does interpolation do when handling missing data?¶
In [ ]:
14. Fill Missing Values with Median¶
In [42]:
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Height'].fillna(df['Height'].median(), inplace=True)
df['Weight'].fillna(df['Weight'].median(), inplace=True)
You can see below that there are 5 missing values in the NOC
(National Olympic Committee) column. Now, let's handle missing values in that column.
Since there are only 5
missing values in that column, we don't need to use imputation techniques unnecessarily. Instead, we can simply enter the NOCs of the teams, which is quite straightforward. The NOC
for the United States
is USA
, Japan
is JPN
, Greece
is GRE
, Italy
is ITA
, and Bulgaria
is BUL
.
15. Fill Null Values in the NOC
Column¶
In [46]:
df['NOC'] = df['NOC'].fillna(df['Team'].map({
'United States': 'USA',
'Japan': 'JPN',
'Greece': 'GRE',
'Italy': 'ITA',
'Bulgaria': 'BUL'
}))
df
Out[46]:
ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball |
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight |
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | 175.0 | 70.0 | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football |
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | 175.0 | 70.0 | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War |
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
271111 | 135569 | Andrzej ya | M | 29.0 | 179.0 | 89.0 | Poland-1 | POL | 1976 Winter | 1976 | Winter | Innsbruck | Luge | Luge Mixed (Men)'s Doubles |
271112 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Individual |
271113 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Team |
271114 | 135571 | Tomasz Ireneusz ya | M | 30.0 | 185.0 | 96.0 | Poland | POL | 1998 Winter | 1998 | Winter | Nagano | Bobsleigh | Bobsleigh Men's Four |
271115 | 135571 | Tomasz Ireneusz ya | M | 34.0 | 185.0 | 96.0 | Poland | POL | 2002 Winter | 2002 | Winter | Salt Lake City | Bobsleigh | Bobsleigh Men's Four |
271116 rows × 14 columns
In [ ]: