Statement of Completion#0f73580e
Data Cleaning with Pandas
easy
Handling Null Values: Athletes Events Data
Resolution
Activities
Project.ipynb
In [1]:
import numpy as np
import pandas as pd
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
df = pd.read_csv('athlete_events_final.csv')
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 271116 entries, 0 to 271115 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 271116 non-null int64 1 Name 271116 non-null object 2 Sex 271116 non-null object 3 Age 261642 non-null float64 4 Height 210945 non-null float64 5 Weight 208241 non-null float64 6 Team 271116 non-null object 7 NOC 271111 non-null object 8 Games 271116 non-null object 9 Year 271116 non-null int64 10 Season 271116 non-null object 11 City 271116 non-null object 12 Sport 271116 non-null object 13 Event 271116 non-null object 14 Medal 39783 non-null object dtypes: float64(3), int64(2), object(10) memory usage: 31.0+ MB
In [5]:
df.describe()
Out[5]:
| ID | Age | Height | Weight | Year | |
|---|---|---|---|---|---|
| count | 271116.000000 | 261642.000000 | 210945.000000 | 208241.000000 | 271116.000000 |
| mean | 68248.954396 | 25.556898 | 175.338970 | 70.702393 | 1978.378480 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.348020 | 29.877632 |
| min | 1.000000 | 10.000000 | 127.000000 | 25.000000 | 1896.000000 |
| 25% | 34643.000000 | 21.000000 | 168.000000 | 60.000000 | 1960.000000 |
| 50% | 68205.000000 | 24.000000 | 175.000000 | 70.000000 | 1988.000000 |
| 75% | 102097.250000 | 28.000000 | 183.000000 | 79.000000 | 2002.000000 |
| max | 135571.000000 | 97.000000 | 226.000000 | 214.000000 | 2016.000000 |
1. What is the code that we use to get the count of null values column-wise?¶
In [ ]:
2. Find and Store Null Values in Each Column¶
In [10]:
null_count = df.isnull().sum()
print(null_count)
... # Enter your code here
ID 0 Name 0 Sex 0 Age 9474 Height 60171 Weight 62875 Team 0 NOC 5 Games 0 Year 0 Season 0 City 0 Sport 0 Event 0 Medal 231333 dtype: int64
Out[10]:
Ellipsis
3. Which of the following columns doesn't contain any missing values?¶
In [ ]:
4. What is the column that has the second highest number of null values in the DataFrame df ?¶
In [ ]:
5. Calculate Percentage of Null Values in Each Column¶
In [13]:
null_percentage_df = (df.isnull().sum()/df.shape[0]) *100
print(null_percentage_df)
ID 0.000000 Name 0.000000 Sex 0.000000 Age 3.494445 Height 22.193821 Weight 23.191180 Team 0.000000 NOC 0.001844 Games 0.000000 Year 0.000000 Season 0.000000 City 0.000000 Sport 0.000000 Event 0.000000 Medal 85.326207 dtype: float64
6. What is the column that has the highest percentage of missing values in the DataFramedf?¶
In [ ]:
7. Which function is used to drop rows that contain missing values?¶
In [ ]:
8. Remove Columns with More Than 50% Missing Values¶
In [14]:
threshold = len(df) * 0.5
df = df.dropna(axis = 1, thresh = threshold )
9. Drop All Rows with Null Values¶
In [18]:
df_no_null = df.dropna(axis = 0).reset_index(drop = True)
10. Which of the follwoing imputation method is most suitable for filling missing data in continuous numerical columns?¶
In [ ]:
11. Which of the following is the best imputation to use for normally distributed data without outliers?¶
In [ ]:
12. Which method is used to fill missing values with the mean of a column?¶
In [ ]:
13. What does interpolation do when handling missing data?¶
In [ ]:
14. Fill Missing Values with Median¶
In [20]:
df["Age"].fillna(df["Age"].median(), inplace = True)
df["Height"].fillna(df["Height"].median(), inplace = True)
df["Weight"].fillna(df['Weight'].median(), inplace = True)
You can see below that there are 5 missing values in the NOC (National Olympic Committee) column. Now, let's handle missing values in that column.
Since there are only 5 missing values in that column, we don't need to use imputation techniques unnecessarily. Instead, we can simply enter the NOCs of the teams, which is quite straightforward. The NOC for the United States is USA, Japan is JPN, Greece is GRE, Italy is ITA, and Bulgaria is BUL.
15. Fill Null Values in the NOC Column¶
In [22]:
df['NOC'] = df['NOC'].fillna(df['Team'].map({
"United States":"USA",
"Japan" :"JPN",
"Greece":"GRE",
"Italy":"ITA",
"Bulgaria": 'BUL'
}))
In [24]:
df.head()
Out[24]:
| ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | 175.0 | 70.0 | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | 175.0 | 70.0 | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres |
In [ ]: