In [31]:

import numpy as np
import pandas as pd

In [32]:

import warnings
warnings.filterwarnings('ignore')

In [33]:

df = pd.read_csv('athlete_events_final.csv')

In [34]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271111 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB

In [35]:

df.describe()

Out[35]:

	ID	Age	Height	Weight	Year
count	271116.000000	261642.000000	210945.000000	208241.000000	271116.000000
mean	68248.954396	25.556898	175.338970	70.702393	1978.378480
std	39022.286345	6.393561	10.518462	14.348020	29.877632
min	1.000000	10.000000	127.000000	25.000000	1896.000000
25%	34643.000000	21.000000	168.000000	60.000000	1960.000000
50%	68205.000000	24.000000	175.000000	70.000000	1988.000000
75%	102097.250000	28.000000	183.000000	79.000000	2002.000000
max	135571.000000	97.000000	226.000000	214.000000	2016.000000

1. What is the code that we use to get the count of null values column-wise?¶

In [36]:

df.isnull().sum()

Out[36]:

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            5
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64

2. Find and Store Null Values in Each Column¶

In [37]:

null_count = df.isnull().sum() # Enter your code here

3. Which of the following columns doesn't contain any missing values?¶

In [ ]:

4. What is the column that has the second highest number of null values in the DataFrame `df` ?¶

In [38]:

print(null_count)

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            5
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64

5. Calculate Percentage of Null Values in Each Column¶

In [39]:

null_percentage_df = (df.isnull().sum()/len(df)) * 100 # Enter your code here

6. What is the column that has the highest percentage of missing values in the DataFrame`df`?¶

In [ ]:

7. Which function is used to drop rows that contain missing values?¶

In [ ]:

8. Remove Columns with More Than `50%` Missing Values¶

In [40]:

threshold = len(df) * 0.5
df = df.dropna(axis=1, thresh=threshold)

9. Drop All Rows with Null Values¶

In [41]:

df_no_null = df.dropna() # Enter your code here
df_no_null = df_no_null.reset_index(drop=True)

10. Which of the follwoing imputation method is most suitable for filling missing data in continuous numerical columns?¶

In [ ]:

11. Which of the following is the best imputation to use for normally distributed data without outliers?¶

In [ ]:

12. Which method is used to fill missing values with the mean of a column?¶

In [ ]:

13. What does interpolation do when handling missing data?¶

In [ ]:

14. Fill Missing Values with Median¶

In [42]:

df['Age'].fillna(df['Age'].median(), inplace=True)
df['Height'].fillna(df['Height'].median(), inplace=True)
df['Weight'].fillna(df['Weight'].median(), inplace=True)

You can see below that there are 5 missing values in the NOC (National Olympic Committee) column. Now, let's handle missing values in that column.

Since there are only 5 missing values in that column, we don't need to use imputation techniques unnecessarily. Instead, we can simply enter the NOCs of the teams, which is quite straightforward. The NOC for the United States is USA, Japan is JPN, Greece is GRE, Italy is ITA, and Bulgaria is BUL.

15. Fill Null Values in the `NOC` Column¶

In [46]:

df['NOC'] = df['NOC'].fillna(df['Team'].map({
    'United States': 'USA',
    'Japan': 'JPN',
    'Greece': 'GRE',
    'Italy': 'ITA',
    'Bulgaria': 'BUL'
}))
df

Out[46]:

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event
0	1	A Dijiang	M	24.0	180.0	80.0	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball
1	2	A Lamusi	M	23.0	170.0	60.0	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight
2	3	Gunnar Nielsen Aaby	M	24.0	175.0	70.0	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football
3	4	Edgar Lindenau Aabye	M	34.0	175.0	70.0	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War
4	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
271111	135569	Andrzej ya	M	29.0	179.0	89.0	Poland-1	POL	1976 Winter	1976	Winter	Innsbruck	Luge	Luge Mixed (Men)'s Doubles
271112	135570	Piotr ya	M	27.0	176.0	59.0	Poland	POL	2014 Winter	2014	Winter	Sochi	Ski Jumping	Ski Jumping Men's Large Hill, Individual
271113	135570	Piotr ya	M	27.0	176.0	59.0	Poland	POL	2014 Winter	2014	Winter	Sochi	Ski Jumping	Ski Jumping Men's Large Hill, Team
271114	135571	Tomasz Ireneusz ya	M	30.0	185.0	96.0	Poland	POL	1998 Winter	1998	Winter	Nagano	Bobsleigh	Bobsleigh Men's Four
271115	135571	Tomasz Ireneusz ya	M	34.0	185.0	96.0	Poland	POL	2002 Winter	2002	Winter	Salt Lake City	Bobsleigh	Bobsleigh Men's Four

271116 rows × 14 columns

In [ ]:

Statement of Completion#4e91be43

Data Cleaning with Pandas

Handling Null Values: Athletes Events Data

1. What is the code that we use to get the count of null values column-wise?¶

2. Find and Store Null Values in Each Column¶

3. Which of the following columns doesn't contain any missing values?¶

4. What is the column that has the second highest number of null values in the DataFrame `df` ?¶

5. Calculate Percentage of Null Values in Each Column¶

6. What is the column that has the highest percentage of missing values in the DataFrame`df`?¶

7. Which function is used to drop rows that contain missing values?¶

8. Remove Columns with More Than `50%` Missing Values¶

9. Drop All Rows with Null Values¶

10. Which of the follwoing imputation method is most suitable for filling missing data in continuous numerical columns?¶

11. Which of the following is the best imputation to use for normally distributed data without outliers?¶

12. Which method is used to fill missing values with the mean of a column?¶

13. What does interpolation do when handling missing data?¶

14. Fill Missing Values with Median¶

15. Fill Null Values in the `NOC` Column¶

Statement of Completion#4e91be43

Data Cleaning with Pandas

Handling Null Values: Athletes Events Data

1. What is the code that we use to get the count of null values column-wise?¶

2. Find and Store Null Values in Each Column¶

3. Which of the following columns doesn't contain any missing values?¶

4. What is the column that has the second highest number of null values in the DataFrame df ?¶

5. Calculate Percentage of Null Values in Each Column¶

6. What is the column that has the highest percentage of missing values in the DataFramedf?¶

7. Which function is used to drop rows that contain missing values?¶

8. Remove Columns with More Than 50% Missing Values¶

9. Drop All Rows with Null Values¶

10. Which of the follwoing imputation method is most suitable for filling missing data in continuous numerical columns?¶

11. Which of the following is the best imputation to use for normally distributed data without outliers?¶

12. Which method is used to fill missing values with the mean of a column?¶

13. What does interpolation do when handling missing data?¶

14. Fill Missing Values with Median¶

15. Fill Null Values in the NOC Column¶

4. What is the column that has the second highest number of null values in the DataFrame `df` ?¶

6. What is the column that has the highest percentage of missing values in the DataFrame`df`?¶

8. Remove Columns with More Than `50%` Missing Values¶

15. Fill Null Values in the `NOC` Column¶