Statement of Completion#9d4757e3
Data Cleaning with Pandas
hard
Data Cleaning Using the NASA Exoplanet Archive Dataset
Resolution
Activities
Project.ipynb
Data Cleaning Using the NASA Exoplanet Archive Dataset¶
In [96]:
import numpy as np
import pandas as pd
Understanding the Dataset¶
In [97]:
df = pd.read_csv('data/Planets list.csv')
In [98]:
df.head()
Out[98]:
| pl_hostname | pl_letter | pl_name | pl_discmethod | pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | ... | st_masslim | st_massn | st_rad | st_raderr1 | st_raderr2 | st_radlim | st_radn | pl_nnotes | rowupdate | pl_facility | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HD 180314 | b | HD 180314 b | Radial Velocity | 0 | 1 | 396.0300 | 0.6200 | -0.6200 | 0.0 | ... | 0.0 | 4 | 8.13 | 0.30 | -0.30 | 0.0 | 4 | 0 | 2018-09-06 | Subaru Telescope |
| 1 | HD 180902 | b | HD 180902 b | Radial Velocity | 0 | 1 | 510.9000 | 1.5000 | -1.5000 | 0.0 | ... | 0.0 | 5 | 4.16 | NaN | NaN | 0.0 | 5 | 0 | 2018-12-06 | W. M. Keck Observatory |
| 2 | HD 181342 | b | HD 181342 b | Radial Velocity | 0 | 1 | 564.1000 | 4.1000 | -4.1000 | 0.0 | ... | 0.0 | 6 | 4.71 | NaN | NaN | 0.0 | 6 | 0 | 2018-12-06 | W. M. Keck Observatory |
| 3 | HD 181433 | b | HD 181433 b | Radial Velocity | 0 | 3 | 9.3743 | 0.0019 | -0.0019 | 0.0 | ... | 0.0 | 5 | 0.79 | 0.01 | -0.01 | 0.0 | 4 | 0 | 2018-09-06 | La Silla Observatory |
| 4 | HD 181433 | c | HD 181433 c | Radial Velocity | 0 | 3 | 962.0000 | 15.0000 | -15.0000 | 0.0 | ... | 0.0 | 5 | 0.79 | 0.01 | -0.01 | 0.0 | 4 | 0 | 2014-07-23 | La Silla Observatory |
5 rows × 82 columns
In [99]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4367 entries, 0 to 4366 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 pl_hostname 4367 non-null object 1 pl_letter 4367 non-null object 2 pl_name 4367 non-null object 3 pl_discmethod 4367 non-null object 4 pl_controvflag 4367 non-null int64 5 pl_pnum 4367 non-null int64 6 pl_orbper 4223 non-null float64 7 pl_orbpererr1 4046 non-null float64 8 pl_orbpererr2 4046 non-null float64 9 pl_orbperlim 4244 non-null float64 10 pl_orbpern 4367 non-null int64 11 pl_orbsmax 2618 non-null float64 12 pl_orbsmaxerr1 1811 non-null float64 13 pl_orbsmaxerr2 1810 non-null float64 14 pl_orbsmaxlim 2686 non-null float64 15 pl_orbsmaxn 4367 non-null int64 16 pl_orbeccen 1592 non-null float64 17 pl_orbeccenerr1 1101 non-null float64 18 pl_orbeccenerr2 1101 non-null float64 19 pl_orbeccenlim 1694 non-null float64 20 pl_orbeccenn 4367 non-null int64 21 pl_orbincl 1001 non-null float64 22 pl_orbinclerr1 964 non-null float64 23 pl_orbinclerr2 966 non-null float64 24 pl_orbincllim 1134 non-null float64 25 pl_orbincln 4367 non-null int64 26 pl_bmassj 1846 non-null float64 27 pl_bmassjerr1 1695 non-null float64 28 pl_bmassjerr2 1695 non-null float64 29 pl_bmassjlim 1938 non-null float64 30 pl_bmassn 4367 non-null int64 31 pl_bmassprov 1846 non-null object 32 pl_radj 3350 non-null float64 33 pl_radjerr1 3269 non-null float64 34 pl_radjerr2 3269 non-null float64 35 pl_radjlim 3417 non-null float64 36 pl_radn 4367 non-null int64 37 pl_dens 595 non-null float64 38 pl_denserr1 574 non-null float64 39 pl_denserr2 574 non-null float64 40 pl_denslim 777 non-null float64 41 pl_densn 4367 non-null int64 42 pl_ttvflag 4367 non-null int64 43 pl_kepflag 4367 non-null int64 44 pl_k2flag 4367 non-null int64 45 ra_str 4367 non-null object 46 dec_str 4367 non-null object 47 ra 4367 non-null float64 48 st_raerr 4131 non-null float64 49 dec 4367 non-null float64 50 st_decerr 4131 non-null float64 51 st_posn 4367 non-null int64 52 st_dist 4357 non-null float64 53 st_disterr1 4281 non-null float64 54 st_disterr2 4281 non-null float64 55 st_distlim 4357 non-null float64 56 st_distn 4367 non-null int64 57 st_optmag 4118 non-null float64 58 st_optmagerr 1090 non-null float64 59 st_optmaglim 4118 non-null float64 60 st_optband 4118 non-null object 61 gaia_gmag 4188 non-null float64 62 gaia_gmagerr 0 non-null float64 63 gaia_gmaglim 4188 non-null float64 64 st_teff 4082 non-null float64 65 st_tefferr1 3913 non-null float64 66 st_tefferr2 3886 non-null float64 67 st_tefflim 4120 non-null float64 68 st_teffn 4367 non-null int64 69 st_mass 3557 non-null float64 70 st_masserr1 3314 non-null float64 71 st_masserr2 3288 non-null float64 72 st_masslim 3602 non-null float64 73 st_massn 4367 non-null int64 74 st_rad 3968 non-null float64 75 st_raderr1 3842 non-null float64 76 st_raderr2 3817 non-null float64 77 st_radlim 4008 non-null float64 78 st_radn 4367 non-null int64 79 pl_nnotes 4367 non-null int64 80 rowupdate 4367 non-null object 81 pl_facility 4367 non-null object dtypes: float64(54), int64(18), object(10) memory usage: 2.7+ MB
In [100]:
df.describe()
Out[100]:
| pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | pl_orbpern | pl_orbsmax | pl_orbsmaxerr1 | pl_orbsmaxerr2 | ... | st_masserr1 | st_masserr2 | st_masslim | st_massn | st_rad | st_raderr1 | st_raderr2 | st_radlim | st_radn | pl_nnotes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4367.000000 | 4367.000000 | 4.223000e+03 | 4.046000e+03 | 4.046000e+03 | 4244.000000 | 4367.000000 | 2618.000000 | 1811.000000 | 1810.000000 | ... | 3314.000000 | 3288.000000 | 3602.0 | 4367.000000 | 3968.000000 | 3842.00000 | 3817.000000 | 4008.0 | 4367.000000 | 4367.000000 |
| mean | 0.005496 | 1.785894 | 2.036725e+03 | 9.410318e+02 | -9.259957e+02 | -0.000471 | 2.692695 | 8.672421 | 0.398626 | -0.372598 | ... | 0.113283 | -0.108449 | 0.0 | 2.343943 | 1.541066 | 0.26018 | -0.207165 | 0.0 | 4.301122 | 0.613236 |
| std | 0.073938 | 1.170802 | 1.123716e+05 | 5.738417e+04 | 5.738314e+04 | 0.037601 | 1.672416 | 117.067273 | 5.314798 | 5.307586 | ... | 0.675172 | 0.677060 | 0.0 | 1.860910 | 3.886987 | 2.53607 | 1.903771 | 0.0 | 2.170420 | 0.540557 |
| min | 0.000000 | 1.000000 | 9.070629e-02 | 0.000000e+00 | -3.650000e+06 | -1.000000 | 0.000000 | 0.004400 | 0.000000 | -200.000000 | ... | 0.000000 | -22.960000 | 0.0 | 0.000000 | 0.010000 | 0.00000 | -73.630000 | 0.0 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 4.464214e+00 | 1.602000e-05 | -1.347500e-03 | 0.000000 | 1.000000 | 0.056000 | 0.000825 | -0.038000 | ... | 0.030000 | -0.080000 | 0.0 | 1.000000 | 0.790000 | 0.03000 | -0.140000 | 0.0 | 3.000000 | 0.000000 |
| 50% | 0.000000 | 1.000000 | 1.172012e+01 | 1.000000e-04 | -1.000000e-04 | 0.000000 | 3.000000 | 0.115000 | 0.003000 | -0.003000 | ... | 0.050000 | -0.050000 | 0.0 | 2.000000 | 0.970000 | 0.07000 | -0.060000 | 0.0 | 4.000000 | 1.000000 |
| 75% | 0.000000 | 2.000000 | 4.142397e+01 | 1.300000e-03 | -1.600000e-05 | 0.000000 | 3.000000 | 0.608367 | 0.037000 | -0.000880 | ... | 0.090000 | -0.040000 | 0.0 | 3.000000 | 1.250000 | 0.22000 | -0.030000 | 0.0 | 6.000000 | 1.000000 |
| max | 1.000000 | 8.000000 | 7.300000e+06 | 3.650000e+06 | -0.000000e+00 | 1.000000 | 20.000000 | 3500.000000 | 200.000000 | 0.000000 | ... | 22.960000 | -0.000000 | 0.0 | 17.000000 | 83.800000 | 104.53000 | -0.000000 | 0.0 | 21.000000 | 4.000000 |
8 rows × 72 columns
1. Find 5 nearest star systems using st_dist¶
In [101]:
# Sort by distance (ascending) and select top 5
closest_systems = df.sort_values('st_dist')[:5]
closest_systems
Out[101]:
| pl_hostname | pl_letter | pl_name | pl_discmethod | pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | ... | st_masslim | st_massn | st_rad | st_raderr1 | st_raderr2 | st_radlim | st_radn | pl_nnotes | rowupdate | pl_facility | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2523 | Proxima Cen | b | Proxima Cen b | Radial Velocity | 0 | 1 | 11.186 | 0.001 | -0.002 | 0.0 | ... | 0.0 | 2 | 0.14 | 0.02 | -0.02 | 0.0 | 2 | 0 | 2016-08-25 | European Southern Observatory |
| 4365 | GJ 411 | b | GJ 411 b | Radial Velocity | 0 | 1 | 12.946 | 0.005 | -0.005 | 0.0 | ... | 0.0 | 3 | 0.39 | 0.00 | -0.00 | 0.0 | 3 | 2 | 2021-03-04 | Haute-Provence Observatory |
| 375 | eps Eri | b | eps Eri b | Radial Velocity | 0 | 1 | 2690.000 | 30.000 | -30.000 | 0.0 | ... | 0.0 | 6 | NaN | NaN | NaN | 0.0 | 3 | 1 | 2020-09-03 | Multiple Observatories |
| 3685 | GJ 887 | b | GJ 887 b | Radial Velocity | 0 | 2 | 9.262 | 0.001 | -0.001 | 0.0 | ... | 0.0 | 2 | 0.47 | 0.09 | -0.09 | 0.0 | 2 | 1 | 2020-07-09 | La Silla Observatory |
| 3686 | GJ 887 | c | GJ 887 c | Radial Velocity | 0 | 2 | 21.789 | 0.004 | -0.005 | 0.0 | ... | 0.0 | 2 | 0.47 | 0.09 | -0.09 | 0.0 | 2 | 1 | 2020-07-09 | La Silla Observatory |
5 rows × 82 columns
2. Identify 3 hottest star systems using st_teff¶
In [102]:
# Sort by temperature (descending) and select top 3
# Sort by temperature (descending) and select top 3
hottest_systems = df.sort_values('st_teff', ascending = False)[:3]
3. What are the shortest and longest orbital periods among all planets in the dataset, as represented by the pl_orbper column?¶
In [103]:
# Find minimum and maximum orbital periods
4. Rank top 5 facilities by planet discoveries¶
In [104]:
# Count planets per facility and get top 5
top_facilities = df["pl_facility"].value_counts().head(5)
top_facilities
Out[104]:
pl_facility Kepler 2394 K2 426 La Silla Observatory 247 W. M. Keck Observatory 177 Multiple Observatories 157 Name: count, dtype: int64
5. Identify Missing Data in pl_orbeccen¶
In [105]:
# Count missing values in pl_orbeccen
df["pl_orbeccen"].isnull().sum()
Out[105]:
2775
6. Fill Missing pl_bmassj with Median¶
In [106]:
# Fill missing values with median
df.loc[:,'pl_bmassj'] = df['pl_bmassj'].fillna(df['pl_bmassj'].median())
7. Drop Rows with Missing st_teff¶
In [107]:
# Drop rows with missing st_teff
df = df.dropna(subset = "st_teff")
8. Investigate Missing Data Patterns in pl_radj and pl_bmassj¶
In [108]:
df[['pl_radj', 'pl_bmassj']].isnull().sum()
Out[108]:
pl_radj 805 pl_bmassj 0 dtype: int64
9. Handle Missing pl_orbper with Forward Fill¶
In [109]:
# Apply forward fill
df.loc[:,"pl_orbper"]=df['pl_orbper'].ffill()
10. Convert pl_orbper to Integer Days¶
In [110]:
# Convert pl_orbper to integer days
df["orbit_days"] = df["pl_orbper"].round().astype("int")
# Show the first few rows of the new column
df["orbit_days"].head(5)
Out[110]:
0 396 1 511 2 564 3 9 4 962 Name: orbit_days, dtype: int64
11. Create Yes/No Labels for pl_kepflag¶
In [111]:
# Convert pl_kepflag to Yes/No labels
df["kepler_detected"] = df['pl_kepflag'].map({1: "Yes", 0: "No"})
# Display the counts of Yes and No
df["kepler_detected"].head()
Out[111]:
0 No 1 No 2 No 3 No 4 No Name: kepler_detected, dtype: object
12. Extract Year from rowupdate¶
In [112]:
# Convert rowupdate to datetime and extract the year
df['update_year'] = pd.to_datetime(df['rowupdate']).dt.year.astype("int64")
# Show the range of years
print(df['update_year'].max(), df['update_year'].min())
2021 2014
13. Simplify pl_facility Names¶
In [113]:
# Extract the first word from pl_facility
df['facility_short'] = df['pl_facility'].str.split(" ").str[0]
# Show the most common short facility names
df['facility_short'].value_counts().head()
Out[113]:
facility_short Kepler 2356 K2 397 La 232 W. 166 Multiple 132 Name: count, dtype: int64
14. Spot Negative Orbits in pl_orbper¶
In [114]:
# Check for negative values in pl_orbper
df.loc[df['pl_orbper'] <0]
Out[114]:
| pl_hostname | pl_letter | pl_name | pl_discmethod | pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | ... | st_raderr2 | st_radlim | st_radn | pl_nnotes | rowupdate | pl_facility | orbit_days | kepler_detected | update_year | facility_short |
|---|
0 rows × 86 columns
15. Validate Planet Masses in pl_bmassj¶
In [115]:
# Find planets with negative masses
df.loc[df['pl_bmassj'] <0]
Out[115]:
| pl_hostname | pl_letter | pl_name | pl_discmethod | pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | ... | st_raderr2 | st_radlim | st_radn | pl_nnotes | rowupdate | pl_facility | orbit_days | kepler_detected | update_year | facility_short |
|---|
0 rows × 86 columns
16. Check pl_discmethod Categories¶
In [116]:
# List of expected discovery methods
ok_methods = ['Transit', 'Radial Velocity', 'Imaging', "Microlensing", "Astrometry"]
# Check for unexpected discovery methods
unexpected_methods = df.loc[~df['pl_discmethod'].isin(ok_methods)]
17. Find Duplicate Planet Entries¶
In [117]:
# Check for duplicate planet names
df[df['pl_name'].duplicated()]
Out[117]:
| pl_hostname | pl_letter | pl_name | pl_discmethod | pl_controvflag | pl_pnum | pl_orbper | pl_orbpererr1 | pl_orbpererr2 | pl_orbperlim | ... | st_raderr2 | st_radlim | st_radn | pl_nnotes | rowupdate | pl_facility | orbit_days | kepler_detected | update_year | facility_short |
|---|
0 rows × 86 columns
18. Validate Star Temperature Range in st_teff¶
In [118]:
# Check for star temperatures outside the expected range
min_temp = 2000
max_temp = 40000
invalid_temps = df.loc[(df['st_teff']<min_temp) | (df["st_teff"] > max_temp)]
19. Categorize Planets by Size¶
In [119]:
# Create planet size categories
def categorize_size(radius):
if pd.isna(radius):
return "Unknown"
elif radius < 0.5:
return "Small"
elif (radius < 1.5) and (radius >= 0.5):
return "Medium"
elif radius >= 1.5:
return "Large"
df['planet_size'] = df['pl_radj'].apply(categorize_size)
20. Rename Columns for Clarity¶
In [124]:
# Rename columns
df = df.rename(columns = {"pl_hostname":"star_name", "pl_discmethod": "discovery_method", "pl_orbper":"orbital_period_days"})
21. Create a Simplified Dataset¶
In [127]:
# Select a subset of columns
simple_df = df[["pl_name", "star_name", "discovery_method", "orbital_period_days", "pl_radj","pl_bmassj"]]
22. Classify Orbital Periods¶
In [129]:
# Classify orbital periods
def classify_orbit(period):
if pd.isna(period):
return "Unknown"
elif period < 10:
return "Short"
elif period < 100:
return "Medium"
else:
return "Long"
df['orbit_type'] = df['orbital_period_days'].apply(classify_orbit)
23. Create a Binary Flag for Multi-Planet Systems¶
In [135]:
# Count planets per star
pl_count = df["star_name"].value_counts()
# Create binary flag for multi-planet systems
df['multi_planet_system'] = df['star_name'].map(pl_count > 1)
# Display the count of single and multi-planet systems
df['multi_planet_system'].value_counts()
Out[135]:
multi_planet_system False 2338 True 1744 Name: count, dtype: int64
In [ ]: