Laith Saeed has successfully completed this project.

Data Cleaning with Pandas

hard

4.63

Data Cleaning Using the NASA Exoplanet Archive Dataset

Finished

July 27, 2025 3:24 AM

Elapsed time (min)

Completed activities

Resolution

Activities

Project.ipynb

Notebook

Data Cleaning Using the NASA Exoplanet Archive Dataset¶

In [96]:

import numpy as np
import pandas as pd

Understanding the Dataset¶

In [97]:

df = pd.read_csv('data/Planets list.csv')

In [98]:

df.head()

Out[98]:

	pl_hostname	pl_letter	pl_name	pl_discmethod	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	...	st_massn	st_rad	st_raderr1	st_raderr2	st_radn	rowupdate	pl_facility
0	HD 180314	b	HD 180314 b	Radial Velocity	1	396.0300	0.6200	-0.6200	...	4	8.13	0.30	-0.30	4	2018-09-06	Subaru Telescope
1	HD 180902	b	HD 180902 b	Radial Velocity	1	510.9000	1.5000	-1.5000	...	5	4.16	NaN	NaN	5	2018-12-06	W. M. Keck Observatory
2	HD 181342	b	HD 181342 b	Radial Velocity	1	564.1000	4.1000	-4.1000	...	6	4.71	NaN	NaN	6	2018-12-06	W. M. Keck Observatory
3	HD 181433	b	HD 181433 b	Radial Velocity	3	9.3743	0.0019	-0.0019	...	5	0.79	0.01	-0.01	4	2018-09-06	La Silla Observatory
4	HD 181433	c	HD 181433 c	Radial Velocity	3	962.0000	15.0000	-15.0000	...	5	0.79	0.01	-0.01	4	2014-07-23	La Silla Observatory

5 rows × 82 columns

In [99]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4367 entries, 0 to 4366
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pl_hostname      4367 non-null   object 
 1   pl_letter        4367 non-null   object 
 2   pl_name          4367 non-null   object 
 3   pl_discmethod    4367 non-null   object 
 4   pl_controvflag   4367 non-null   int64  
 5   pl_pnum          4367 non-null   int64  
 6   pl_orbper        4223 non-null   float64
 7   pl_orbpererr1    4046 non-null   float64
 8   pl_orbpererr2    4046 non-null   float64
 9   pl_orbperlim     4244 non-null   float64
 10  pl_orbpern       4367 non-null   int64  
 11  pl_orbsmax       2618 non-null   float64
 12  pl_orbsmaxerr1   1811 non-null   float64
 13  pl_orbsmaxerr2   1810 non-null   float64
 14  pl_orbsmaxlim    2686 non-null   float64
 15  pl_orbsmaxn      4367 non-null   int64  
 16  pl_orbeccen      1592 non-null   float64
 17  pl_orbeccenerr1  1101 non-null   float64
 18  pl_orbeccenerr2  1101 non-null   float64
 19  pl_orbeccenlim   1694 non-null   float64
 20  pl_orbeccenn     4367 non-null   int64  
 21  pl_orbincl       1001 non-null   float64
 22  pl_orbinclerr1   964 non-null    float64
 23  pl_orbinclerr2   966 non-null    float64
 24  pl_orbincllim    1134 non-null   float64
 25  pl_orbincln      4367 non-null   int64  
 26  pl_bmassj        1846 non-null   float64
 27  pl_bmassjerr1    1695 non-null   float64
 28  pl_bmassjerr2    1695 non-null   float64
 29  pl_bmassjlim     1938 non-null   float64
 30  pl_bmassn        4367 non-null   int64  
 31  pl_bmassprov     1846 non-null   object 
 32  pl_radj          3350 non-null   float64
 33  pl_radjerr1      3269 non-null   float64
 34  pl_radjerr2      3269 non-null   float64
 35  pl_radjlim       3417 non-null   float64
 36  pl_radn          4367 non-null   int64  
 37  pl_dens          595 non-null    float64
 38  pl_denserr1      574 non-null    float64
 39  pl_denserr2      574 non-null    float64
 40  pl_denslim       777 non-null    float64
 41  pl_densn         4367 non-null   int64  
 42  pl_ttvflag       4367 non-null   int64  
 43  pl_kepflag       4367 non-null   int64  
 44  pl_k2flag        4367 non-null   int64  
 45  ra_str           4367 non-null   object 
 46  dec_str          4367 non-null   object 
 47  ra               4367 non-null   float64
 48  st_raerr         4131 non-null   float64
 49  dec              4367 non-null   float64
 50  st_decerr        4131 non-null   float64
 51  st_posn          4367 non-null   int64  
 52  st_dist          4357 non-null   float64
 53  st_disterr1      4281 non-null   float64
 54  st_disterr2      4281 non-null   float64
 55  st_distlim       4357 non-null   float64
 56  st_distn         4367 non-null   int64  
 57  st_optmag        4118 non-null   float64
 58  st_optmagerr     1090 non-null   float64
 59  st_optmaglim     4118 non-null   float64
 60  st_optband       4118 non-null   object 
 61  gaia_gmag        4188 non-null   float64
 62  gaia_gmagerr     0 non-null      float64
 63  gaia_gmaglim     4188 non-null   float64
 64  st_teff          4082 non-null   float64
 65  st_tefferr1      3913 non-null   float64
 66  st_tefferr2      3886 non-null   float64
 67  st_tefflim       4120 non-null   float64
 68  st_teffn         4367 non-null   int64  
 69  st_mass          3557 non-null   float64
 70  st_masserr1      3314 non-null   float64
 71  st_masserr2      3288 non-null   float64
 72  st_masslim       3602 non-null   float64
 73  st_massn         4367 non-null   int64  
 74  st_rad           3968 non-null   float64
 75  st_raderr1       3842 non-null   float64
 76  st_raderr2       3817 non-null   float64
 77  st_radlim        4008 non-null   float64
 78  st_radn          4367 non-null   int64  
 79  pl_nnotes        4367 non-null   int64  
 80  rowupdate        4367 non-null   object 
 81  pl_facility      4367 non-null   object 
dtypes: float64(54), int64(18), object(10)
memory usage: 2.7+ MB

In [100]:

df.describe()

Out[100]:

	pl_controvflag	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	pl_orbperlim	pl_orbpern	pl_orbsmax	pl_orbsmaxerr1	pl_orbsmaxerr2	...	st_masserr1	st_masserr2	st_masslim	st_massn	st_rad	st_raderr1	st_raderr2	st_radlim	st_radn	pl_nnotes
count	4367.000000	4367.000000	4.223000e+03	4.046000e+03	4.046000e+03	4244.000000	4367.000000	2618.000000	1811.000000	1810.000000	...	3314.000000	3288.000000	3602.0	4367.000000	3968.000000	3842.00000	3817.000000	4008.0	4367.000000	4367.000000
mean	0.005496	1.785894	2.036725e+03	9.410318e+02	-9.259957e+02	-0.000471	2.692695	8.672421	0.398626	-0.372598	...	0.113283	-0.108449	0.0	2.343943	1.541066	0.26018	-0.207165	0.0	4.301122	0.613236
std	0.073938	1.170802	1.123716e+05	5.738417e+04	5.738314e+04	0.037601	1.672416	117.067273	5.314798	5.307586	...	0.675172	0.677060	0.0	1.860910	3.886987	2.53607	1.903771	0.0	2.170420	0.540557
min	0.000000	1.000000	9.070629e-02	0.000000e+00	-3.650000e+06	-1.000000	0.000000	0.004400	0.000000	-200.000000	...	0.000000	-22.960000	0.0	0.000000	0.010000	0.00000	-73.630000	0.0	0.000000	0.000000
25%	0.000000	1.000000	4.464214e+00	1.602000e-05	-1.347500e-03	0.000000	1.000000	0.056000	0.000825	-0.038000	...	0.030000	-0.080000	0.0	1.000000	0.790000	0.03000	-0.140000	0.0	3.000000	0.000000
50%	0.000000	1.000000	1.172012e+01	1.000000e-04	-1.000000e-04	0.000000	3.000000	0.115000	0.003000	-0.003000	...	0.050000	-0.050000	0.0	2.000000	0.970000	0.07000	-0.060000	0.0	4.000000	1.000000
75%	0.000000	2.000000	4.142397e+01	1.300000e-03	-1.600000e-05	0.000000	3.000000	0.608367	0.037000	-0.000880	...	0.090000	-0.040000	0.0	3.000000	1.250000	0.22000	-0.030000	0.0	6.000000	1.000000
max	1.000000	8.000000	7.300000e+06	3.650000e+06	-0.000000e+00	1.000000	20.000000	3500.000000	200.000000	0.000000	...	22.960000	-0.000000	0.0	17.000000	83.800000	104.53000	-0.000000	0.0	21.000000	4.000000

8 rows × 72 columns

1. Find 5 nearest star systems using `st_dist`¶

In [101]:

# Sort by distance (ascending) and select top 5
closest_systems = df.sort_values('st_dist')[:5]
closest_systems

Out[101]:

	pl_hostname	pl_letter	pl_name	pl_discmethod	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	...	st_massn	st_rad	st_raderr1	st_raderr2	st_radn	pl_nnotes	rowupdate	pl_facility
2523	Proxima Cen	b	Proxima Cen b	Radial Velocity	1	11.186	0.001	-0.002	...	2	0.14	0.02	-0.02	2	0	2016-08-25	European Southern Observatory
4365	GJ 411	b	GJ 411 b	Radial Velocity	1	12.946	0.005	-0.005	...	3	0.39	0.00	-0.00	3	2	2021-03-04	Haute-Provence Observatory
375	eps Eri	b	eps Eri b	Radial Velocity	1	2690.000	30.000	-30.000	...	6	NaN	NaN	NaN	3	1	2020-09-03	Multiple Observatories
3685	GJ 887	b	GJ 887 b	Radial Velocity	2	9.262	0.001	-0.001	...	2	0.47	0.09	-0.09	2	1	2020-07-09	La Silla Observatory
3686	GJ 887	c	GJ 887 c	Radial Velocity	2	21.789	0.004	-0.005	...	2	0.47	0.09	-0.09	2	1	2020-07-09	La Silla Observatory

5 rows × 82 columns

2. Identify 3 hottest star systems using `st_teff`¶

In [102]:

# Sort by temperature (descending) and select top 3
# Sort by temperature (descending) and select top 3
hottest_systems = df.sort_values('st_teff', ascending = False)[:3]

3. What are the shortest and longest orbital periods among all planets in the dataset, as represented by the `pl_orbper` column?¶

In [103]:

# Find minimum and maximum orbital periods

4. Rank top 5 facilities by planet discoveries¶

In [104]:

# Count planets per facility and get top 5
top_facilities = df["pl_facility"].value_counts().head(5)
top_facilities

Out[104]:

pl_facility
Kepler                    2394
K2                         426
La Silla Observatory       247
W. M. Keck Observatory     177
Multiple Observatories     157
Name: count, dtype: int64

5. Identify Missing Data in `pl_orbeccen`¶

In [105]:

# Count missing values in pl_orbeccen
df["pl_orbeccen"].isnull().sum()

Out[105]:

6. Fill Missing `pl_bmassj` with Median¶

In [106]:

# Fill missing values with median
df.loc[:,'pl_bmassj'] = df['pl_bmassj'].fillna(df['pl_bmassj'].median())

7. Drop Rows with Missing `st_teff`¶

In [107]:

# Drop rows with missing st_teff
df = df.dropna(subset = "st_teff")

8. Investigate Missing Data Patterns in `pl_radj` and `pl_bmassj`¶

In [108]:

df[['pl_radj', 'pl_bmassj']].isnull().sum()

Out[108]:

pl_radj      805
pl_bmassj      0
dtype: int64

9. Handle Missing `pl_orbper` with Forward Fill¶

In [109]:

# Apply forward fill
df.loc[:,"pl_orbper"]=df['pl_orbper'].ffill()

10. Convert `pl_orbper` to Integer Days¶

In [110]:

# Convert pl_orbper to integer days
df["orbit_days"] = df["pl_orbper"].round().astype("int")

# Show the first few rows of the new column
df["orbit_days"].head(5)

Out[110]:

0    396
1    511
2    564
3      9
4    962
Name: orbit_days, dtype: int64

11. Create Yes/No Labels for `pl_kepflag`¶

In [111]:

# Convert pl_kepflag to Yes/No labels
df["kepler_detected"] = df['pl_kepflag'].map({1: "Yes", 0: "No"})

# Display the counts of Yes and No
df["kepler_detected"].head()

Out[111]:

0    No
1    No
2    No
3    No
4    No
Name: kepler_detected, dtype: object

12. Extract Year from `rowupdate`¶

In [112]:

# Convert rowupdate to datetime and extract the year
df['update_year'] = pd.to_datetime(df['rowupdate']).dt.year.astype("int64")
# Show the range of years
print(df['update_year'].max(), df['update_year'].min())

2021 2014

13. Simplify `pl_facility` Names¶

In [113]:

# Extract the first word from pl_facility
df['facility_short'] = df['pl_facility'].str.split(" ").str[0]
# Show the most common short facility names
df['facility_short'].value_counts().head()

Out[113]:

facility_short
Kepler      2356
K2           397
La           232
W.           166
Multiple     132
Name: count, dtype: int64

14. Spot Negative Orbits in `pl_orbper`¶

In [114]:

# Check for negative values in pl_orbper
df.loc[df['pl_orbper'] <0]

Out[114]:

	pl_hostname	pl_letter	pl_name	pl_discmethod	pl_controvflag	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	pl_orbperlim	...	st_raderr2	st_radlim	st_radn	pl_nnotes	rowupdate	pl_facility	orbit_days	kepler_detected	update_year	facility_short

0 rows × 86 columns

15. Validate Planet Masses in `pl_bmassj`¶

In [115]:

# Find planets with negative masses
df.loc[df['pl_bmassj'] <0]

Out[115]:

	pl_hostname	pl_letter	pl_name	pl_discmethod	pl_controvflag	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	pl_orbperlim	...	st_raderr2	st_radlim	st_radn	pl_nnotes	rowupdate	pl_facility	orbit_days	kepler_detected	update_year	facility_short

0 rows × 86 columns

16. Check `pl_discmethod` Categories¶

In [116]:

# List of expected discovery methods
ok_methods = ['Transit', 'Radial Velocity', 'Imaging', "Microlensing", "Astrometry"]
# Check for unexpected discovery methods
unexpected_methods = df.loc[~df['pl_discmethod'].isin(ok_methods)]

17. Find Duplicate Planet Entries¶

In [117]:

# Check for duplicate planet names
df[df['pl_name'].duplicated()]

Out[117]:

	pl_hostname	pl_letter	pl_name	pl_discmethod	pl_controvflag	pl_pnum	pl_orbper	pl_orbpererr1	pl_orbpererr2	pl_orbperlim	...	st_raderr2	st_radlim	st_radn	pl_nnotes	rowupdate	pl_facility	orbit_days	kepler_detected	update_year	facility_short

0 rows × 86 columns

18. Validate Star Temperature Range in `st_teff`¶

In [118]:

# Check for star temperatures outside the expected range
min_temp = 2000
max_temp = 40000
invalid_temps = df.loc[(df['st_teff']<min_temp) | (df["st_teff"] > max_temp)]

19. Categorize Planets by Size¶

In [119]:

# Create planet size categories
def categorize_size(radius):
    if pd.isna(radius):
        return "Unknown"
    elif radius < 0.5: 
        return "Small"
    elif (radius < 1.5) and (radius >= 0.5):
        return "Medium"
    elif radius >= 1.5: 
        return "Large"

df['planet_size'] = df['pl_radj'].apply(categorize_size)

20. Rename Columns for Clarity¶

In [124]:

# Rename columns
df = df.rename(columns = {"pl_hostname":"star_name", "pl_discmethod": "discovery_method", "pl_orbper":"orbital_period_days"})

21. Create a Simplified Dataset¶

In [127]:

# Select a subset of columns
simple_df = df[["pl_name", "star_name", "discovery_method", "orbital_period_days", "pl_radj","pl_bmassj"]]

22. Classify Orbital Periods¶

In [129]:

# Classify orbital periods
def classify_orbit(period):
    if pd.isna(period): 
        return "Unknown"
    elif period < 10: 
        return "Short"
    elif period < 100: 
        return "Medium"
    else: 
        return "Long"

df['orbit_type'] = df['orbital_period_days'].apply(classify_orbit)

23. Create a Binary Flag for Multi-Planet Systems¶

In [135]:

# Count planets per star
pl_count = df["star_name"].value_counts()
# Create binary flag for multi-planet systems
df['multi_planet_system'] = df['star_name'].map(pl_count > 1)
# Display the count of single and multi-planet systems
df['multi_planet_system'].value_counts()

Out[135]:

multi_planet_system
False    2338
True     1744
Name: count, dtype: int64

In [ ]:

Statement of Completion#9d4757e3

Data Cleaning with Pandas

Data Cleaning Using the NASA Exoplanet Archive Dataset

Data Cleaning Using the NASA Exoplanet Archive Dataset¶

Understanding the Dataset¶

1. Find 5 nearest star systems using st_dist¶

2. Identify 3 hottest star systems using st_teff¶

3. What are the shortest and longest orbital periods among all planets in the dataset, as represented by the pl_orbper column?¶

4. Rank top 5 facilities by planet discoveries¶

5. Identify Missing Data in pl_orbeccen¶

6. Fill Missing pl_bmassj with Median¶

7. Drop Rows with Missing st_teff¶

8. Investigate Missing Data Patterns in pl_radj and pl_bmassj¶

9. Handle Missing pl_orbper with Forward Fill¶

10. Convert pl_orbper to Integer Days¶

11. Create Yes/No Labels for pl_kepflag¶

12. Extract Year from rowupdate¶

13. Simplify pl_facility Names¶

14. Spot Negative Orbits in pl_orbper¶

15. Validate Planet Masses in pl_bmassj¶

16. Check pl_discmethod Categories¶

17. Find Duplicate Planet Entries¶

18. Validate Star Temperature Range in st_teff¶

19. Categorize Planets by Size¶

20. Rename Columns for Clarity¶

21. Create a Simplified Dataset¶

22. Classify Orbital Periods¶

23. Create a Binary Flag for Multi-Planet Systems¶

1. Find 5 nearest star systems using `st_dist`¶

2. Identify 3 hottest star systems using `st_teff`¶

3. What are the shortest and longest orbital periods among all planets in the dataset, as represented by the `pl_orbper` column?¶

5. Identify Missing Data in `pl_orbeccen`¶

6. Fill Missing `pl_bmassj` with Median¶

7. Drop Rows with Missing `st_teff`¶

8. Investigate Missing Data Patterns in `pl_radj` and `pl_bmassj`¶

9. Handle Missing `pl_orbper` with Forward Fill¶

10. Convert `pl_orbper` to Integer Days¶

11. Create Yes/No Labels for `pl_kepflag`¶

12. Extract Year from `rowupdate`¶

13. Simplify `pl_facility` Names¶

14. Spot Negative Orbits in `pl_orbper`¶

15. Validate Planet Masses in `pl_bmassj`¶

16. Check `pl_discmethod` Categories¶

18. Validate Star Temperature Range in `st_teff`¶