Joel Talemwa has successfully completed this project.

Data Cleaning with Pandas

easy

4.55

Data Cleaning Capstone: Cleaning NYC Airbnb Data

Finished

May 4, 2025 7:58 AM

Elapsed time (min)

Completed activities

Resolution

Activities

Project.ipynb

Notebook

Importing the libraries¶

In [1]:

#to read the dataframe
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Reading the dataset as a dataframe¶

In [2]:

# Reading dataframe
df = pd.read_csv('AB_NYC.csv')

In [3]:

df.head()

Out[3]:

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365
0	2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149.0	1	9	2018-10-19	0.21	6	365
1	2595	Skylit Midtown Castle	2845	Jennifer	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225.0	1	45	2019-05-21	0.38	2	355
2	3647	THE VILLAGE OF HARLEM....NEW YORK !	4632	Elisabeth	Manhattan	Harlem	40.80902	-73.94190	Private room	150.0	3	0	NaN	NaN	1	365
3	3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89.0	1	270	2019-07-05	4.64	1	194
4	5022	Entire Apt: Spacious Studio/Loft by central park	7192	Laura	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80.0	10	9	2018-11-19	0.10	1	0

Discovering the dataset¶

In [4]:

#show the number of cells in the dataframe
print("dataset size: ", df.size)

#show the number of records (rows) in the dataframe
print("number of hosts: ", len(df))

#show the number of features (coulmns) in the dataframe
print("number of features: ", len(df.columns))

dataset size:  782320
number of hosts:  48895
number of features:  16

In [ ]:

df.dtypes

Activities¶

Let's clean the dataset.

1. Find out the missing values in each column¶

In [5]:

col_missing_values = df.isna().sum()

2. Drop the column `reviews_per_month` as it has many values missing and we will not use it¶

In [7]:

df = df.drop(columns='reviews_per_month',axis=1)

3. Drop the rows having more than 1 missing value¶

In [9]:

df_rows_dropped = df.dropna(thresh=df.shape[1]-1)

4. Fill the 21 missing values in `host_name` with the value `Airbnb`¶

In [11]:

host_total = df['host_name'].fillna('Airbnb')

5. Check if any name in the Column host_name has digit(s) or number(s) in it¶

In [15]:

df['host_name'].str.isdigit().sum()
df['host_name'].str.isalpha().sum()

Out[15]:

6. Fill the 2 missing values of the column `price` with the mean of the column¶

In [16]:

mean_price = df['price'].mean()
mean_df_price = df['price'].fillna(mean_price)

7. Fill all missing values in `last_review` using forward filling method¶

In [18]:

ffill_review = df['last_review'].fillna(method='ffill')

8. Select duplicate hosts in a dataframe based on `name`, `host_id`, and `price` columns¶

In [21]:

duplicate_hosts = df[df.duplicated(subset=['name', 'host_id', 'price'], keep=False)]

10. Drop duplicates while keeping the first non-NaN value based on `name`, `host_id`, and `price` columns¶

In [23]:

df_unique_hosts = df.drop_duplicates(subset=['name', 'host_id', 'price'],keep='first')

11. How many users in the Column `room_type` are `Private room`?¶

In [27]:

private_room_counts = df['room_type'].str.count('Private room')
private_room_counts = private_room_counts.sum()
private_room_counts

Out[27]:

12. Find the words in Column `name` which contain the substring `park`¶

In [32]:

names_having_park =df['name'].str.contains('park')
names_having_park.sum()

Out[32]:

13. Replace the neighbourhood having `Kitchen` with `Restaurant`¶

In [33]:

kitchen_to_restaurant =df['neighbourhood'].str.replace('Kitchen','Restaurant')

14. Split the strings in the `room_type` column at (space) to find whether it is room or home/apt¶

In [36]:

roomOrhome = df['room_type'].str.split(' ').str[1]

15. Clean the column `availability_365` by removing invalid values¶

In [38]:

df_invalid_availability_365 = df.loc[df['availability_365']== 0]

17. Identify outliers in `minimum_nights` column¶

In [40]:

Nights_plot = df['minimum_nights'].plot(kind='hist')
Nights_plot.set_yscale('log')
Nights_plot.set_xlabel('minimum_nights')
Nights_plot.set_ylabel('Frequency')

Out[40]:

Text(0, 0.5, 'Frequency')

No description has been provided for this image

In [41]:

df_nights = df.copy()
df_nights['Min_Nights_cleaned'] = ((df_nights['minimum_nights'] - df_nights['minimum_nights'].mean()) /  df_nights['minimum_nights'].std()).abs() >= 43

18. Identify outliers in `price` column¶

In [43]:

df_Price = df.copy()
Price_plot = df_Price['price'].plot(kind='box')
Price_plot.set_yscale('log')

In [45]:

>>> Q1 = df_Price['price'].quantile(0.25)
>>> Q3 = df_Price['price'].quantile(0.75)
>>> IQR = Q3 - Q1
>>> df_Price['Price_cleaned'] = ((df_Price['price'] < Q1 - 1.5 * IQR) | (df_Price['price'] > Q3 + 1.5 * IQR))

In [ ]:

Statement of Completion#ea6d83b1

Data Cleaning with Pandas

Data Cleaning Capstone: Cleaning NYC Airbnb Data

Importing the libraries¶

Reading the dataset as a dataframe¶

Discovering the dataset¶

Activities¶

1. Find out the missing values in each column¶

2. Drop the column reviews_per_month as it has many values missing and we will not use it¶

3. Drop the rows having more than 1 missing value¶

4. Fill the 21 missing values in host_name with the value Airbnb¶

5. Check if any name in the Column host_name has digit(s) or number(s) in it¶

6. Fill the 2 missing values of the column price with the mean of the column¶

7. Fill all missing values in last_review using forward filling method¶

8. Select duplicate hosts in a dataframe based on name, host_id, and price columns¶

10. Drop duplicates while keeping the first non-NaN value based on name, host_id, and price columns¶

11. How many users in the Column room_type are Private room?¶

12. Find the words in Column name which contain the substring park¶

13. Replace the neighbourhood having Kitchen with Restaurant¶

14. Split the strings in the room_type column at (space) to find whether it is room or home/apt¶

15. Clean the column availability_365 by removing invalid values¶

17. Identify outliers in minimum_nights column¶

18. Identify outliers in price column¶