Statement of Completion#ea6d83b1
Data Cleaning with Pandas
easy
Data Cleaning Capstone: Cleaning NYC Airbnb Data
Resolution
Activities
Project.ipynb
Importing the libraries¶
In [1]:
#to read the dataframe
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
Reading the dataset as a dataframe¶
In [2]:
# Reading dataframe
df = pd.read_csv('AB_NYC.csv')
In [3]:
df.head()
Out[3]:
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149.0 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225.0 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150.0 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89.0 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80.0 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
Discovering the dataset¶
In [4]:
#show the number of cells in the dataframe
print("dataset size: ", df.size)
#show the number of records (rows) in the dataframe
print("number of hosts: ", len(df))
#show the number of features (coulmns) in the dataframe
print("number of features: ", len(df.columns))
dataset size: 782320 number of hosts: 48895 number of features: 16
In [ ]:
df.dtypes
Activities¶
Let's clean the dataset.
1. Find out the missing values in each column¶
In [5]:
col_missing_values = df.isna().sum()
2. Drop the column reviews_per_month
as it has many values missing and we will not use it¶
In [7]:
df = df.drop(columns='reviews_per_month',axis=1)
3. Drop the rows having more than 1 missing value¶
In [9]:
df_rows_dropped = df.dropna(thresh=df.shape[1]-1)
4. Fill the 21 missing values in host_name
with the value Airbnb
¶
In [11]:
host_total = df['host_name'].fillna('Airbnb')
5. Check if any name in the Column host_name has digit(s) or number(s) in it¶
In [15]:
df['host_name'].str.isdigit().sum()
df['host_name'].str.isalpha().sum()
Out[15]:
6
6. Fill the 2 missing values of the column price
with the mean of the column¶
In [16]:
mean_price = df['price'].mean()
mean_df_price = df['price'].fillna(mean_price)
7. Fill all missing values in last_review
using forward filling method¶
In [18]:
ffill_review = df['last_review'].fillna(method='ffill')
8. Select duplicate hosts in a dataframe based on name
, host_id
, and price
columns¶
In [21]:
duplicate_hosts = df[df.duplicated(subset=['name', 'host_id', 'price'], keep=False)]
10. Drop duplicates while keeping the first non-NaN value based on name
, host_id
, and price
columns¶
In [23]:
df_unique_hosts = df.drop_duplicates(subset=['name', 'host_id', 'price'],keep='first')
11. How many users in the Column room_type
are Private room
?¶
In [27]:
private_room_counts = df['room_type'].str.count('Private room')
private_room_counts = private_room_counts.sum()
private_room_counts
Out[27]:
22326
12. Find the words in Column name
which contain the substring park
¶
In [32]:
names_having_park =df['name'].str.contains('park')
names_having_park.sum()
Out[32]:
337
13. Replace the neighbourhood having Kitchen
with Restaurant
¶
In [33]:
kitchen_to_restaurant =df['neighbourhood'].str.replace('Kitchen','Restaurant')
14. Split the strings in the room_type
column at
(space) to find whether it is room or home/apt¶
In [36]:
roomOrhome = df['room_type'].str.split(' ').str[1]
15. Clean the column availability_365
by removing invalid values¶
In [38]:
df_invalid_availability_365 = df.loc[df['availability_365']== 0]
17. Identify outliers in minimum_nights
column¶
In [40]:
Nights_plot = df['minimum_nights'].plot(kind='hist')
Nights_plot.set_yscale('log')
Nights_plot.set_xlabel('minimum_nights')
Nights_plot.set_ylabel('Frequency')
Out[40]:
Text(0, 0.5, 'Frequency')
In [41]:
df_nights = df.copy()
df_nights['Min_Nights_cleaned'] = ((df_nights['minimum_nights'] - df_nights['minimum_nights'].mean()) / df_nights['minimum_nights'].std()).abs() >= 43
18. Identify outliers in price
column¶
In [43]:
df_Price = df.copy()
Price_plot = df_Price['price'].plot(kind='box')
Price_plot.set_yscale('log')
In [45]:
>>> Q1 = df_Price['price'].quantile(0.25)
>>> Q3 = df_Price['price'].quantile(0.75)
>>> IQR = Q3 - Q1
>>> df_Price['Price_cleaned'] = ((df_Price['price'] < Q1 - 1.5 * IQR) | (df_Price['price'] > Q3 + 1.5 * IQR))
In [ ]: