Statement of Completion
Data Cleaning with Pandas
Difficulty: easy
Cleaning duplicate data from an Online Retail store
Resolution
Activities
Importing the libraries¶
In [1]:
# pandas is used to read and manipulate the dataset
import pandas as pd
Reading the dataset as a dataframe¶
In [49]:
# Read the dataset into a dataframe
df = pd.read_csv('OnlineRetail.csv')
In [50]:
df.head()
Out[50]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Discovering the dataset¶
In [51]:
# Show the number of cells in the dataframe
print("dataset size: ", df.size)
# Show the number of records (rows) in the dataframe
print("number of records: ", len(df))
# Show the number of features (columns) in the dataframe
print("number of features: ", len(df.columns))
dataset size:  4335272
number of records:  541909
number of features:  8
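As a quick cross-check (not part of the original resolution), the same row and column counts can also be read from the shape attribute:
# (rows, columns) as a single tuple
print("shape: ", df.shape)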
In [52]:
df.dtypes
Out[52]:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate            object
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object
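For a single summary of column dtypes together with non-null counts and memory usage, info() can be used as well; a minimal sketch, not part of the original resolution:
# Print dtypes, non-null counts, and memory usage in one call
df.info()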
Basic Activities¶
Let's start with a few simple activities for discovering duplicate rows and dropping them. The short sketch below illustrates the two methods used throughout: duplicated() and drop_duplicates().
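A minimal sketch on a small toy dataframe (hypothetical data, unrelated to the store dataset) showing how the keep argument changes which rows are flagged or kept:
import pandas as pd

toy = pd.DataFrame({'InvoiceNo': ['A', 'A', 'B'], 'Quantity': [1, 1, 2]})

# duplicated(keep='first') flags every repeat after the first occurrence
print(toy.duplicated(keep='first'))   # False, True, False

# duplicated(keep=False) flags every member of a duplicated group
print(toy.duplicated(keep=False))     # True, True, False

# drop_duplicates(keep='first') keeps only the first row of each group
print(toy.drop_duplicates(keep='first'))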
4- Select duplicate rows in a dataframe from the dataset¶
In [53]:
# Create a new DataFrame with only the duplicate rows
duplicate_rows = df[df.duplicated(keep='first')]
duplicate_rows
Out[53]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 517 | 536409 | 21866 | UNION JACK FLAG LUGGAGE TAG | 1 | 12/1/2010 11:45 | 1.25 | 17908.0 | United Kingdom |
| 527 | 536409 | 22866 | HAND WARMER SCOTTY DOG DESIGN | 1 | 12/1/2010 11:45 | 2.10 | 17908.0 | United Kingdom |
| 537 | 536409 | 22900 | SET 2 TEA TOWELS I LOVE LONDON | 1 | 12/1/2010 11:45 | 2.95 | 17908.0 | United Kingdom |
| 539 | 536409 | 22111 | SCOTTIE DOG HOT WATER BOTTLE | 1 | 12/1/2010 11:45 | 4.95 | 17908.0 | United Kingdom |
| 555 | 536412 | 22327 | ROUND SNACK BOXES SET OF 4 SKULLS | 1 | 12/1/2010 11:49 | 2.95 | 17920.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541675 | 581538 | 22068 | BLACK PIRATE TREASURE CHEST | 1 | 12/9/2011 11:34 | 0.39 | 14446.0 | United Kingdom |
| 541689 | 581538 | 23318 | BOX OF 6 MINI VINTAGE CRACKERS | 1 | 12/9/2011 11:34 | 2.49 | 14446.0 | United Kingdom |
| 541692 | 581538 | 22992 | REVOLVER WOODEN RULER | 1 | 12/9/2011 11:34 | 1.95 | 14446.0 | United Kingdom |
| 541699 | 581538 | 22694 | WICKER STAR | 1 | 12/9/2011 11:34 | 2.10 | 14446.0 | United Kingdom |
| 541701 | 581538 | 23343 | JUMBO BAG VINTAGE CHRISTMAS | 1 | 12/9/2011 11:34 | 2.08 | 14446.0 | United Kingdom |
5268 rows × 8 columns
5- How many duplicate rows in the dataset?¶
In [54]:
# Count the duplicate rows (the first occurrence of each group is not counted)
df.duplicated().sum()
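This count matches the 5268 duplicate rows listed in activity 4, since the default keep='first' does not flag the first occurrence of each group. Counting with keep=False instead would also include those first occurrences (a side note, not part of the original resolution):
# Count every member of each duplicated group, first occurrences included
print(df.duplicated(keep=False).sum())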
6- Find and drop duplicate rows based on InvoiceNo, StockCode, Quantity, and UnitPrice columns¶
In [55]:
# Keep the first occurrence of each (InvoiceNo, StockCode, Quantity, UnitPrice) combination
df_without_duplicate_orders = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', 'Quantity', 'UnitPrice'], inplace=False)
7- Drop duplicates while keeping the first non-NaN value based on InvoiceNo, StockCode, and CustomerID columns¶
In [56]:
df_keep_first = df.drop_duplicates(subset=["InvoiceNo","StockCode","CustomerID"],keep='first')
df_keep_first
Out[56]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 12/9/2011 12:50 | 4.95 | 12680.0 | France |
531225 rows × 8 columns
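Note that keep='first' keeps the first occurrence of each (InvoiceNo, StockCode, CustomerID) group regardless of missing values. If the intent is to prefer a row with a non-missing Description within each group, one possible approach is sketched below; it assumes Description is the column whose missing values should be avoided, which is an assumption rather than part of the resolution above:
# Stable-sort rows with a missing Description to the back, keep the first row per group,
# then restore the original row order
df_keep_first_non_nan = (
    df.assign(_desc_missing=df['Description'].isna())
      .sort_values('_desc_missing', kind='stable')
      .drop_duplicates(subset=['InvoiceNo', 'StockCode', 'CustomerID'], keep='first')
      .drop(columns='_desc_missing')
      .sort_index()
)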
8- Drop duplicates while keeping the last order based on StockCode and InvoiceWeekday columns¶
In [57]:
# Work on a copy so the original dataframe is left untouched
df_weekday = df.copy()
# Convert the InvoiceDate column to datetime
df_weekday['InvoiceDate'] = pd.to_datetime(df_weekday['InvoiceDate'])
# Create a new column with the weekday (0 = Monday, 6 = Sunday)
df_weekday['InvoiceWeekday'] = df_weekday['InvoiceDate'].dt.weekday
# Drop duplicates based on StockCode and InvoiceWeekday, keeping the last occurrence
df_unique_stock_day = df_weekday.drop_duplicates(subset=['StockCode', 'InvoiceWeekday'], keep='last')
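An optional sanity check (not part of the original resolution) that each (StockCode, InvoiceWeekday) pair now appears only once:
# Should print False: no remaining duplicate (StockCode, InvoiceWeekday) pairs
print(df_unique_stock_day.duplicated(subset=['StockCode', 'InvoiceWeekday']).any())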
Continue Activities¶
9- Drop all duplicate invoices¶
In [63]:
# keep=False removes every row of any invoice number that appears more than once
df_unique_invoices = df.drop_duplicates(subset=['InvoiceNo'], keep=False, inplace=False)
10- Drop duplicate countries while keeping the first row¶
In [65]:
# Keep the first row seen for each country
df_unique_countries = df.drop_duplicates(subset=['Country'], keep='first')
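If only the distinct country names are needed rather than one full row per country, unique() is a simpler alternative (a side note, not part of the original resolution):
# Distinct country names as an array
countries = df['Country'].unique()
print(countries)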
11- Drop duplicate products while keeping last based on StockCode, Description, and UnitPrice¶
In [67]:
# Keep the last occurrence of each (StockCode, Description, UnitPrice) combination
df_unique_products = df.drop_duplicates(subset=['StockCode', 'Description', 'UnitPrice'], keep='last')
12- Drop all duplicate rows based on TotalCost and CustomerID while keeping first¶
In [72]:
# Work on a copy so the original dataframe is left untouched
df_cost = df.copy()
# Create a new column for the total cost of each line: quantity times unit price
df_cost['TotalCost'] = df_cost['Quantity'] * df_cost['UnitPrice']
# Keep the first occurrence of each (TotalCost, CustomerID) combination
df_customer_unique_payments = df_cost.drop_duplicates(subset=['TotalCost', 'CustomerID'], keep='first')
13- Drop all duplicate rows while keeping first¶
In [74]:
# Keep the first occurrence of every fully duplicated row
df_unique = df.drop_duplicates(keep='first')
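As a final sanity check (not part of the original resolution), the number of rows removed here should equal the duplicate count from activity 5:
# Rows removed by full-row deduplication should match df.duplicated().sum()
print(len(df) - len(df_unique), df.duplicated().sum())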