Statement of Completion
Data Cleaning with Pandas
Difficulty: easy
Cleaning duplicate data from an Online Retail store
Resolution
Activities
Importing the libraries¶
In [1]:
# pandas is used to read and manipulate the dataset
import pandas as pd
Reading the dataset as a dataframe¶
In [49]:
# Read the dataset into a dataframe
df = pd.read_csv('OnlineRetail.csv')
In [50]:
df.head()
Out[50]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
Discovering the dataset¶
In [51]:
# Show the number of cells in the dataframe
print("dataset size: ", df.size)
# Show the number of records (rows) in the dataframe
print("number of records: ", len(df))
# Show the number of features (columns) in the dataframe
print("number of features: ", len(df.columns))
dataset size:  4335272
number of records:  541909
number of features:  8
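As a quick cross-check (not part of the original resolution), the same row and column counts can also be read from the shape attribute:
# (rows, columns) as a single tuple
print("shape: ", df.shape)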
In [52]:
df.dtypes
Out[52]:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate            object
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object
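For a single summary of column dtypes together with non-null counts and memory usage, info() can be used as well; a minimal sketch, not part of the original resolution:
# Print dtypes, non-null counts, and memory usage in one call
df.info()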
Basic Activities¶
Let's start with a few simple activities for discovering duplicate rows and dropping them. The short sketch below illustrates the two methods used throughout: duplicated() and drop_duplicates().
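A minimal sketch on a small toy dataframe (hypothetical data, unrelated to the store dataset) showing how the keep argument changes which rows are flagged or kept:
import pandas as pd

toy = pd.DataFrame({'InvoiceNo': ['A', 'A', 'B'], 'Quantity': [1, 1, 2]})

# duplicated(keep='first') flags every repeat after the first occurrence
print(toy.duplicated(keep='first'))   # False, True, False

# duplicated(keep=False) flags every member of a duplicated group
print(toy.duplicated(keep=False))     # True, True, False

# drop_duplicates(keep='first') keeps only the first row of each group
print(toy.drop_duplicates(keep='first'))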
4- Select duplicate rows in a dataframe from the dataset¶
In [53]:
# Create a new DataFrame with only the duplicate rows
duplicate_rows = df[df.duplicated(keep='first')]
duplicate_rows
Out[53]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 517 | 536409 | 21866 | UNION JACK FLAG LUGGAGE TAG | 1 | 12/1/2010 11:45 | 1.25 | 17908.0 | United Kingdom |
| 527 | 536409 | 22866 | HAND WARMER SCOTTY DOG DESIGN | 1 | 12/1/2010 11:45 | 2.10 | 17908.0 | United Kingdom |
| 537 | 536409 | 22900 | SET 2 TEA TOWELS I LOVE LONDON | 1 | 12/1/2010 11:45 | 2.95 | 17908.0 | United Kingdom |
| 539 | 536409 | 22111 | SCOTTIE DOG HOT WATER BOTTLE | 1 | 12/1/2010 11:45 | 4.95 | 17908.0 | United Kingdom |
| 555 | 536412 | 22327 | ROUND SNACK BOXES SET OF 4 SKULLS | 1 | 12/1/2010 11:49 | 2.95 | 17920.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541675 | 581538 | 22068 | BLACK PIRATE TREASURE CHEST | 1 | 12/9/2011 11:34 | 0.39 | 14446.0 | United Kingdom |
| 541689 | 581538 | 23318 | BOX OF 6 MINI VINTAGE CRACKERS | 1 | 12/9/2011 11:34 | 2.49 | 14446.0 | United Kingdom |
| 541692 | 581538 | 22992 | REVOLVER WOODEN RULER | 1 | 12/9/2011 11:34 | 1.95 | 14446.0 | United Kingdom |
| 541699 | 581538 | 22694 | WICKER STAR | 1 | 12/9/2011 11:34 | 2.10 | 14446.0 | United Kingdom |
| 541701 | 581538 | 23343 | JUMBO BAG VINTAGE CHRISTMAS | 1 | 12/9/2011 11:34 | 2.08 | 14446.0 | United Kingdom |
5268 rows × 8 columns
5- How many duplicate rows in the dataset?¶
In [54]:
# Count the duplicate rows (the first occurrence of each group is not counted)
df.duplicated().sum()
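This count matches the 5268 duplicate rows listed in activity 4, since the default keep='first' does not flag the first occurrence of each group. Counting with keep=False instead would also include those first occurrences (a side note, not part of the original resolution):
# Count every member of each duplicated group, first occurrences included
print(df.duplicated(keep=False).sum())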
6- Find and drop duplicate rows based on InvoiceNo, StockCode, Quantity, and UnitPrice columns¶
In [55]:
# Keep the first occurrence of each (InvoiceNo, StockCode, Quantity, UnitPrice) combination
df_without_duplicate_orders = df.drop_duplicates(subset=['InvoiceNo', 'StockCode', 'Quantity', 'UnitPrice'], inplace=False)
7- Drop duplicates while keeping the first non-NaN value based on InvoiceNo, StockCode, and CustomerID columns¶
In [56]:
df_keep_first = df.drop_duplicates(subset=["InvoiceNo","StockCode","CustomerID"],keep='first')
df_keep_first
Out[56]:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 12/9/2011 12:50 | 4.95 | 12680.0 | France |
531225 rows × 8 columns
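Note that keep='first' keeps the first occurrence of each (InvoiceNo, StockCode, CustomerID) group regardless of missing values. If the intent is to prefer a row with a non-missing Description within each group, one possible approach is sketched below; it assumes Description is the column whose missing values should be avoided, which is an assumption rather than part of the resolution above:
# Stable-sort rows with a missing Description to the back, keep the first row per group,
# then restore the original row order
df_keep_first_non_nan = (
    df.assign(_desc_missing=df['Description'].isna())
      .sort_values('_desc_missing', kind='stable')
      .drop_duplicates(subset=['InvoiceNo', 'StockCode', 'CustomerID'], keep='first')
      .drop(columns='_desc_missing')
      .sort_index()
)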
8- Drop duplicates while keeping the last order based on StockCode and InvoiceWeekday columns¶
In [57]:
# Work on a copy so the original dataframe is left untouched
df_weekday = df.copy()
# Convert the InvoiceDate column to datetime
df_weekday['InvoiceDate'] = pd.to_datetime(df_weekday['InvoiceDate'])
# Create a new column with the weekday (0 = Monday, 6 = Sunday)
df_weekday['InvoiceWeekday'] = df_weekday['InvoiceDate'].dt.weekday
# Drop duplicates based on StockCode and InvoiceWeekday, keeping the last occurrence
df_unique_stock_day = df_weekday.drop_duplicates(subset=['StockCode', 'InvoiceWeekday'], keep='last')
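An optional sanity check (not part of the original resolution) that each (StockCode, InvoiceWeekday) pair now appears only once:
# Should print False: no remaining duplicate (StockCode, InvoiceWeekday) pairs
print(df_unique_stock_day.duplicated(subset=['StockCode', 'InvoiceWeekday']).any())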
Continue Activities¶
9- Drop all duplicate invoices¶
In [63]:
# keep=False removes every row of any invoice number that appears more than once
df_unique_invoices = df.drop_duplicates(subset=['InvoiceNo'], keep=False, inplace=False)
10- Drop duplicate countries while keeping the first row¶
In [65]:
# Keep the first row seen for each country
df_unique_countries = df.drop_duplicates(subset=['Country'], keep='first')
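If only the distinct country names are needed rather than one full row per country, unique() is a simpler alternative (a side note, not part of the original resolution):
# Distinct country names as an array
countries = df['Country'].unique()
print(countries)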
11- Drop duplicate products while keeping last based on StockCode, Description, and UnitPrice¶
In [67]:
# Keep the last occurrence of each (StockCode, Description, UnitPrice) combination
df_unique_products = df.drop_duplicates(subset=['StockCode', 'Description', 'UnitPrice'], keep='last')
12- Drop all duplicate rows based on TotalCost and CustomerID while keeping first¶
In [72]:
# Work on a copy so the original dataframe is left untouched
df_cost = df.copy()
# Create a new column for the total cost of each line: quantity times unit price
df_cost['TotalCost'] = df_cost['Quantity'] * df_cost['UnitPrice']
# Keep the first occurrence of each (TotalCost, CustomerID) combination
df_customer_unique_payments = df_cost.drop_duplicates(subset=['TotalCost', 'CustomerID'], keep='first')
13- Drop all duplicate rows while keeping first¶
In [74]:
# Keep the first occurrence of every fully duplicated row
df_unique = df.drop_duplicates(keep='first')
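As a final sanity check (not part of the original resolution), the number of rows removed here should equal the duplicate count from activity 5:
# Rows removed by full-row deduplication should match df.duplicated().sum()
print(len(df) - len(df_unique), df.duplicated().sum())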