Data Cleaning with Pandas
Capstone Project: Cleaning and Preparing Book Sales Data
A Data Cleaning Capstone Project for Book Sales Analysis¶
In [1]:
import numpy as np
import pandas as pd
import random
import re
Exploration¶
In [2]:
df = pd.read_csv('data/book_sales.csv')
In [3]:
df.head()
Out[3]:
| | Unnamed: 0 | book_id | title | author | publication_date | price | genre | rating | sales | in_stock | isbn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Book 1 | Alice Brown | 1990-01-01 00:00:00.000000000 | 32.319232 | Fiction | 1.0 | 3110 | no | NaN |
| 1 | 1 | 2 | Book 2 | NaN | 1990-01-07 05:04:42.741370685 | 29.601384 | Mystery | 2.0 | 2166 | No | ISBN-0000001 |
| 2 | 2 | 3 | Book 3 | Bob Johnson | 1990-01-13 10:09:25.482741370 | 15.437619 | Mystery | 2.0 | 8526 | No | ISBN-0000002 |
| 3 | 3 | 4 | Book 4 | NaN | 1990-01-19 15:14:08.224112056 | 22.590770 | NaN | NaN | 9551 | No | ISBN-0000003 |
| 4 | 4 | 5 | Book 5 | NaN | 1990-01-25 20:18:50.965482741 | -23.742729 | Sci-fi | 4.0 | 5193 | no | ISBN-0000004 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Unnamed: 0        2200 non-null   int64
 1   book_id           2200 non-null   int64
 2   title             2200 non-null   object
 3   author            1748 non-null   object
 4   publication_date  2169 non-null   object
 5   price             2044 non-null   float64
 6   genre             1823 non-null   object
 7   rating            1845 non-null   float64
 8   sales             2200 non-null   int64
 9   in_stock          1874 non-null   object
 10  isbn              1939 non-null   object
dtypes: float64(2), int64(3), object(6)
memory usage: 189.2+ KB
In [5]:
df.describe()
Out[5]:
| | Unnamed: 0 | book_id | price | rating | sales |
|---|---|---|---|---|---|
| count | 2200.000000 | 2200.000000 | 2044.000000 | 1845.000000 | 2200.000000 |
| mean | 1099.500000 | 1100.500000 | 21.765138 | 2.998916 | 4818.034545 |
| std | 635.229617 | 635.229617 | 20.234364 | 1.395297 | 2959.692739 |
| min | 0.000000 | 1.000000 | -49.549807 | 1.000000 | -100.000000 |
| 25% | 549.750000 | 550.750000 | 11.292324 | 2.000000 | 2320.750000 |
| 50% | 1099.500000 | 1100.500000 | 24.449520 | 3.000000 | 4692.500000 |
| 75% | 1649.250000 | 1650.250000 | 36.862812 | 4.000000 | 7346.000000 |
| max | 2199.000000 | 2200.000000 | 49.925638 | 5.000000 | 9998.000000 |
1. Replace Missing Authors in author Column¶
In [6]:
# Replace missing authors with "Anonymous"
df['author'] = df['author'].fillna("Anonymous")
2. Generate Custom ISBNs for Missing Entries in isbn Column¶
In [8]:
# Create custom ISBNs for books with missing ones
df['isbn'] = df.apply(lambda row: f"CUSTOM-{row['book_id']:05d}" if pd.isnull(row['isbn']) else row['isbn'], axis=1)
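The row-wise `apply` above works, but the same fill can be done vectorized: build the fallback codes as a Series, then fill only the gaps. A sketch on a toy frame (the column names mirror the real data):

```python
import pandas as pd

# Toy stand-in for the book data
df = pd.DataFrame({'book_id': [1, 2, 3],
                   'isbn': ['ISBN-0000001', None, None]})

# Build a fallback ISBN for every row, then fill only the missing slots
custom = 'CUSTOM-' + df['book_id'].astype(str).str.zfill(5)
df['isbn'] = df['isbn'].fillna(custom)
```

`fillna` aligns on the index, so rows that already have an ISBN keep it untouched.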
3. Categorize Missing Genres as 'Unclassified' in genre Column¶
In [10]:
# Replace missing genres with "Unclassified"
df['genre'] = df['genre'].fillna("Unclassified")
4. Impute Missing rating Values Using Genre Averages¶
In [12]:
# Calculate average rating by genre
genre_avg_rating = df.groupby('genre')['rating'].mean()
# Function to assign rating based on genre average
def assign_rating(row):
if pd.isnull(row['rating']):
return genre_avg_rating.get(row['genre'], df['rating'].mean())
return row['rating']
# Apply the function to fill missing ratings
df['rating'] = df.apply(assign_rating, axis=1)
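The same imputation can be written without a Python-level loop using `groupby().transform`; a sketch on toy data (the second `fillna` is the global-mean backstop that the function above provides via `genre_avg_rating.get`):

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Fiction', 'Mystery', 'Mystery'],
                   'rating': [2.0, None, 4.0, None]})

# Fill each gap with its genre's mean rating...
df['rating'] = df['rating'].fillna(df.groupby('genre')['rating'].transform('mean'))
# ...and fall back to the overall mean for genres with no ratings at all
df['rating'] = df['rating'].fillna(df['rating'].mean())
```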
5. Standardize publication_date and Fill Missing Values¶
In [14]:
# Convert publication_date to datetime
df['publication_date'] = pd.to_datetime(df['publication_date'], errors='coerce')
# Find the earliest date in the dataset
earliest_date = df['publication_date'].min()
# Replace missing dates with the earliest date
df['publication_date'] = df['publication_date'].fillna(earliest_date)
df['publication_date'].describe()
Out[14]:
count                             2200
mean     2006-09-06 04:46:31.454636288
min                1990-01-01 00:00:00
25%      1998-01-15 12:52:45.982991360
50%      2006-07-19 21:25:07.353676800
75%      2015-04-30 15:12:52.586293248
max                2023-12-31 00:00:00
Name: publication_date, dtype: object
6. Remove Duplicate Books Using title and author Columns¶
In [16]:
# Remove duplicates, keeping the first occurrence
df_cleaned = df.drop_duplicates(subset=['title', 'author'], keep='first')
7. Deduplicate Entries Using Lowest book_id for Identical Books¶
In [18]:
# Sort by 'book_id', remove duplicates based on 'title', 'author', and 'isbn', keeping the first occurrence
df_cleaned = df.sort_values('book_id').drop_duplicates(subset=['title', 'author', 'isbn'], keep='first')
8. Consolidate Duplicate Books Using Median price and Aggregate Data¶
In [20]:
# Group by title, author, and isbn, then keep the median price
df_cleaned = df.groupby(['title', 'author', 'isbn']).agg({
'price': 'median',
'book_id': 'first',
'publication_date': 'first',
'genre': 'first',
'rating': 'mean',
'sales': 'sum'
}).reset_index()
9. Merge Multi-Genre Books and Aggregate Related Data¶
In [21]:
# Group by title, author, and isbn, then combine genres
df_cleaned = df.groupby(['title', 'author', 'isbn']).agg({
'genre': lambda x: ', '.join(set(x)),
'book_id': 'first',
'publication_date': 'first',
'price': 'mean',
'rating': 'mean',
'sales': 'sum'
}).reset_index()
10. Create Catchy Slogans for title Column Using Random Phrases¶
In [23]:
# Set the random seed for reproducibility
random.seed(42)
# Define a list of catchy marketing phrases
marketing_phrases = ["Must-Read!", "Unputdownable!", "Life-Changing!", "Epic Tale!", "Mind-Blowing!"]
# Add a random marketing phrase to the existing title
df['slogan'] = df['title'] + ' - ' + df['title'].apply(lambda x: random.choice(marketing_phrases))
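`random.choice` inside `apply` works, but NumPy's generator can draw all the phrases in a single call; a sketch on a toy titles Series (not the notebook's exact output, since the RNGs differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
titles = pd.Series(['Book 1', 'Book 2', 'Book 3'])
phrases = ["Must-Read!", "Unputdownable!", "Life-Changing!"]

# One vectorized draw of len(titles) phrases, concatenated onto the titles
slogans = titles + ' - ' + rng.choice(phrases, size=len(titles))
```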
11. Extract Hidden Message from isbn Column¶
In [27]:
# Extract characters at specific positions from isbn
df['hidden_message'] = df['isbn'].str[4] + df['isbn'].str[7] + df['isbn'].str[10]
# Replace NaN values with "No message"
df['hidden_message'] = df['hidden_message'].fillna("No message")
# Display the first few rows with hidden messages
df[['isbn', 'hidden_message']].head(10)
Out[27]:
| | isbn | hidden_message |
|---|---|---|
| 0 | CUSTOM-00001 | O00 |
| 1 | ISBN-0000001 | -00 |
| 2 | ISBN-0000002 | -00 |
| 3 | ISBN-0000003 | -00 |
| 4 | ISBN-0000004 | -00 |
| 5 | ISBN-0000005 | -00 |
| 6 | ISBN-0000006 | -00 |
| 7 | ISBN-0000007 | -00 |
| 8 | ISBN-0000008 | -00 |
| 9 | ISBN-0000009 | -00 |
12. Determine Publication Decade from publication_date Column¶
In [29]:
# Extract year from publication_date
df['year'] = df['publication_date'].dt.year
# Create decade column
df['decade'] = (df['year'] // 10 * 10).astype(str) + 's'
# Replace NaN values with "Unknown"
df['decade'] = df['decade'].fillna("Unknown")
# Display the distribution of books by decade
df['decade'].value_counts().sort_index()
Out[29]:
decade
1990s    678
2000s    640
2010s    621
2020s    261
Name: count, dtype: int64
13. Generate Author Pseudonyms Using author and genre Columns¶
In [31]:
# Set a fixed random seed
random.seed(42)
# Split author name and get first name
df['first_name'] = df['author'].str.split().str[0]
# Get first three letters of genre
df['genre_prefix'] = df['genre'].str[:3].str.upper()
# Generate random numbers using a list comprehension
df['random_num'] = [random.randint(10, 99) for _ in range(len(df))]
# Combine to create pseudonyms
df['pseudonym'] = df['first_name'] + df['genre_prefix'] + df['random_num'].astype(str)
# Replace NaN values with "Anonymous"
df['pseudonym'] = df['pseudonym'].fillna("Anonymous")
# Clean up temporary columns
df = df.drop(['first_name', 'genre_prefix', 'random_num'], axis=1)
# Display some examples
df[['author', 'genre', 'pseudonym']].head(10)
Out[31]:
| | author | genre | pseudonym |
|---|---|---|---|
| 0 | Alice Brown | Fiction | AliceFIC91 |
| 1 | Anonymous | Mystery | AnonymousMYS24 |
| 2 | Bob Johnson | Mystery | BobMYS13 |
| 3 | Anonymous | Unclassified | AnonymousUNC45 |
| 4 | Anonymous | Sci-fi | AnonymousSCI41 |
| 5 | Jane Smith | Mystery | JaneMYS38 |
| 6 | Bob Johnson | Fiction | BobFIC27 |
| 7 | Bob Johnson | Sci-fi | BobSCI23 |
| 8 | Bob Johnson | Non-Fiction | BobNON96 |
| 9 | Anonymous | Mystery | AnonymousMYS79 |
14. The Literary Diversity Challenge¶
In [33]:
# Work on a copy so df stays untouched
df_copy = df.copy()
# Count the distinct letters in each title (case-insensitive, ignoring spaces)
df_copy['unique_letter_count'] = df_copy['title'].str.lower().str.replace(' ', '').apply(lambda x: len(set(x)))
# Row whose title uses the most distinct letters
most_unique = df_copy.loc[df_copy['unique_letter_count'].idxmax()]
15. Adjust Negative Prices in price Column Using Genre Averages¶
In [34]:
# Calculate mean prices by genre
genre_mean_prices = df[df['price'] > 0].groupby('genre')['price'].mean()
# Function to replace negative prices
def replace_negative_price(row):
if row['price'] < 0:
return genre_mean_prices.get(row['genre'], df['price'].mean())
return row['price']
# Apply the function
df['price'] = df.apply(replace_negative_price, axis=1)
# Verify results
df[df['price'] < 0] # Should return an empty DataFrame
Out[34]:
| | Unnamed: 0 | book_id | title | author | publication_date | price | genre | rating | sales | in_stock | isbn | slogan | hidden_message | year | decade | pseudonym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
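The row-wise replacement function can also be expressed vectorized with `where`/`mask` and a groupby transform; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Fiction', 'Sci-fi'],
                   'price': [10.0, -5.0, 20.0]})

# Per-genre mean of the positive prices, broadcast back to every row
genre_mean = df['price'].where(df['price'] > 0).groupby(df['genre']).transform('mean')
# Swap in that mean wherever the price is negative
df['price'] = df['price'].mask(df['price'] < 0, genre_mean)
```

`where` turns the negative prices into NaN so they don't drag the genre means down, which matches the `df['price'] > 0` filter used above.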
16. Standardize in_stock Column to Boolean Values¶
In [36]:
# Define mapping for various representations
stock_mapping = {
'Y': True, 'y': True, 'Yes': True, 'yes': True,
'N': False, 'n': False, 'No': False, 'no': False
}
# Apply mapping and convert to boolean
df['in_stock'] = df['in_stock'].map(stock_mapping).astype('boolean')
# Fill NaN values with False (assuming missing means not in stock)
df['in_stock'] = df['in_stock'].fillna(False)
# Verify unique values
df['in_stock'].unique()
Out[36]:
<BooleanArray>
[False, True]
Length: 2, dtype: boolean
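Lower-casing the column first shrinks the mapping to two keys per meaning and also catches variants like "YES" that the explicit dictionary above would miss; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['Yes', 'no', 'Y', None, 'N'])

# Normalise case once, then one small mapping covers every capitalisation
in_stock = s.str.lower().map({'yes': True, 'y': True, 'no': False, 'n': False})
# Nullable boolean dtype, with missing treated as not in stock
in_stock = in_stock.astype('boolean').fillna(False)
```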
17. Extract Year from publication_date to New Column¶
In [38]:
# Extract year and create a new column
df['publication_year'] = df['publication_date'].dt.year
# Display the first few rows to verify the changes
df[['publication_date', 'publication_year']].head()
Out[38]:
| | publication_date | publication_year |
|---|---|---|
| 0 | 1990-01-01 00:00:00.000000000 | 1990 |
| 1 | 1990-01-07 05:04:42.741370685 | 1990 |
| 2 | 1990-01-13 10:09:25.482741370 | 1990 |
| 3 | 1990-01-19 15:14:08.224112056 | 1990 |
| 4 | 1990-01-25 20:18:50.965482741 | 1990 |
18. Categorizing Book Genres¶
In [ ]:
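This activity was left unanswered in the notebook. One plausible reading is converting genre to pandas' category dtype, which stores each distinct genre label once and is well suited to a column with a small fixed set of values; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Mystery', 'Fiction', 'Sci-fi']})

# Categorical storage: one entry per distinct genre, integer codes per row
df['genre'] = df['genre'].astype('category')
```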
19. Which Data Type Best Suits the isbn Column?¶
In [ ]:
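Also unanswered here. Since ISBNs are identifiers rather than numbers (leading zeros matter and arithmetic is meaningless), pandas' nullable string dtype is a natural fit; a sketch:

```python
import pandas as pd

s = pd.Series(['ISBN-0000001', 'CUSTOM-00002'])

# Dedicated string dtype keeps ISBNs as text and supports missing values cleanly
s = s.astype('string')
```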
20. Mystery of the Missing Authors¶
In [ ]:
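Unanswered as well. One way to investigate the formerly missing authors (now filled with "Anonymous" in step 1) is to count the placeholder per genre; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'author': ['Anonymous', 'Alice Brown', 'Anonymous'],
                   'genre': ['Mystery', 'Fiction', 'Sci-fi']})

# How many books per genre ended up attributed to "Anonymous"?
anon_by_genre = df[df['author'] == 'Anonymous'].groupby('genre').size()
```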
21. The ISBN Cleanup Challenge¶
In [ ]:
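Unanswered. Assuming the challenge asks for normalizing inconsistent ISBN formatting (the exact rules are a guess), a regex-based sketch: uppercase everything and collapse spaces or underscores into a single hyphen:

```python
import pandas as pd

s = pd.Series(['ISBN-0000001', 'isbn 0000002', 'Custom-00003'])

# Uppercase, then turn any run of whitespace/underscores into one hyphen
cleaned = s.str.upper().str.replace(r'[\s_]+', '-', regex=True)
```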