Data Cleaning with Pandas
Capstone Project: Cleaning and Preparing Book Sales Data
A Data Cleaning Capstone Project for Book Sales Analysis¶
In [1]:
import numpy as np
import pandas as pd
import random
import re
Exploration¶
In [2]:
df = pd.read_csv('data/book_sales.csv')
In [3]:
df.head()
Out[3]:
| | Unnamed: 0 | book_id | title | author | publication_date | price | genre | rating | sales | in_stock | isbn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Book 1 | Alice Brown | 1990-01-01 00:00:00.000000000 | 32.319232 | Fiction | 1.0 | 3110 | no | NaN |
| 1 | 1 | 2 | Book 2 | NaN | 1990-01-07 05:04:42.741370685 | 29.601384 | Mystery | 2.0 | 2166 | No | ISBN-0000001 |
| 2 | 2 | 3 | Book 3 | Bob Johnson | 1990-01-13 10:09:25.482741370 | 15.437619 | Mystery | 2.0 | 8526 | No | ISBN-0000002 |
| 3 | 3 | 4 | Book 4 | NaN | 1990-01-19 15:14:08.224112056 | 22.590770 | NaN | NaN | 9551 | No | ISBN-0000003 |
| 4 | 4 | 5 | Book 5 | NaN | 1990-01-25 20:18:50.965482741 | -23.742729 | Sci-fi | 4.0 | 5193 | no | ISBN-0000004 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Unnamed: 0        2200 non-null   int64
 1   book_id           2200 non-null   int64
 2   title             2200 non-null   object
 3   author            1748 non-null   object
 4   publication_date  2169 non-null   object
 5   price             2044 non-null   float64
 6   genre             1823 non-null   object
 7   rating            1845 non-null   float64
 8   sales             2200 non-null   int64
 9   in_stock          1874 non-null   object
 10  isbn              1939 non-null   object
dtypes: float64(2), int64(3), object(6)
memory usage: 189.2+ KB
In [5]:
df.describe()
Out[5]:
| | Unnamed: 0 | book_id | price | rating | sales |
|---|---|---|---|---|---|
| count | 2200.000000 | 2200.000000 | 2044.000000 | 1845.000000 | 2200.000000 |
| mean | 1099.500000 | 1100.500000 | 21.765138 | 2.998916 | 4818.034545 |
| std | 635.229617 | 635.229617 | 20.234364 | 1.395297 | 2959.692739 |
| min | 0.000000 | 1.000000 | -49.549807 | 1.000000 | -100.000000 |
| 25% | 549.750000 | 550.750000 | 11.292324 | 2.000000 | 2320.750000 |
| 50% | 1099.500000 | 1100.500000 | 24.449520 | 3.000000 | 4692.500000 |
| 75% | 1649.250000 | 1650.250000 | 36.862812 | 4.000000 | 7346.000000 |
| max | 2199.000000 | 2200.000000 | 49.925638 | 5.000000 | 9998.000000 |
1. Replace Missing Authors in author Column¶
In [6]:
# Replace missing authors with "Anonymous"
df['author'] = df['author'].fillna("Anonymous")
2. Generate Custom ISBNs for Missing Entries in isbn Column¶
In [8]:
# Create custom ISBNs for books with missing ones
df['isbn'] = df.apply(lambda row: f"CUSTOM-{row['book_id']:05d}" if pd.isnull(row['isbn']) else row['isbn'], axis=1)
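The row-wise `apply` above works, but the same fill can be done vectorized: build the fallback codes as a Series, then fill only the gaps. A sketch on a toy frame (the column names mirror the real data):

```python
import pandas as pd

# Toy stand-in for the book data
df = pd.DataFrame({'book_id': [1, 2, 3],
                   'isbn': ['ISBN-0000001', None, None]})

# Build a fallback ISBN for every row, then fill only the missing slots
custom = 'CUSTOM-' + df['book_id'].astype(str).str.zfill(5)
df['isbn'] = df['isbn'].fillna(custom)
```

`fillna` aligns on the index, so rows that already have an ISBN keep it untouched.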
3. Categorize Missing Genres as 'Unclassified' in genre Column¶
In [10]:
# Replace missing genres with "Unclassified"
df['genre'] = df['genre'].fillna("Unclassified")
4. Impute Missing rating Values Using Genre Averages¶
In [12]:
# Calculate average rating by genre
genre_avg_rating = df.groupby('genre')['rating'].mean()
# Function to assign rating based on genre average
def assign_rating(row):
if pd.isnull(row['rating']):
return genre_avg_rating.get(row['genre'], df['rating'].mean())
return row['rating']
# Apply the function to fill missing ratings
df['rating'] = df.apply(assign_rating, axis=1)
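The same imputation can be written without a Python-level loop using `groupby().transform`; a sketch on toy data (the second `fillna` is the global-mean backstop that the function above provides via `genre_avg_rating.get`):

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Fiction', 'Mystery', 'Mystery'],
                   'rating': [2.0, None, 4.0, None]})

# Fill each gap with its genre's mean rating...
df['rating'] = df['rating'].fillna(df.groupby('genre')['rating'].transform('mean'))
# ...and fall back to the overall mean for genres with no ratings at all
df['rating'] = df['rating'].fillna(df['rating'].mean())
```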
5. Standardize publication_date and Fill Missing Values¶
In [14]:
# Convert publication_date to datetime
df['publication_date'] = pd.to_datetime(df['publication_date'], errors='coerce')
# Find the earliest date in the dataset
earliest_date = df['publication_date'].min()
# Replace missing dates with the earliest date
df['publication_date'] = df['publication_date'].fillna(earliest_date)
df['publication_date'].describe()
Out[14]:
count                             2200
mean     2006-09-06 04:46:31.454636288
min                1990-01-01 00:00:00
25%      1998-01-15 12:52:45.982991360
50%      2006-07-19 21:25:07.353676800
75%      2015-04-30 15:12:52.586293248
max                2023-12-31 00:00:00
Name: publication_date, dtype: object
6. Remove Duplicate Books Using title and author Columns¶
In [16]:
# Remove duplicates, keeping the first occurrence
df_cleaned = df.drop_duplicates(subset=['title', 'author'], keep='first')
7. Deduplicate Entries Using Lowest book_id for Identical Books¶
In [18]:
# Sort by 'book_id', remove duplicates based on 'title', 'author', and 'isbn', keeping the first occurrence
df_cleaned = df.sort_values('book_id').drop_duplicates(subset=['title', 'author', 'isbn'], keep='first')
8. Consolidate Duplicate Books Using Median price and Aggregate Data¶
In [20]:
# Group by title, author, and isbn, then keep the median price
df_cleaned = df.groupby(['title', 'author', 'isbn']).agg({
'price': 'median',
'book_id': 'first',
'publication_date': 'first',
'genre': 'first',
'rating': 'mean',
'sales': 'sum'
}).reset_index()
9. Merge Multi-Genre Books and Aggregate Related Data¶
In [21]:
# Group by title, author, and isbn, then combine genres
df_cleaned = df.groupby(['title', 'author', 'isbn']).agg({
'genre': lambda x: ', '.join(set(x)),
'book_id': 'first',
'publication_date': 'first',
'price': 'mean',
'rating': 'mean',
'sales': 'sum'
}).reset_index()
10. Create Catchy Slogans for title Column Using Random Phrases¶
In [23]:
# Set the random seed for reproducibility
random.seed(42)
# Define a list of catchy marketing phrases
marketing_phrases = ["Must-Read!", "Unputdownable!", "Life-Changing!", "Epic Tale!", "Mind-Blowing!"]
# Add a random marketing phrase to the existing title
df['slogan'] = df['title'] + ' - ' + df['title'].apply(lambda x: random.choice(marketing_phrases))
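`random.choice` inside `apply` works, but NumPy's generator can draw all the phrases in a single call; a sketch on a toy titles Series (not the notebook's exact output, since the RNGs differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
titles = pd.Series(['Book 1', 'Book 2', 'Book 3'])
phrases = ["Must-Read!", "Unputdownable!", "Life-Changing!"]

# One vectorized draw of len(titles) phrases, concatenated onto the titles
slogans = titles + ' - ' + rng.choice(phrases, size=len(titles))
```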
11. Extract Hidden Message from isbn Column¶
In [27]:
# Extract characters at specific positions from isbn
df['hidden_message'] = df['isbn'].str[4] + df['isbn'].str[7] + df['isbn'].str[10]
# Replace NaN values with "No message"
df['hidden_message'] = df['hidden_message'].fillna("No message")
# Display the first few rows with hidden messages
df[['isbn', 'hidden_message']].head(10)
Out[27]:
| | isbn | hidden_message |
|---|---|---|
| 0 | CUSTOM-00001 | O00 |
| 1 | ISBN-0000001 | -00 |
| 2 | ISBN-0000002 | -00 |
| 3 | ISBN-0000003 | -00 |
| 4 | ISBN-0000004 | -00 |
| 5 | ISBN-0000005 | -00 |
| 6 | ISBN-0000006 | -00 |
| 7 | ISBN-0000007 | -00 |
| 8 | ISBN-0000008 | -00 |
| 9 | ISBN-0000009 | -00 |
12. Determine Publication Decade from publication_date Column¶
In [29]:
# Extract year from publication_date
df['year'] = df['publication_date'].dt.year
# Create decade column
df['decade'] = (df['year'] // 10 * 10).astype(str) + 's'
# Replace NaN values with "Unknown"
df['decade'] = df['decade'].fillna("Unknown")
# Display the distribution of books by decade
df['decade'].value_counts().sort_index()
Out[29]:
decade
1990s    678
2000s    640
2010s    621
2020s    261
Name: count, dtype: int64
13. Generate Author Pseudonyms Using author and genre Columns¶
In [31]:
# Set a fixed random seed
random.seed(42)
# Split author name and get first name
df['first_name'] = df['author'].str.split().str[0]
# Get first three letters of genre
df['genre_prefix'] = df['genre'].str[:3].str.upper()
# Generate random numbers using a list comprehension
df['random_num'] = [random.randint(10, 99) for _ in range(len(df))]
# Combine to create pseudonyms
df['pseudonym'] = df['first_name'] + df['genre_prefix'] + df['random_num'].astype(str)
# Replace NaN values with "Anonymous"
df['pseudonym'] = df['pseudonym'].fillna("Anonymous")
# Clean up temporary columns
df = df.drop(['first_name', 'genre_prefix', 'random_num'], axis=1)
# Display some examples
df[['author', 'genre', 'pseudonym']].head(10)
Out[31]:
| | author | genre | pseudonym |
|---|---|---|---|
| 0 | Alice Brown | Fiction | AliceFIC91 |
| 1 | Anonymous | Mystery | AnonymousMYS24 |
| 2 | Bob Johnson | Mystery | BobMYS13 |
| 3 | Anonymous | Unclassified | AnonymousUNC45 |
| 4 | Anonymous | Sci-fi | AnonymousSCI41 |
| 5 | Jane Smith | Mystery | JaneMYS38 |
| 6 | Bob Johnson | Fiction | BobFIC27 |
| 7 | Bob Johnson | Sci-fi | BobSCI23 |
| 8 | Bob Johnson | Non-Fiction | BobNON96 |
| 9 | Anonymous | Mystery | AnonymousMYS79 |
14. The Literary Diversity Challenge¶
In [33]:
# Work on a copy so df stays untouched
df_copy = df.copy()
# Count the distinct letters in each title (case-insensitive, ignoring spaces)
df_copy['unique_letter_count'] = df_copy['title'].str.lower().str.replace(' ', '').apply(lambda x: len(set(x)))
# Row whose title uses the most distinct letters
most_unique = df_copy.loc[df_copy['unique_letter_count'].idxmax()]
15. Adjust Negative Prices in price Column Using Genre Averages¶
In [34]:
# Calculate mean prices by genre
genre_mean_prices = df[df['price'] > 0].groupby('genre')['price'].mean()
# Function to replace negative prices
def replace_negative_price(row):
if row['price'] < 0:
return genre_mean_prices.get(row['genre'], df['price'].mean())
return row['price']
# Apply the function
df['price'] = df.apply(replace_negative_price, axis=1)
# Verify results
df[df['price'] < 0] # Should return an empty DataFrame
Out[34]:
| | Unnamed: 0 | book_id | title | author | publication_date | price | genre | rating | sales | in_stock | isbn | slogan | hidden_message | year | decade | pseudonym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
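The row-wise replacement function can also be expressed vectorized with `where`/`mask` and a groupby transform; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Fiction', 'Sci-fi'],
                   'price': [10.0, -5.0, 20.0]})

# Per-genre mean of the positive prices, broadcast back to every row
genre_mean = df['price'].where(df['price'] > 0).groupby(df['genre']).transform('mean')
# Swap in that mean wherever the price is negative
df['price'] = df['price'].mask(df['price'] < 0, genre_mean)
```

`where` turns the negative prices into NaN so they don't drag the genre means down, which matches the `df['price'] > 0` filter used above.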
16. Standardize in_stock Column to Boolean Values¶
In [36]:
# Define mapping for various representations
stock_mapping = {
'Y': True, 'y': True, 'Yes': True, 'yes': True,
'N': False, 'n': False, 'No': False, 'no': False
}
# Apply mapping and convert to boolean
df['in_stock'] = df['in_stock'].map(stock_mapping).astype('boolean')
# Fill NaN values with False (assuming missing means not in stock)
df['in_stock'] = df['in_stock'].fillna(False)
# Verify unique values
df['in_stock'].unique()
Out[36]:
<BooleanArray>
[False, True]
Length: 2, dtype: boolean
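Lower-casing the column first shrinks the mapping to two keys per meaning and also catches variants like "YES" that the explicit dictionary above would miss; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['Yes', 'no', 'Y', None, 'N'])

# Normalise case once, then one small mapping covers every capitalisation
in_stock = s.str.lower().map({'yes': True, 'y': True, 'no': False, 'n': False})
# Nullable boolean dtype, with missing treated as not in stock
in_stock = in_stock.astype('boolean').fillna(False)
```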
17. Extract Year from publication_date to New Column¶
In [38]:
# Extract year and create a new column
df['publication_year'] = df['publication_date'].dt.year
# Display the first few rows to verify the changes
df[['publication_date', 'publication_year']].head()
Out[38]:
| | publication_date | publication_year |
|---|---|---|
| 0 | 1990-01-01 00:00:00.000000000 | 1990 |
| 1 | 1990-01-07 05:04:42.741370685 | 1990 |
| 2 | 1990-01-13 10:09:25.482741370 | 1990 |
| 3 | 1990-01-19 15:14:08.224112056 | 1990 |
| 4 | 1990-01-25 20:18:50.965482741 | 1990 |
18. Categorizing Book Genres¶
In [ ]:
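This activity was left unanswered in the notebook. One plausible reading is converting genre to pandas' category dtype, which stores each distinct genre label once and is well suited to a column with a small fixed set of values; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Fiction', 'Mystery', 'Fiction', 'Sci-fi']})

# Categorical storage: one entry per distinct genre, integer codes per row
df['genre'] = df['genre'].astype('category')
```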
19. Which Data Type Best Suits the isbn Column?¶
In [ ]:
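Also unanswered here. Since ISBNs are identifiers rather than numbers (leading zeros matter and arithmetic is meaningless), pandas' nullable string dtype is a natural fit; a sketch:

```python
import pandas as pd

s = pd.Series(['ISBN-0000001', 'CUSTOM-00002'])

# Dedicated string dtype keeps ISBNs as text and supports missing values cleanly
s = s.astype('string')
```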
20. Mystery of the Missing Authors¶
In [ ]:
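Unanswered as well. One way to investigate the formerly missing authors (now filled with "Anonymous" in step 1) is to count the placeholder per genre; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'author': ['Anonymous', 'Alice Brown', 'Anonymous'],
                   'genre': ['Mystery', 'Fiction', 'Sci-fi']})

# How many books per genre ended up attributed to "Anonymous"?
anon_by_genre = df[df['author'] == 'Anonymous'].groupby('genre').size()
```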
21. The ISBN Cleanup Challenge¶
In [ ]:
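Unanswered. Assuming the challenge asks for normalizing inconsistent ISBN formatting (the exact rules are a guess), a regex-based sketch: uppercase everything and collapse spaces or underscores into a single hyphen:

```python
import pandas as pd

s = pd.Series(['ISBN-0000001', 'isbn 0000002', 'Custom-00003'])

# Uppercase, then turn any run of whitespace/underscores into one hyphen
cleaned = s.str.upper().str.replace(r'[\s_]+', '-', regex=True)
```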