Intro to Pandas for Data Analysis
Practice DataFrame Mutations using Goodreads Books and Reviews Data
Import the libraries and load the dataset
In [1]:
import warnings
# Ignore FutureWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
In [2]:
import pandas as pd
df = pd.read_csv('Best_Books_Ever.csv')
In [3]:
df.head()
Out[3]:
 | bookId | title | series | author | rating | description | language | isbn | genres | characters | ... | firstPublishDate | awards | numRatings | ratingsByStars | likedPercent | setting | coverImg | bbeScore | bbeVotes | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.Harry_Potter_and_the_Order_of_the_Phoenix | Harry Potter and the Order of the Phoenix | Harry Potter #5 | J.K. Rowling, Mary GrandPré (Illustrator) | 4.50 | There is a door at the end of a silent corrido... | English | 9780439358071 | ['Fantasy', 'Young Adult', 'Fiction', 'Magic',... | ['Sirius Black', 'Draco Malfoy', 'Ron Weasley'... | ... | 2003-06-21 | ['Bram Stoker Award for Works for Young Reader... | 2507623 | ['1593642', '637516', '222366', '39573', '14526'] | 98.0 | ['Hogwarts School of Witchcraft and Wizardry (... | https://i.gr-assets.com/images/S/compressed.ph... | 2632233 | 26923 | 7.38 |
1 | 30.J_R_R_Tolkien_4_Book_Boxed_Set | J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an... | The Lord of the Rings #0-3 | J.R.R. Tolkien | 4.60 | This four-volume, boxed set contains J.R.R. To... | English | 9780345538376 | ['Fantasy', 'Fiction', 'Classics', 'Adventure'... | ['Frodo Baggins', 'Gandalf', 'Bilbo Baggins', ... | ... | 2055-10-20 | [] | 110146 | ['78217', '22857', '6628', '1477', '967'] | 98.0 | ['Middle-earth'] | https://i.gr-assets.com/images/S/compressed.ph... | 1159802 | 12111 | 21.15 |
2 | 375802.Ender_s_Game | Ender's Game | Ender's Saga #1 | Orson Scott Card, Stefan Rudnicki (Narrator), ... | 4.30 | Andrew "Ender" Wiggin thinks he is playing com... | English | 9780812550702 | ['Science Fiction', 'Fiction', 'Young Adult', ... | ['Dink', 'Bernard', 'Valentine Wiggin', 'Peter... | ... | 1985-10-28 | ['Hugo Award for Best Novel (1986)', 'Nebula A... | 1131303 | ['603209', '339819', '132305', '35667', '20303'] | 95.0 | [] | https://i.gr-assets.com/images/S/compressed.ph... | 720651 | 7515 | 4.60 |
3 | 17245.Dracula | Dracula | Dracula #1 | Bram Stoker, Nina Auerbach (Editor), David J. ... | 4.00 | You can find an alternative cover edition for ... | English | 9780393970128 | ['Classics', 'Horror', 'Fiction', 'Fantasy', '... | ['Jonathan Harker', 'Lucy Westenra', 'Abraham ... | ... | 1997-05-26 | [] | 938325 | ['345260', '329217', '197206', '48642', '18000'] | 93.0 | ['Transylvania (Romania)', 'Budapest (Hungary)... | https://i.gr-assets.com/images/S/compressed.ph... | 646782 | 6988 | 4.55 |
4 | 28187.The_Lightning_Thief | The Lightning Thief | Percy Jackson and the Olympians #1 | Rick Riordan (Goodreads Author) | 4.26 | Alternate cover for this ISBN can be found her... | English | 9780786838653 | ['Fantasy', 'Young Adult', 'Mythology', 'Ficti... | ['Annabeth Chase', 'Grover Underwood', 'Luke C... | ... | 2005-06-28 | ["Young Readers' Choice Award (2008)", 'Books ... | 1992300 | ['1006885', '604999', '289310', '64014', '27092'] | 95.0 | ['New York City, New York (United States)', 'M... | https://i.gr-assets.com/images/S/compressed.ph... | 597132 | 6370 | 1.79 |
5 rows × 25 columns
In [4]:
df.shape
Out[4]:
(794, 25)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 794 entries, 0 to 793
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   bookId            794 non-null    object
 1   title             794 non-null    object
 2   series            794 non-null    object
 3   author            794 non-null    object
 4   rating            794 non-null    float64
 5   description       794 non-null    object
 6   language          794 non-null    object
 7   isbn              794 non-null    int64
 8   genres            794 non-null    object
 9   characters        794 non-null    object
 10  bookFormat        794 non-null    object
 11  edition           794 non-null    object
 12  pages             794 non-null    int64
 13  publisher         794 non-null    object
 14  publishDate       794 non-null    object
 15  firstPublishDate  794 non-null    object
 16  awards            794 non-null    object
 17  numRatings        794 non-null    int64
 18  ratingsByStars    794 non-null    object
 19  likedPercent      794 non-null    float64
 20  setting           794 non-null    object
 21  coverImg          794 non-null    object
 22  bbeScore          794 non-null    int64
 23  bbeVotes          794 non-null    int64
 24  price             794 non-null    float64
dtypes: float64(3), int64(5), object(17)
memory usage: 155.2+ KB
Activities
Activity 1. Calculating the Price-to-Rating Ratio
In [6]:
df['price_to_rating'] = df['price'] / df['rating']
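The division above is element-wise, so a rating of 0 would produce an infinite ratio. The following is a minimal sketch on made-up toy data (not the real dataset) of one way to guard against that; whether the real data contains zero ratings is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset; the middle rating is deliberately 0.
toy = pd.DataFrame({"price": [7.38, 21.15, 4.60], "rating": [4.5, 0.0, 4.3]})

# Turn zero ratings into NaN first so the ratio is missing instead of inf.
toy["price_to_rating"] = toy["price"] / toy["rating"].replace(0, np.nan)
print(toy)
```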
Activity 2. Remove the "isbn" Column
In [7]:
del df['isbn']
Activity 3. Extract and Add the "Year Published" Column
In [8]:
df['YearPublished'] = df['publishDate'].map(lambda x:x[:x.find('-')]).astype('int32')
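The same extraction can be written with the vectorized string accessor instead of a lambda. This is a sketch on toy data, assuming dates are stored as 'YYYY-MM-DD' strings as the code above implies.

```python
import pandas as pd

# Toy dates in the assumed 'YYYY-MM-DD' layout.
toy = pd.DataFrame({"publishDate": ["2003-06-21", "1985-10-28", "1997-05-26"]})

# Split on '-' and keep the first piece, then cast to a 32-bit integer.
toy["YearPublished"] = toy["publishDate"].str.split("-").str[0].astype("int32")
print(toy)
```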
Activity 4. Filter Books with Ratings Above 4.5
In [9]:
# Note: >= also keeps books rated exactly 4.5
best_books = df[df['rating'] >= 4.5]
Activity 5. Count and Add the Number of Genres
In [10]:
import ast  # literal_eval parses the stringified list without executing code
df['GenreCount'] = df['genres'].map(lambda x: len(ast.literal_eval(x)))
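Each genres value is a Python list stored as a string, so the count can be checked by parsing the string. The sketch below, on made-up toy strings, compares parsing with counting quote characters; the two differ when a genre name itself contains an apostrophe.

```python
import ast
import pandas as pd

# Toy genre strings; the second one contains an apostrophe inside a name.
toy = pd.DataFrame({"genres": [
    "['Fantasy', 'Fiction']",
    "['Fiction', \"Children's\"]",
    "[]",
]})

# Parse each string into a real list and take its length.
toy["GenreCount"] = toy["genres"].map(ast.literal_eval).map(len)

# For comparison: count single quotes and halve the result.
toy["QuoteCount"] = toy["genres"].map(lambda s: s.count("'") // 2)
print(toy)
```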
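Activity 6. Split Author Names into First and Last Name Columns
In [11]:
# Everything before the last whitespace-separated token is the first name;
# single-token names keep the full string as FirstName and get no LastName.
df['FirstName'] = df['author'].map(lambda x: " ".join(x.split()[:-1]) if len(x.split()) > 1 else x)
df['LastName'] = df['author'].map(lambda x: x.split()[-1] if len(x.split()) > 1 else None)
df.loc[380, 'LastName']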
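A similar split can be sketched with `Series.str.rsplit`, which cuts once from the right. The toy names below are illustrative only; multi-author strings such as those in the real author column would still need separate handling.

```python
import pandas as pd

# Toy author names, including one single-token name.
toy = pd.DataFrame({"author": ["Harper Lee", "F. Scott Fitzgerald", "Homer"]})

# Split once from the right on whitespace; a single-token name leaves the second part missing.
parts = toy["author"].str.rsplit(n=1, expand=True)
toy["FirstName"] = parts[0]
toy["LastName"] = parts[1]
print(toy)
```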
Activity 7. Drop Books with Fewer than 100 Pages
In [12]:
df.drop(df[df['pages']<100].index,inplace=True)
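Dropping rows by index keeps the original labels of the remaining rows, so the index is no longer contiguous afterwards. A boolean mask gives the same rows, as this toy sketch suggests.

```python
import pandas as pd

# Toy frame with one book under 100 pages.
toy = pd.DataFrame({"title": ["A", "B", "C"], "pages": [80, 100, 320]})

# Keep only rows with at least 100 pages; surviving rows keep their old labels.
kept = toy[toy["pages"] >= 100]
print(kept.index.tolist())
```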
Activity 8. Extract the Primary Genre
In [13]:
import ast  # literal_eval parses the list string safely; eval would execute arbitrary code
df['PrimaryGenre'] = df['genres'].map(lambda x: next(iter(ast.literal_eval(x)), None))
Activity 9. Flag Books with Multiple Awards
In [14]:
import ast  # literal_eval parses the list string safely; eval would execute arbitrary code
df['MultipleAwards'] = df['awards'].map(lambda x: len(ast.literal_eval(x)) > 1)
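Both list-like columns (genres and awards) hold stringified Python lists. The sketch below shows one way to derive a primary genre and a multiple-awards flag by parsing them once with `ast.literal_eval`, which only accepts literals. The helper name `first_or_none` and the toy values (including 'Example Award') are hypothetical.

```python
import ast
import pandas as pd

def first_or_none(list_string):
    """Hypothetical helper: first element of a stringified list, or None if it is empty."""
    items = ast.literal_eval(list_string)
    return items[0] if items else None

# Toy rows; the award names are placeholders for illustration.
toy = pd.DataFrame({
    "genres": ["['Fantasy', 'Fiction']", "[]"],
    "awards": ["['Hugo Award for Best Novel (1986)', 'Example Award']", "[]"],
})

toy["PrimaryGenre"] = toy["genres"].map(first_or_none)
# True only when the parsed awards list has more than one entry.
toy["MultipleAwards"] = toy["awards"].map(lambda s: len(ast.literal_eval(s)) > 1)
print(toy[["PrimaryGenre", "MultipleAwards"]])
```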
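Activity 10. Estimate Reading Time Based on Page Count
In [15]:
# pages * 300 / 250 gives minutes (presumably 300 words per page at 250 words per minute); / 60 converts to hours
df['ReadingTimeHours'] = df['pages'] * 300 / 250 / 60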
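The arithmetic can be spelled out with named constants. Reading the 300 and 250 in the formula as words per page and words per minute is an assumption; the notebook does not state it, and the function name is hypothetical.

```python
WORDS_PER_PAGE = 300     # assumed meaning of the 300 in the formula
WORDS_PER_MINUTE = 250   # assumed meaning of the 250 in the formula

def reading_time_hours(pages):
    """Hypothetical helper: estimated hours to read a book with the given page count."""
    minutes = pages * WORDS_PER_PAGE / WORDS_PER_MINUTE
    return minutes / 60

# A 250-page book: 75,000 words / 250 wpm = 300 minutes = 5 hours.
print(reading_time_hours(250))
```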
Activity 11. Flag Books Published in Year 2000 Onwards
In [16]:
df['Published21stCentury'] = df['YearPublished'] >= 2000
Activity 12. Simplifying the DataFrame by Dropping Columns
In [17]:
df.drop(['coverImg','description','ratingsByStars'],axis=1,inplace=True)
Activity 13. Adding a New Book Entry
In [18]:
new_book = {
    "bookId": '10000',
    "title": "The Great Gatsby",
    "author": "F. Scott Fitzgerald",
    "rating": 3.9,
    "pages": 180,
    "publishDate": '1925-04-10',
    "publisher": "Scribner",
    "price": 7.99,
    "genres": "['Fiction', 'Classics']",
    "GenreCount": 2,
    "FirstName": "F.",
    "LastName": "Fitzgerald",
    "PrimaryGenre": "Fiction",
    "MultipleAwards": False,
    "ReadingTimeHours": 9.0,
    "Published21stCentury": True
}
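new_book = pd.DataFrame(new_book, index=[len(df)])
# Rows were dropped earlier, so the label len(df) may already exist; renumber to keep the index unique.
df = pd.concat([df, new_book], ignore_index=True)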
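Because Activity 7 dropped rows, the index has gaps, and a label such as `len(df)` may already be in use. The toy sketch below illustrates how a plain concat can then yield a duplicate label, and how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C", "D"], "pages": [80, 200, 300, 400]})
toy = toy.drop(toy[toy["pages"] < 100].index)   # remaining labels: 1, 2, 3

# len(toy) is 3, which is already the label of the last remaining row.
new_row = pd.DataFrame({"title": "E", "pages": 180}, index=[len(toy)])

dup = pd.concat([toy, new_row])                       # keeps labels as they are
clean = pd.concat([toy, new_row], ignore_index=True)  # renumbers from 0

print(dup.index.tolist(), clean.index.tolist())
```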
Activity 14. Transforming Publish Dates into Datetime Format
In [19]:
df['publishDate'] = pd.to_datetime(df['publishDate'])
df['firstPublishDate'] = pd.to_datetime(df['firstPublishDate'])
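If some date strings cannot be parsed, `pd.to_datetime` raises by default. A sketch on toy data of the `errors="coerce"` option, which turns unparseable or missing values into `NaT`; the 'YYYY-MM-DD' format string is an assumption about the data.

```python
import pandas as pd

# Toy column with one valid date, one malformed string, and one missing value.
toy = pd.DataFrame({"publishDate": ["2003-06-21", "not a date", None]})

# Invalid or missing entries become NaT instead of raising an error.
toy["publishDate"] = pd.to_datetime(toy["publishDate"], format="%Y-%m-%d", errors="coerce")
print(toy)
```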
Activity 15. Bulk Adding New Book Entries to the DataFrame
In [20]:
new_books = [
    {
        "bookId": '10001',
        "title": "To Kill a Mockingbird",
        "author": "Harper Lee",
        "rating": 4.3,
        "pages": 281,
        "publishDate": pd.to_datetime('1960-07-11'),
        "firstPublishDate": pd.to_datetime('1960-07-11'),
        "publisher": "J.B. Lippincott & Co.",
        "price": 9.99,
        "genres": "['Fiction', 'Classics']",
        "GenreCount": 2,
        "FirstName": "Harper",
        "LastName": "Lee",
        "PrimaryGenre": "Fiction",
        "MultipleAwards": False,
        "ReadingTimeHours": 11.24,
        "Published21stCentury": False
    },
    {
        "bookId": '10002',
        "title": "1984",
        "author": "George Orwell",
        "rating": 4.2,
        "pages": 328,
        "publishDate": pd.to_datetime('1949-06-08'),
        "firstPublishDate": pd.to_datetime('1949-06-08'),
        "publisher": "Secker & Warburg",
        "price": 12.99,
        "genres": "['Fiction', 'Classics']",
        "GenreCount": 2,
        "FirstName": "George",
        "LastName": "Orwell",
        "PrimaryGenre": "Fiction",
        "MultipleAwards": False,
        "ReadingTimeHours": 13.12,
        "Published21stCentury": False
    }
]
# Build one DataFrame from the list of dicts and append it, renumbering the index.
df = pd.concat([df, pd.DataFrame(new_books)], ignore_index=True)
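The new entries only supply some of the DataFrame's columns. As the toy sketch below suggests, `pd.concat` aligns on column names and fills columns that the new rows lack with missing values.

```python
import pandas as pd

# Toy frame with a 'series' column that the new rows do not provide.
toy = pd.DataFrame({"title": ["A"], "series": ["S #1"], "pages": [100]})
rows = [{"title": "B", "pages": 281}, {"title": "C", "pages": 328}]

# Columns are aligned by name; 'series' is NaN for the appended rows.
toy = pd.concat([toy, pd.DataFrame(rows)], ignore_index=True)
print(toy)
```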