Statement of Completion#edd76006
Data Cleaning with Pandas
medium
Capstone Project: Cleaning Google Playstore data
Resolution
Activities
Project.ipynb
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
In [26]:
df = pd.read_csv('googleplaystore.csv')
Data Cleaning¶
1. Which of the following column(s) has/have null values?¶
In [25]:
df.isna().sum().sort_values(ascending=False)
Out[25]:
Current Ver 8 Android Ver 3 Type 1 Content Rating 1 App 0 Size 0 Reviews 0 Rating 0 Category 0 Price 0 Installs 0 Last Updated 0 Genres 0 dtype: int64
2. Clean the Rating
column and the other columns containing null values¶
In [8]:
df.head(1)
Out[8]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | Free | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
In [15]:
df['Rating'].plot(kind='hist')
Out[15]:
<Axes: ylabel='Frequency'>
In [28]:
df.loc[df['Rating'] > 5, 'Rating'] = np.nan
In [29]:
# For `Rating`, which is numeric:
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
# For the other columns
df.dropna(inplace=True)
3. Clean the column Reviews
and make it numeric¶
In [31]:
df.loc[
df['Reviews'].str.contains('M'),
'Reviews'
] = (pd.to_numeric(df.loc[
df['Reviews'].str.contains('M'),
'Reviews'
].str.replace('M', ''))*1000000).astype('str')
In [32]:
df['Reviews'] = pd.to_numeric(df['Reviews'])
4. How many duplicated apps are there?¶
In [34]:
df.head(1)
Out[34]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159.0 | 19M | 10,000+ | Free | Free | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
In [37]:
df.loc[
df.duplicated(subset=['App'], keep=False)
].shape
Out[37]:
(1979, 13)
5. Drop duplicated apps keeping the ones with the greatest number of reviews¶
In [38]:
# df_copy_5 = df.copy()
In [44]:
# df = df_copy_5.copy()
In [45]:
df = df.sort_values(by=['App', 'Reviews'])
In [46]:
df = df.loc[
~df.duplicated(subset=['App'], keep='last')
]
6. Format the Category
column¶
In [51]:
df['Category'] = df['Category'].str.replace('_', ' ')
In [54]:
df['Category'] = df['Category'].str.capitalize()
7. Clean and convert the Installs
column to numeric type¶
In [56]:
# df_copy_7 = df.copy()
In [62]:
df['Installs'] = df['Installs'].str.replace('+', '').str.replace(',', '')
In [63]:
df['Installs'] = pd.to_numeric(df['Installs'])
8. Clean and convert the Size
column to numeric (representing bytes)¶
In [65]:
# df_copy_8 = df.copy()
In [66]:
df.loc[
df['Size'].str.contains('k'),
'Size'
] = (pd.to_numeric(df.loc[
df['Size'].str.contains('k'),
'Size'
].str.replace('k', '')) * 1024).astype('str')
In [67]:
df.loc[
df['Size'].str.contains('M'),
'Size'
] = (pd.to_numeric(df.loc[
df['Size'].str.contains('M'),
'Size'
].str.replace('M', '')) * 1024 * 1024).astype('str')
In [69]:
df.loc[
pd.to_numeric(df['Size'], errors='coerce').isna(),
'Size'
] = '0'
In [70]:
df['Size'] = pd.to_numeric(df['Size'])
9. Clean and convert the Price
column to numeric¶
In [73]:
df['Price'] = pd.to_numeric(df['Price'].str.replace('$', '').str.replace('Free', '0'))
10. Paid or free?¶
In [75]:
df['Distribution'] = df['Price'].apply(lambda x: 'Paid' if x > 0 else 'Free')
Analysis¶
11. Which app has the most reviews?¶
In [78]:
# df.sort_values(by='Reviews', ascending=False).head()
12. What category has the highest number of apps uploaded to the store?¶
In [80]:
df.columns
Out[80]:
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', 'Distribution'], dtype='object')
In [82]:
df['Category'].value_counts()
Out[82]:
Category Family 1874 Game 945 Tools 827 Business 420 Medical 395 Productivity 374 Personalization 374 Lifestyle 369 Finance 345 Sports 325 Communication 315 Health and fitness 288 Photography 281 News and magazines 254 Social 239 Books and reference 221 Travel and local 219 Shopping 202 Dating 170 Video players 164 Maps and navigation 131 Food and drink 112 Education 105 Entertainment 86 Auto and vehicles 85 Libraries and demo 83 Weather 79 House and home 73 Events 64 Parenting 60 Art and design 60 Comics 56 Beauty 53 Name: count, dtype: int64
13. To which category belongs the most expensive app?¶
In [84]:
# df.sort_values(by='Price', ascending=False)
14. What's the name of the most expensive game?¶
In [85]:
df.loc[
df['Category'] == 'Game'
].sort_values(by='Price', ascending=False)
Out[85]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4203 | The World Ends With You | Game | 4.600000 | 4108.0 | 13631488.0 | 10000 | Paid | 17.99 | Everyone 10+ | Arcade | 14-Dec-15 | 1.0.4 | 4.0 and up | Paid |
10782 | Trine 2: Complete Story | Game | 3.800000 | 252.0 | 11534336.0 | 10000 | Paid | 16.99 | Teen | Action | 27-Feb-15 | 2.22 | 5.0 and up | Paid |
6341 | Blackjack Verite Drills | Game | 4.600000 | 17.0 | 4928307.2 | 100 | Paid | 14.00 | Teen | Casino | 9-Jul-17 | 1.1.10 | 3.0 and up | Paid |
1838 | Star Wars â„¢: DIRTY | Game | 4.500000 | 38207.0 | 15728640.0 | 100000 | Paid | 9.99 | Teen | Role Playing | 19-Oct-15 | 1.0.6 | 4.1 and up | Paid |
6198 | Backgammon NJ for Android | Game | 4.400000 | 1644.0 | 15728640.0 | 10000 | Paid | 7.99 | Everyone | Board | 5-Apr-17 | 4.1 | 2.3.3 and up | Paid |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10522 | Drift Legends | Game | 4.200000 | 33788.0 | 28311552.0 | 1000000 | Free | 0.00 | Everyone | Racing | 29-Mar-18 | 1.8.5 | 4.1 and up | Free |
4434 | Drink-O-Tron The Drinking Game | Game | 4.100000 | 140.0 | 47185920.0 | 50000 | Free | 0.00 | Mature 17+ | Card | 31-May-17 | 1.64 | 4.0.3 and up | Free |
10508 | Drive 4x4 Luxury SUV Jeep | Game | 4.200000 | 2183.0 | 48234496.0 | 500000 | Free | 0.00 | Everyone | Racing | 10-Jul-18 | 1.12 | 2.3 and up | Free |
6967 | Driving Cars Simulator Citroen | Game | 4.197727 | 3.0 | 51380224.0 | 100 | Free | 0.00 | Everyone | Racing | 5-Apr-18 | 1 | 4.4 and up | Free |
1956 | Dragon Hills | Game | 4.500000 | 354373.0 | 69206016.0 | 10000000 | Free | 0.00 | Everyone 10+ | Action | 12-Oct-17 | 1.2.7 | 4.1 and up | Free |
945 rows × 14 columns
15. Which is the most popular Finance App?¶
In [86]:
df.loc[
df['Category'] == 'Finance'
].sort_values(by='Installs', ascending=False).head(1)
Out[86]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5601 | Google Pay | Finance | 4.2 | 348132.0 | 0.0 | 100000000 | Free | 0.0 | Everyone | Finance | 26-Jul-18 | 2.70.206190089 | Varies with device | Free |
16. What Teen Game has the most reviews?¶
In [87]:
df.loc[
(df['Category'] == 'Game') &
(df['Content Rating'] == 'Teen')
].sort_values(by='Reviews', ascending=False).head(1)
Out[87]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3912 | Asphalt 8: Airborne | Game | 4.5 | 8389714.0 | 96468992.0 | 100000000 | Free | 0.0 | Teen | Racing | 4-Jul-18 | 3.7.1a | 4.0.3 and up | Free |
17. Which is the free game with the most reviews?¶
In [89]:
df.loc[
(df['Category'] == 'Game') &
(df['Distribution'] == 'Free')
].sort_values(by='Reviews', ascending=False).head(1)
Out[89]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1879 | Clash of Clans | Game | 4.6 | 44893888.0 | 102760448.0 | 100000000 | Free | 0.0 | Everyone 10+ | Strategy | 15-Jul-18 | 10.322.16 | 4.1 and up | Free |
18. How many Tb (tebibytes) were transferred (overall) for the most popular Lifestyle app?¶
In [91]:
df.loc[
df['Category'] == 'Lifestyle'
].sort_values(by='Installs', ascending=False).head(1)
Out[91]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4587 | Tinder | Lifestyle | 4.0 | 2789775.0 | 71303168.0 | 100000000 | Free | 0.0 | Mature 17+ | Lifestyle | 2-Aug-18 | 9.5.0 | 4.4 and up | Free |
In [ ]: