Statement of Completion#d662984e
Data Cleaning with Pandas
medium
Capstone Project: Cleaning Google Playstore data
Resolution
Activities
Project.ipynb
In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
In [9]:
df = pd.read_csv('googleplaystore.csv')
df.sample(5)
Out[9]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5299 | AK 47 Lock Screen | PERSONALIZATION | 4.6 | 5 | 8.8M | 500+ | Free | 0 | Mature 17+ | Personalization | 11-Feb-18 | 1 | 4.0.3 and up |
2800 | ASOS | SHOPPING | 4.7 | 181823 | 22M | 10,000,000+ | Free | 0 | Everyone | Shopping | 30-Jul-18 | 4.4.9 | 4.4 and up |
9240 | tv-ec | FAMILY | NaN | 1 | 3.4M | 50+ | Free | 0 | Everyone | Entertainment | 7-Apr-18 | 8.1 | 4.0.3 and up |
297 | LINE WEBTOON - Free Comics | COMICS | 4.5 | 1013635 | Varies with device | 10,000,000+ | Free | 0 | Teen | Comics | 31-Jul-18 | Varies with device | Varies with device |
9711 | EP Lab Digest | MEDICAL | 2.3 | 11 | 12M | 500+ | Free | 0 | Everyone | Medical | 12-Jul-18 | 1000000.6 | 4.1 and up |
Data Cleaning¶
1. Which of the following column(s) has/have null values?¶
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10841 entries, 0 to 10840 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 10841 non-null object 1 Category 10841 non-null object 2 Rating 9367 non-null float64 3 Reviews 10841 non-null object 4 Size 10841 non-null object 5 Installs 10841 non-null object 6 Type 10840 non-null object 7 Price 10841 non-null object 8 Content Rating 10840 non-null object 9 Genres 10841 non-null object 10 Last Updated 10841 non-null object 11 Current Ver 10833 non-null object 12 Android Ver 10838 non-null object dtypes: float64(1), object(12) memory usage: 1.1+ MB
2. Clean the Rating
column and the other columns containing null values¶
In [11]:
df.loc[df["Rating"] > 5, "Rating"] = np.nan
In [12]:
df.fillna({"Rating": df["Rating"].mean()}, inplace=True)
In [13]:
df.dropna(inplace=True)
3. Clean the column Reviews
and make it numeric¶
In [15]:
new_r = (
pd.to_numeric(
df.loc[df["Reviews"].str.contains("M"), "Reviews"].str.replace("M", ""))* 1000000).astype(str)
new_r
Out[15]:
72 2000000.0 1778 1000000.0 1781 6400000.0 Name: Reviews, dtype: object
In [16]:
df.loc[df["Reviews"].str.contains("M"), "Reviews"] = new_r
In [17]:
df["Reviews"] = pd.to_numeric(df["Reviews"])
4. How many duplicated apps are there?¶
In [19]:
df.duplicated(subset="App", keep=False).sum()
Out[19]:
1979
5. Drop duplicated apps keeping the ones with the greatest number of reviews¶
In [20]:
df.sort_values(by=["App", "Reviews"], inplace=True)
In [21]:
df.drop_duplicates(subset=["App"], keep="last", inplace=True)
6. Format the Category
column¶
In [23]:
df["Category"] = df["Category"].str.replace("_", " ").str.capitalize()
In [ ]:
7. Clean and convert the Installs
column to numeric type¶
In [25]:
df.loc[pd.to_numeric(df["Installs"], errors="coerce").isna()].head(2)
Out[25]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8884 | "i DT" Fútbol. Todos Somos Técnicos. | Sports | 4.197727 | 27.0 | 3.6M | 500+ | Free | 0 | Everyone | Sports | 7-Oct-17 | 0.22 | 4.1 and up |
324 | #NAME? | Comics | 3.500000 | 115.0 | 9.1M | 10,000+ | Free | 0 | Mature 17+ | Comics | 13-Jul-18 | 5.0.12 | 5.0 and up |
In [26]:
df["Installs"] = pd.to_numeric(df["Installs"].str.replace("+", "").str.replace(",", ""))
8. Clean and convert the Size
column to numeric (representing bytes)¶
In [32]:
df["Size"] = df["Size"].str.replace("Varies with device", "0")
In [28]:
df.loc[df["Size"].str.contains('k'), "Size"] = (
pd.to_numeric(
df.loc[df["Size"].str.contains('k'), "Size"].str.replace("k", "")
) * 1024
).astype(str)
In [29]:
df.loc[df["Size"].str.contains('M'), "Size"] = (
pd.to_numeric(
df.loc[df["Size"].str.contains('M'), "Size"].str.replace("M", "")
) * 1024 * 1024
).astype(str)
In [33]:
df["Size"] = pd.to_numeric(df["Size"])
9. Clean and convert the Price
column to numeric¶
In [35]:
df["Price"] = pd.to_numeric(df["Price"].str.replace("Free", "0").str.replace("$", ""))
10. Paid or free?¶
In [37]:
df["Distribution"] = df["Price"].apply(lambda p: "Paid" if p > 0 else "Free")
Analysis¶
11. Which app has the most reviews?¶
In [39]:
df.sort_values(by=["Reviews"], ascending=False).head(2)
Out[39]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2544 | Social | 4.1 | 78158306.0 | 0.0 | 1000000000 | Free | 0.0 | Teen | Social | 3-Aug-18 | Varies with device | Varies with device | Free | |
381 | WhatsApp Messenger | Communication | 4.4 | 69119316.0 | 0.0 | 1000000000 | Free | 0.0 | Everyone | Communication | 3-Aug-18 | Varies with device | Varies with device | Free |
12. What category has the highest number of apps uploaded to the store?¶
In [40]:
df["Category"].value_counts()
Out[40]:
Category Family 1874 Game 945 Tools 827 Business 420 Medical 395 Productivity 374 Personalization 374 Lifestyle 369 Finance 345 Sports 325 Communication 315 Health and fitness 288 Photography 281 News and magazines 254 Social 239 Books and reference 221 Travel and local 219 Shopping 202 Dating 170 Video players 164 Maps and navigation 131 Food and drink 112 Education 105 Entertainment 86 Auto and vehicles 85 Libraries and demo 83 Weather 79 House and home 73 Events 64 Parenting 60 Art and design 60 Comics 56 Beauty 53 Name: count, dtype: int64
13. To which category belongs the most expensive app?¶
In [41]:
df.sort_values(by=["Price"], ascending=False).head(2)
Out[41]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4367 | I'm Rich - Trump Edition | Lifestyle | 3.6 | 275.0 | 7654604.8 | 10000 | Paid | 400.00 | Everyone | Lifestyle | 3-May-18 | 1.0.1 | 4.1 and up | Paid |
5356 | I Am Rich Premium | Finance | 4.1 | 1867.0 | 4928307.2 | 50000 | Paid | 399.99 | Everyone | Finance | 12-Nov-17 | 1.6 | 4.0 and up | Paid |
14. What's the name of the most expensive game?¶
In [44]:
df.loc[df["Category"] == "Game"].sort_values(by=["Price"], ascending=False).head(2)
Out[44]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4203 | The World Ends With You | Game | 4.6 | 4108.0 | 13631488.0 | 10000 | Paid | 17.99 | Everyone 10+ | Arcade | 14-Dec-15 | 1.0.4 | 4.0 and up | Paid |
10782 | Trine 2: Complete Story | Game | 3.8 | 252.0 | 11534336.0 | 10000 | Paid | 16.99 | Teen | Action | 27-Feb-15 | 2.22 | 5.0 and up | Paid |
15. Which is the most popular Finance App?¶
In [43]:
df.loc[df["Category"] == "Finance"].sort_values(by=["Installs"], ascending=False).head(2)
Out[43]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5601 | Google Pay | Finance | 4.2 | 348132.0 | 0.0 | 100000000 | Free | 0.0 | Everyone | Finance | 26-Jul-18 | 2.70.206190089 | Varies with device | Free |
1156 | PayPal | Finance | 4.3 | 659760.0 | 49283072.0 | 50000000 | Free | 0.0 | Everyone | Finance | 18-Jul-18 | 6.28.0 | 4.4 and up | Free |
16. What Teen Game has the most reviews?¶
In [47]:
df.query("Category == 'Game' and `Content Rating` == 'Teen'").sort_values(by=["Reviews"], ascending=False).head(2)
Out[47]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3912 | Asphalt 8: Airborne | Game | 4.5 | 8389714.0 | 96468992.0 | 100000000 | Free | 0.0 | Teen | Racing | 4-Jul-18 | 3.7.1a | 4.0.3 and up | Free |
5417 | Mobile Legends: Bang Bang | Game | 4.4 | 8219586.0 | 103809024.0 | 100000000 | Free | 0.0 | Teen | Action | 24-Jul-18 | 1.2.97.3042 | 4.0.3 and up | Free |
17. Which is the free game with the most reviews?¶
In [48]:
df.query("Distribution == 'Free' and Category == 'Game'").sort_values(by="Reviews", ascending=False).head(2)
Out[48]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Distribution | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1879 | Clash of Clans | Game | 4.6 | 44893888.0 | 102760448.0 | 100000000 | Free | 0.0 | Everyone 10+ | Strategy | 15-Jul-18 | 10.322.16 | 4.1 and up | Free |
1917 | Subway Surfers | Game | 4.5 | 27725352.0 | 79691776.0 | 1000000000 | Free | 0.0 | Everyone 10+ | Arcade | 12-Jul-18 | 1.90.0 | 4.1 and up | Free |
18. How many Tb (tebibytes) were transferred (overall) for the most popular Lifestyle app?¶
In [52]:
app = df.query("Category == 'Lifestyle'").sort_values(by=["Installs"], ascending=False).iloc[0]
app
Out[52]:
App Tinder Category Lifestyle Rating 4.0 Reviews 2789775.0 Size 71303168.0 Installs 100000000 Type Free Price 0.0 Content Rating Mature 17+ Genres Lifestyle Last Updated 2-Aug-18 Current Ver 9.5.0 Android Ver 4.4 and up Distribution Free Name: 4587, dtype: object
In [53]:
(app["Installs"] * app["Size"]) / (1024 * 1024 * 1024 * 1024)
Out[53]:
6484.9853515625
In [ ]: