In [8]:

import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno

In [9]:

df = pd.read_csv('googleplaystore.csv')
df.sample(5)

Out[9]:

	App	Category	Rating	Reviews	Size	Installs	Type	Content Rating	Genres	Last Updated	Current Ver	Android Ver
5299	AK 47 Lock Screen	PERSONALIZATION	4.6	5	8.8M	500+	Free	Mature 17+	Personalization	11-Feb-18	1	4.0.3 and up
2800	ASOS	SHOPPING	4.7	181823	22M	10,000,000+	Free	Everyone	Shopping	30-Jul-18	4.4.9	4.4 and up
9240	tv-ec	FAMILY	NaN	1	3.4M	50+	Free	Everyone	Entertainment	7-Apr-18	8.1	4.0.3 and up
297	LINE WEBTOON - Free Comics	COMICS	4.5	1013635	Varies with device	10,000,000+	Free	Teen	Comics	31-Jul-18	Varies with device	Varies with device
9711	EP Lab Digest	MEDICAL	2.3	11	12M	500+	Free	Everyone	Medical	12-Jul-18	1000000.6	4.1 and up

Data Cleaning¶

1. Which of the following column(s) has/have null values?¶

In [10]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

2. Clean the `Rating` column and the other columns containing null values¶

In [11]:

df.loc[df["Rating"] > 5, "Rating"] = np.nan

In [12]:

df.fillna({"Rating": df["Rating"].mean()}, inplace=True)

In [13]:

df.dropna(inplace=True)

3. Clean the column `Reviews` and make it numeric¶

In [15]:

new_r = (
    pd.to_numeric(
    df.loc[df["Reviews"].str.contains("M"), "Reviews"].str.replace("M", ""))* 1000000).astype(str)
new_r

Out[15]:

72      2000000.0
1778    1000000.0
1781    6400000.0
Name: Reviews, dtype: object

In [16]:

df.loc[df["Reviews"].str.contains("M"), "Reviews"] = new_r

In [17]:

df["Reviews"] = pd.to_numeric(df["Reviews"])

4. How many duplicated apps are there?¶

In [19]:

df.duplicated(subset="App", keep=False).sum()

Out[19]:

5. Drop duplicated apps keeping the ones with the greatest number of reviews¶

In [20]:

df.sort_values(by=["App", "Reviews"], inplace=True)

In [21]:

df.drop_duplicates(subset=["App"], keep="last", inplace=True)

6. Format the `Category` column¶

In [23]:

df["Category"] = df["Category"].str.replace("_", " ").str.capitalize()

In [ ]:

7. Clean and convert the `Installs` column to numeric type¶

In [25]:

df.loc[pd.to_numeric(df["Installs"], errors="coerce").isna()].head(2)

Out[25]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver
8884	"i DT" Fútbol. Todos Somos Técnicos.	Sports	4.197727	27.0	3.6M	500+	Free	0	Everyone	Sports	7-Oct-17	0.22	4.1 and up
324	#NAME?	Comics	3.500000	115.0	9.1M	10,000+	Free	0	Mature 17+	Comics	13-Jul-18	5.0.12	5.0 and up

In [26]:

df["Installs"] = pd.to_numeric(df["Installs"].str.replace("+", "").str.replace(",", ""))

8. Clean and convert the `Size` column to numeric (representing bytes)¶

In [32]:

df["Size"] = df["Size"].str.replace("Varies with device", "0")

In [28]:

df.loc[df["Size"].str.contains('k'), "Size"] = (
    pd.to_numeric(
        df.loc[df["Size"].str.contains('k'), "Size"].str.replace("k", "")
        ) * 1024
    ).astype(str)

In [29]:

df.loc[df["Size"].str.contains('M'), "Size"] = (
    pd.to_numeric(
        df.loc[df["Size"].str.contains('M'), "Size"].str.replace("M", "")
        ) * 1024 * 1024
    ).astype(str)

In [33]:

df["Size"] = pd.to_numeric(df["Size"])

9. Clean and convert the `Price` column to numeric¶

In [35]:

df["Price"] = pd.to_numeric(df["Price"].str.replace("Free", "0").str.replace("$", ""))

10. Paid or free?¶

In [37]:

df["Distribution"] = df["Price"].apply(lambda p: "Paid" if p > 0 else "Free")

Analysis¶

11. Which app has the most reviews?¶

In [39]:

df.sort_values(by=["Reviews"], ascending=False).head(2)

Out[39]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
2544	Facebook	Social	4.1	78158306.0	0.0	1000000000	Free	0.0	Teen	Social	3-Aug-18	Varies with device	Varies with device	Free
381	WhatsApp Messenger	Communication	4.4	69119316.0	0.0	1000000000	Free	0.0	Everyone	Communication	3-Aug-18	Varies with device	Varies with device	Free

12. What category has the highest number of apps uploaded to the store?¶

In [40]:

df["Category"].value_counts()

Out[40]:

Category
Family                 1874
Game                    945
Tools                   827
Business                420
Medical                 395
Productivity            374
Personalization         374
Lifestyle               369
Finance                 345
Sports                  325
Communication           315
Health and fitness      288
Photography             281
News and magazines      254
Social                  239
Books and reference     221
Travel and local        219
Shopping                202
Dating                  170
Video players           164
Maps and navigation     131
Food and drink          112
Education               105
Entertainment            86
Auto and vehicles        85
Libraries and demo       83
Weather                  79
House and home           73
Events                   64
Parenting                60
Art and design           60
Comics                   56
Beauty                   53
Name: count, dtype: int64

13. To which category belongs the most expensive app?¶

In [41]:

df.sort_values(by=["Price"], ascending=False).head(2)

Out[41]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
4367	I'm Rich - Trump Edition	Lifestyle	3.6	275.0	7654604.8	10000	Paid	400.00	Everyone	Lifestyle	3-May-18	1.0.1	4.1 and up	Paid
5356	I Am Rich Premium	Finance	4.1	1867.0	4928307.2	50000	Paid	399.99	Everyone	Finance	12-Nov-17	1.6	4.0 and up	Paid

14. What's the name of the most expensive game?¶

In [44]:

df.loc[df["Category"] == "Game"].sort_values(by=["Price"], ascending=False).head(2)

Out[44]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
4203	The World Ends With You	Game	4.6	4108.0	13631488.0	10000	Paid	17.99	Everyone 10+	Arcade	14-Dec-15	1.0.4	4.0 and up	Paid
10782	Trine 2: Complete Story	Game	3.8	252.0	11534336.0	10000	Paid	16.99	Teen	Action	27-Feb-15	2.22	5.0 and up	Paid

15. Which is the most popular Finance App?¶

In [43]:

df.loc[df["Category"] == "Finance"].sort_values(by=["Installs"], ascending=False).head(2)

Out[43]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
5601	Google Pay	Finance	4.2	348132.0	0.0	100000000	Free	0.0	Everyone	Finance	26-Jul-18	2.70.206190089	Varies with device	Free
1156	PayPal	Finance	4.3	659760.0	49283072.0	50000000	Free	0.0	Everyone	Finance	18-Jul-18	6.28.0	4.4 and up	Free

16. What Teen Game has the most reviews?¶

In [47]:

df.query("Category == 'Game' and `Content Rating` == 'Teen'").sort_values(by=["Reviews"], ascending=False).head(2)

Out[47]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
3912	Asphalt 8: Airborne	Game	4.5	8389714.0	96468992.0	100000000	Free	0.0	Teen	Racing	4-Jul-18	3.7.1a	4.0.3 and up	Free
5417	Mobile Legends: Bang Bang	Game	4.4	8219586.0	103809024.0	100000000	Free	0.0	Teen	Action	24-Jul-18	1.2.97.3042	4.0.3 and up	Free

17. Which is the free game with the most reviews?¶

In [48]:

df.query("Distribution == 'Free' and Category == 'Game'").sort_values(by="Reviews", ascending=False).head(2)

Out[48]:

	App	Category	Rating	Reviews	Size	Installs	Type	Price	Content Rating	Genres	Last Updated	Current Ver	Android Ver	Distribution
1879	Clash of Clans	Game	4.6	44893888.0	102760448.0	100000000	Free	0.0	Everyone 10+	Strategy	15-Jul-18	10.322.16	4.1 and up	Free
1917	Subway Surfers	Game	4.5	27725352.0	79691776.0	1000000000	Free	0.0	Everyone 10+	Arcade	12-Jul-18	1.90.0	4.1 and up	Free

18. How many Tb (tebibytes) were transferred (overall) for the most popular Lifestyle app?¶

In [52]:

app = df.query("Category == 'Lifestyle'").sort_values(by=["Installs"], ascending=False).iloc[0]
app

Out[52]:

App                   Tinder
Category           Lifestyle
Rating                   4.0
Reviews            2789775.0
Size              71303168.0
Installs           100000000
Type                    Free
Price                    0.0
Content Rating    Mature 17+
Genres             Lifestyle
Last Updated        2-Aug-18
Current Ver            9.5.0
Android Ver       4.4 and up
Distribution            Free
Name: 4587, dtype: object

In [53]:

(app["Installs"] * app["Size"]) / (1024 * 1024 * 1024 * 1024)

Out[53]:

6484.9853515625

In [ ]:

Statement of Completion#d662984e

Data Cleaning with Pandas

Capstone Project: Cleaning Google Playstore data

Data Cleaning¶

1. Which of the following column(s) has/have null values?¶

2. Clean the `Rating` column and the other columns containing null values¶

3. Clean the column `Reviews` and make it numeric¶

4. How many duplicated apps are there?¶

5. Drop duplicated apps keeping the ones with the greatest number of reviews¶

6. Format the `Category` column¶

7. Clean and convert the `Installs` column to numeric type¶

8. Clean and convert the `Size` column to numeric (representing bytes)¶

9. Clean and convert the `Price` column to numeric¶

10. Paid or free?¶

Analysis¶

11. Which app has the most reviews?¶

12. What category has the highest number of apps uploaded to the store?¶

13. To which category belongs the most expensive app?¶

14. What's the name of the most expensive game?¶

15. Which is the most popular Finance App?¶

16. What Teen Game has the most reviews?¶

17. Which is the free game with the most reviews?¶

18. How many Tb (tebibytes) were transferred (overall) for the most popular Lifestyle app?¶

Statement of Completion#d662984e

Data Cleaning with Pandas

Capstone Project: Cleaning Google Playstore data

Data Cleaning¶

1. Which of the following column(s) has/have null values?¶

2. Clean the Rating column and the other columns containing null values¶

3. Clean the column Reviews and make it numeric¶

4. How many duplicated apps are there?¶

5. Drop duplicated apps keeping the ones with the greatest number of reviews¶

6. Format the Category column¶

7. Clean and convert the Installs column to numeric type¶

8. Clean and convert the Size column to numeric (representing bytes)¶

9. Clean and convert the Price column to numeric¶

10. Paid or free?¶

Analysis¶

11. Which app has the most reviews?¶

12. What category has the highest number of apps uploaded to the store?¶

13. To which category belongs the most expensive app?¶

14. What's the name of the most expensive game?¶

15. Which is the most popular Finance App?¶

16. What Teen Game has the most reviews?¶

17. Which is the free game with the most reviews?¶

18. How many Tb (tebibytes) were transferred (overall) for the most popular Lifestyle app?¶

2. Clean the `Rating` column and the other columns containing null values¶

3. Clean the column `Reviews` and make it numeric¶

6. Format the `Category` column¶

7. Clean and convert the `Installs` column to numeric type¶

8. Clean and convert the `Size` column to numeric (representing bytes)¶

9. Clean and convert the `Price` column to numeric¶