Statement of Completion#632909ed
Data Wrangling with Pandas
easy
Practice GroupBy operations with Video Games Sales data
Resolution
Activities
Project.ipynb
Importing the libraries¶
In [1]:
#to read the dataframe
import pandas as pd
Reading the dataset as a dataframe¶
In [2]:
# Reading dataframes
games_sales_df = pd.read_csv('vgsales.csv')
In [3]:
games_sales_df.head()
Out[3]:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
Discovering the dataset¶
In [4]:
#show the number of cells in the dataframe
print("video games dataframe size: ", games_sales_df.size)
#show the number of records (rows) in the dataframe
print("number of games: ", len(games_sales_df))
video games dataframe size: 182578 number of games: 16598
In [5]:
games_sales_df.dtypes
Out[5]:
Rank int64 Name object Platform object Year float64 Genre object Publisher object NA_Sales float64 EU_Sales float64 JP_Sales float64 Other_Sales float64 Global_Sales float64 dtype: object
Activities¶
Let's enjoy wrangling video games data.
1. Calculate the total global sales for each Platform
.¶
In [15]:
total_global_sales_by_platform = games_sales_df.groupby('Platform')['Global_Sales'].sum()
total_global_sales_by_platform
Out[15]:
Platform 2600 97.08 3DO 0.10 3DS 247.46 DC 15.97 DS 822.49 GB 255.45 GBA 318.50 GC 199.36 GEN 28.36 GG 0.04 N64 218.88 NES 251.07 NG 1.44 PC 258.82 PCFX 0.03 PS 730.66 PS2 1255.64 PS3 957.84 PS4 278.10 PSP 296.28 PSV 61.93 SAT 33.59 SCD 1.87 SNES 200.05 TG16 0.16 WS 1.42 Wii 926.71 WiiU 81.86 X360 979.96 XB 258.26 XOne 141.06 Name: Global_Sales, dtype: float64
2. Find the top-selling genre based on global sales.¶
In [14]:
top_selling_genre = games_sales_df.groupby('Genre')['Global_Sales'].max()
top_selling_genre
Out[14]:
Genre Action 21.40 Adventure 11.18 Fighting 13.04 Misc 29.02 Platform 40.24 Puzzle 30.26 Racing 35.82 Role-Playing 31.37 Shooter 28.31 Simulation 24.76 Sports 82.74 Strategy 5.45 Name: Global_Sales, dtype: float64
3. Create a bar chart to visualize the sum of Global_Sales
per Genre
.¶
In [17]:
global_sales_chart = games_sales_df.groupby('Genre')['Global_Sales'].sum().sort_values(ascending=False).plot(kind='bar')
4. Calculate the average global sales for each Publisher
.¶
In [25]:
avg_sales_per_publisher = games_sales_df.groupby('Publisher')['Global_Sales'].mean().reset_index()
avg_sales_per_publisher
Out[25]:
Publisher | Global_Sales | |
---|---|---|
0 | 10TACLE Studios | 0.036667 |
1 | 1C Company | 0.033333 |
2 | 20th Century Fox Video Games | 0.388000 |
3 | 2D Boy | 0.040000 |
4 | 3DO | 0.281111 |
... | ... | ... |
573 | id Software | 0.030000 |
574 | imageepoch Inc. | 0.020000 |
575 | inXile Entertainment | 0.100000 |
576 | mixi, Inc | 0.860000 |
577 | responDESIGN | 0.065000 |
578 rows × 2 columns
5. Calculate the total EU_Sales
for each platform-year combination.¶
In [30]:
platform_year_eu_sales = games_sales_df.groupby(['Platform','Year'])['EU_Sales'].sum()
platform_year_eu_sales
Out[30]:
Platform Year 2600 1980.0 0.67 1981.0 1.96 1982.0 1.65 1983.0 0.34 1984.0 0.01 ... XB 2008.0 0.04 XOne 2013.0 5.29 2014.0 17.75 2015.0 17.98 2016.0 4.63 Name: EU_Sales, Length: 241, dtype: float64
6. Find the number of publishers in each Platform
.¶
In [33]:
no_publisher_per_platform = games_sales_df.groupby('Platform')['Publisher'].apply(lambda x: x.nunique()).reset_index(name='Publishers_Count')
no_publisher_per_platform
Out[33]:
Platform | Publishers_Count | |
---|---|---|
0 | 2600 | 26 |
1 | 3DO | 3 |
2 | 3DS | 81 |
3 | DC | 15 |
4 | DS | 175 |
5 | GB | 17 |
6 | GBA | 87 |
7 | GC | 52 |
8 | GEN | 7 |
9 | GG | 1 |
10 | N64 | 54 |
11 | NES | 12 |
12 | NG | 3 |
13 | PC | 130 |
14 | PCFX | 1 |
15 | PS | 151 |
16 | PS2 | 172 |
17 | PS3 | 104 |
18 | PS4 | 67 |
19 | PSP | 127 |
20 | PSV | 67 |
21 | SAT | 44 |
22 | SCD | 1 |
23 | SNES | 50 |
24 | TG16 | 2 |
25 | WS | 2 |
26 | Wii | 113 |
27 | WiiU | 23 |
28 | X360 | 101 |
29 | XB | 73 |
30 | XOne | 40 |
7. Find the publisher with the most occurences for each Platform
.¶
In [41]:
platform_publisher_count = games_sales_df.groupby(['Platform', 'Publisher']).size()
most_frequent_publisher_platform = platform_publisher_count.groupby('Platform').apply(lambda x: x.index[x == x.max()].min()[1]).reset_index(name='Publisher')
most_frequent_publisher_platform
Out[41]:
Platform | Publisher | |
---|---|---|
0 | 2600 | Atari |
1 | 3DO | Imageworks |
2 | 3DS | Namco Bandai Games |
3 | DC | Sega |
4 | DS | Ubisoft |
5 | GB | Nintendo |
6 | GBA | THQ |
7 | GC | Electronic Arts |
8 | GEN | Sega |
9 | GG | Sega |
10 | N64 | Nintendo |
11 | NES | Nintendo |
12 | NG | SNK |
13 | PC | Electronic Arts |
14 | PCFX | NEC |
15 | PS | Sony Computer Entertainment |
16 | PS2 | Sony Computer Entertainment |
17 | PS3 | Electronic Arts |
18 | PS4 | Namco Bandai Games |
19 | PSP | Namco Bandai Games |
20 | PSV | Namco Bandai Games |
21 | SAT | Sega |
22 | SCD | Sega |
23 | SNES | Nintendo |
24 | TG16 | Hudson Soft |
25 | WS | Namco Bandai Games |
26 | Wii | Ubisoft |
27 | WiiU | Nintendo |
28 | X360 | Electronic Arts |
29 | XB | Electronic Arts |
30 | XOne | Electronic Arts |
8. Find the publisher with the highest total global sales for each Year
.¶
In [45]:
yearly_publisher_sales = games_sales_df.groupby(['Year', 'Publisher'])['Global_Sales'].sum()
top_publisher_year = yearly_publisher_sales.groupby('Year').idxmax().apply(lambda x: x[1]).reset_index(name='Top_Selling_Publisher')
top_publisher_year['Global_Sales_Sum'] = top_publisher_year.apply(lambda row: yearly_publisher_sales.loc[row['Year'], row['Top_Selling_Publisher']], axis=1)
9. Find the maximum sales year for each Genre
.¶
In [47]:
# Grouping the game sales DataFrame by 'Genre' and 'Year' and summing the 'Global_Sales' for each group.
# Reset the index to convert the grouped indices into columns.
genre_year_sales = games_sales_df.groupby(['Genre', 'Year'])['Global_Sales'].sum().reset_index()
# For each genre, find the index of the year with the maximum 'Global_Sales'.
# This helps in identifying the year of peak sales for each genre.
index_of_max_sales_per_genre = genre_year_sales.groupby('Genre')['Global_Sales'].idxmax()
# Extract the rows corresponding to the maximum sales year for each genre using the indices found.
# This DataFrame will contain the 'Genre', 'Year', and 'Global_Sales' of the year with the highest sales per genre.
max_sales_year_per_genre = genre_year_sales.loc[index_of_max_sales_per_genre]
10. Compute descriptive statistics for each Genre
using its Global_Sales
.¶
In [49]:
descriptive_genres = games_sales_df.groupby('Genre')['Global_Sales'].describe()
11. Calculate the total JP_Sales
and average Other_Sales
for each Platform
.¶
In [51]:
jp_other_sales_platform = games_sales_df.groupby('Platform').agg({'JP_Sales': 'sum', 'Other_Sales': 'mean'})
12. Calculate the range (max-min) of NA_Sales
for each Publisher
.¶
In [53]:
def sales_range(x):
return x.max() - x.min()
max_min_diff_NAsales_per_publisher = games_sales_df.groupby('Publisher')['NA_Sales'].agg(sales_range)
13. Calculate the percentage of Global_Sales
contribution by each Platform
.¶
In [56]:
sales_by_platform =games_sales_df.groupby('Platform')['Global_Sales'].sum()
sales_percentage_by_platform = sales_by_platform.apply(lambda x: (x/sales_by_platform.sum())*100).reset_index(name='sales_percentage')
14. Which platforms for playing video games have been the most popular each year?¶
In [58]:
popular_platform_per_year = games_sales_df.groupby('Year')['Platform'].apply(lambda x: x.value_counts().idxmax())
15. Which genres have been more or less popular with the passing of time?¶
In [60]:
genre_popularity_over_time = games_sales_df.groupby('Year')['Genre'].value_counts().unstack(fill_value=0)
16. Calculate the z-score normalization for Global_Sales
for each Genre
.¶
In [63]:
games_sales_df['Global_Sales_Normalized'] = games_sales_df.groupby('Genre')['Global_Sales'].transform(lambda x: (x - x.mean()) / x.std())
In [ ]: