Statement of Completion#2b21bc89
Intro to Pandas for Data Analysis
medium
Practice Data Filtering & Sorting with Hollywood Movie Data
Resolution
Activities
Practice Data Filtering & Sorting with Hollywood Movie Data¶
Explore¶
Let's explore our dataset
in our lab
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv('data/movies_dataset.csv', low_memory=False)
In [3]:
df.head()
Out[3]:
title | adult | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | runtime | spoken_languages | status | tagline | video | vote_average | vote_count | production_companies_name | production_companies_id | production_countries_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GoldenEye | True | 58000000 | Adventure, Action, Thriller | http://www.mgm.com/view/movie/757/Goldeneye/ | 710 | tt0113189 | English | GoldenEye | James Bond must unmask the mysterious head of ... | ... | 130.0 | English, Pусский, Español | Released | No limits. No fears. No substitutes. | False | 6.6 | 1194.0 | United Artists, Eon Productions | 60, 7576 | United Kingdom, United States of America |
1 | Friday | True | 3500000 | Comedy | http://www.newline.com/properties/friday.html | 10634 | tt0113118 | English | Friday | Craig and Smokey are two guys in Los Angeles h... | ... | 91.0 | English | Released | A lot can go down between thursday and saturda... | False | 7.0 | 513.0 | New Line Cinema | 12 | United States of America |
2 | From Dusk Till Dawn | True | 19000000 | Horror, Action, Thriller, Crime | http://www.miramax.com/movie/from-dusk-till-dawn/ | 755 | tt0116367 | English | From Dusk Till Dawn | Seth Gecko and his younger brother Richard are... | ... | 108.0 | English, Español | Released | One night is all that stands between them and ... | False | 6.9 | 1644.0 | A Band Apart, Dimension Films, Los Hooligans P... | 59, 7405, 11705, 53009 | United States of America |
3 | Blue in the Face | True | 2000000 | Comedy | http://miramax.com/movie/blue-in-the-face/ | 5894 | tt0112541 | English | Blue in the Face | Auggie runs a small tobacco shop in Brooklyn, ... | ... | 83.0 | English | Released | Welcome to the planet Brooklyn. | False | 6.8 | 28.0 | Miramax Films, InterAL | 14, 30334 | United States of America |
4 | Mighty Morphin Power Rangers: The Movie | True | 15000000 | Action, Adventure, Science Fiction, Family, Fa... | http://www.powerrangers.com/ | 9070 | tt0113820 | English | Mighty Morphin Power Rangers: The Movie | Power up with six incredible teens who out-man... | ... | 92.0 | English | Released | The Power Is On! | False | 5.2 | 153.0 | Twentieth Century Fox Film Corporation, Saban ... | 306, 2106, 9255 | Japan, United States of America |
5 rows × 24 columns
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 693 entries, 0 to 692 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 title 693 non-null object 1 adult 693 non-null bool 2 budget 693 non-null int64 3 genres 693 non-null object 4 homepage 693 non-null object 5 id 693 non-null int64 6 imdb_id 693 non-null object 7 original_language 693 non-null object 8 original_title 693 non-null object 9 overview 693 non-null object 10 popularity 693 non-null float64 11 poster_path 693 non-null object 12 release_date 693 non-null object 13 revenue 693 non-null float64 14 runtime 693 non-null float64 15 spoken_languages 691 non-null object 16 status 693 non-null object 17 tagline 693 non-null object 18 video 693 non-null bool 19 vote_average 693 non-null float64 20 vote_count 693 non-null float64 21 production_companies_name 662 non-null object 22 production_companies_id 662 non-null object 23 production_countries_name 682 non-null object dtypes: bool(2), float64(5), int64(2), object(15) memory usage: 120.6+ KB
1) Find the movie with the maximum runtime
.¶
In [10]:
df['runtime'].max()
Out[10]:
476.0
2) What is the earliest Release Date?¶
In [7]:
df['release_date'].min()
Out[7]:
'1940-11-13'
3) What's the highest value in the vote_count
column?¶
In [11]:
df['vote_count'].max()
Out[11]:
12269.0
4) Select the first 5 movies¶
In [17]:
first_5_movies = df.iloc[:5]
5) What is the title
of the 10th movie?¶
In [19]:
df.iloc[9]
Out[19]:
title Jurassic Park adult True budget 63000000 genres Adventure, Science Fiction homepage http://www.jurassicpark.com/ id 329 imdb_id tt0107290 original_language English original_title Jurassic Park overview A wealthy entrepreneur secretly creates a them... popularity 8.863776 poster_path /c414cDeQ9b6qLPLeKmiJuLDUREJ.jpg release_date 1993-06-11 revenue 920100000.0 runtime 127.0 spoken_languages English, Español status Released tagline An adventure 65 million years in the making. video False vote_average 7.6 vote_count 4956.0 production_companies_name Universal Pictures, Amblin Entertainment production_companies_id 33, 56 production_countries_name United States of America Name: 9, dtype: object
6) Select the title
and release_date
of the movies in positions 2nd, 6th and 11th¶
In [22]:
# Select specific rows and columns from the DataFrame 'df'
selected_movies = df.iloc[[1,5,10],[0,12]]
7) Sort the df
by release_date
in ascending order.¶
In [24]:
# Sort the DataFrame 'df' by the 'release_date' column in ascending order
df = df.sort_values('release_date')
8) Select movies in English and sort them by Vote Average¶
In [26]:
# Get English movies sorted by highest rating
df_english_movies = df.loc[df['original_language']=='English'].sort_values('vote_average',ascending=False)
9) Select movies or series with long runtime and high rating¶
In [30]:
# Get movies with long runtime and high rating
df_runtime_above_120 = df.loc[(df['runtime'] > 120) & (df['vote_average'] > 7)]
10) Sort by genres
(asc) and revenue
(desc)¶
In [32]:
# Sort the df by 'genres' in ascending order and 'revenue' in descending order
df_genres_revenue = df.sort_values(['genres','revenue'],ascending=[True,False])
11) Select movies or series that have the Action
genre and sort them by voting average and revenue¶
In [36]:
# Filter the DataFrame to include only movies with 'Action' genre
df_action = df.loc[df['genres'].str.contains('Action')].sort_values(['vote_average','revenue'],ascending=[False,True])
12) Select movies or series with a budget > 50,000,000
and runtime > 120 minutes
¶
Let's visualize the budget
column. We can see the green line, which represents the median
(50th percentile)
of the data. The individual circles above the top whisker represent outliers, which are data points that lie more than 1.5
times the interquartile range (IQR)
from the 75th
percentile.
In [38]:
df['budget'].plot(kind='box')
Out[38]:
<Axes: >
In [39]:
# Create a new DataFrame 'df_bug_run' with filtered rows and selected columns
df_budget_runtime = df.loc[(df['budget'] > 50000000) & (df['runtime'] > 120),('original_title','budget','runtime')]
13) Find movies/series that contain the Action
and Adventure
genres with a high vote average, and sort them by vote average (desc) and revenue (asc)¶
In [46]:
# Filter the df with 'Action' or 'Adventure' genre and a vote_average greater than 7 then sort the df by 'vote_average' in descending order and 'revenue' in ascending order
df.loc[df['genres'].str.contains('Action|Adventure')].sort_values(['vote_average','revenue'],ascending=[False,True])
Out[46]:
title | adult | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | runtime | spoken_languages | status | tagline | video | vote_average | vote_count | production_companies_name | production_companies_id | production_countries_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
237 | The Dark Knight | True | 185000000 | Drama, Action, Crime, Thriller | http://thedarkknight.warnerbros.com/dvdsite/ | 155 | tt0468569 | English | The Dark Knight | Batman raises the stakes in his war on crime. ... | ... | 152.0 | English, 普通话 | Released | Why So Serious? | False | 8.3 | 12269.0 | DC Comics, Legendary Pictures, Warner Bros., D... | 429, 923, 6194, 9993, 9996 | United Kingdom, United States of America |
27 | The Empire Strikes Back | True | 18000000 | Adventure, Action, Science Fiction | http://www.starwars.com/films/star-wars-episod... | 1891 | tt0080684 | English | The Empire Strikes Back | The epic saga continues as Luke Skywalker, in ... | ... | 124.0 | English | Released | The Adventure Continues... | False | 8.2 | 5998.0 | Lucasfilm, Twentieth Century Fox Film Corporation | 1, 306 | United States of America |
6 | Star Wars | True | 11000000 | Adventure, Action, Science Fiction | http://www.starwars.com/films/star-wars-episod... | 11 | tt0076759 | English | Star Wars | Princess Leia is captured and held hostage by ... | ... | 121.0 | English | Released | A long time ago in a galaxy far, far away... | False | 8.1 | 6778.0 | Lucasfilm, Twentieth Century Fox Film Corporation | 1, 306 | United States of America |
132 | The Lord of the Rings: The Return of the King | True | 94000000 | Adventure, Fantasy, Action | http://www.lordoftherings.net | 122 | tt0167260 | English | The Lord of the Rings: The Return of the King | Aragorn is revealed as the heir to the ancient... | ... | 201.0 | English | Released | The eye of the enemy is moving. | False | 8.1 | 8226.0 | WingNut Films, New Line Cinema | 11, 12 | New Zealand, United States of America |
33 | Back to the Future | True | 19000000 | Adventure, Comedy, Science Fiction, Family | http://www.backtothefuture.com/movies/backtoth... | 105 | tt0088763 | English | Back to the Future | Eighties teenager Marty McFly is accidentally ... | ... | 116.0 | English | Released | He's the only kid ever to get into trouble bef... | False | 8.0 | 6239.0 | Universal Pictures, Amblin Entertainment, U-Dr... | 33, 56, 20448 | United States of America |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
171 | BloodRayne | True | 25000000 | Action, Adventure, Fantasy, Horror | http://www.bloodrayne-themovie.com/main/index.... | 168705 | tt0383222 | English | BloodRayne | In eighteenth century Romania, Rayne, a dhampi... | ... | 95.0 | English, Português | Released | Revenge never tasted so sweet. | False | 3.6 | 120.0 | Boll KG | 15127 | United States of America |
655 | Birdemic 2: The Resurrection | True | 0 | Action, Comedy, Horror, Science Fiction | https://chill.com/james_nguyen/birdemic2 | 188489 | tt1674047 | English | Birdemic 2: The Resurrection | A platoon of eagles and vultures attack Hollyw... | ... | 82.0 | English | Released | you asked for it | False | 3.2 | 20.0 | I Got a Fish Productions, Moviehead Pictures | 23389, 23390 | United States of America |
624 | Mega Shark vs. Mecha Shark | True | 0 | Action, Science Fiction, Thriller | http://www.theasylum.cc/product.php?id=242 | 246594 | tt3152098 | English | Mega Shark vs. Mecha Shark | When another Mega Shark returns from the depth... | ... | 90.0 | English | Released | Monster vs Metal | False | 3.1 | 29.0 | Asylum, The | 3991 | United States of America |
126 | House of the Dead | True | 7000000 | Horror, Action, Thriller | http://www.house-of-the-dead.com/ | 11059 | tt0317676 | English | House of the Dead | Set on an island off the coast, a techno rave ... | ... | 90.0 | English | Released | You won't last the night. | False | 2.8 | 106.0 | Boll Kino Beteiligungs GmbH & Co. KG, Brightli... | 1062, 1063, 1064, 8068 | Canada, Germany, United States of America |
417 | Mega Shark vs. Crocosaurus | True | 0 | Action, Horror, Science Fiction | http://www.theasylum.cc/product.php?id=180 | 52454 | tt1705773 | English | Mega Shark vs. Crocosaurus | When the prehistoric warm-water beast the Croc... | ... | 90.0 | English, Français, Español | Released | Whoever wins... we lose! | False | 2.7 | 39.0 | The Asylum | 1311 | United States of America |
381 rows × 24 columns
14) Select movies or series that have Star Wars
in the title
and sort them by release_date
(asc) and revenue
(desc).¶
In [47]:
# Filter the df with 'Star Wars' in the original title and sort them by release_date' in ascending order and 'revenue' in descending order
df_star_wars = df.loc[df['title'].str.contains('Star Wars')].sort_values(['release_date','revenue'],ascending=[True,False])
15) Select original_title
, budget
, revenue
, and vote_average
of the rows with a budget greater than 100M.¶
In [49]:
# Filter DataFrame for high-budget movies and select relevant columns
df_high_budget = df.loc[df['budget'] > 100000000,('original_title','budget','revenue','vote_average')]
16) Find the movies with a runtime greater than 120 and a vote average greater than 7. Sort them in descending order.¶
Run the cell below to see the histogram, which shows a right-skewed distribution of movie runtimes. The highest frequency is around 130 minutes. This indicates a predominance of movies around 2 hours long. Very short or very long movies are relatively rare.
In [51]:
df['runtime'].plot(kind='hist', bins=20, figsize=(10, 6), title='Distribution of Movie Runtimes');
In [52]:
df_run_vote = df.loc[(df['runtime'] > 120) & (df['vote_average'] > 7)].sort_values('runtime',ascending=False)
17) Find the top 10 movies or series with highest voting count and a voting average greater than 7 and select original_title
, vote_count
, and vote_average
columns.¶
In [56]:
df_top_10 = df.loc[(df['vote_count'] > 1000) & (df['vote_average'] > 7),('original_title','vote_count','vote_average')].head(10)
18) Find the movies or series that have a vote_average
above 8 or have made more than 500,000,000 in revenue
.¶
In [58]:
# Filter the DataFrame to include movies with vote_average > 8 or revenue > $500,000,000
df_high = df.loc[(df['vote_average'] > 8) | (df['revenue'] > 500000000)]
19) Enter the release date of the movie that is not in English
.¶
Run the cell below to see the bar plot, there are many movies and series in the English language, but few are non-English, including Hindi, Japanese, Chinese, etc.
In [ ]:
df['original_language'].value_counts().plot(kind='bar', title='Count of Movies by Language', figsize=(10, 5));
In [62]:
df.loc[df['original_language'] != 'English','release_date'].min()
Out[62]:
'1961-01-23'
20) Select movies with runtime > 150
, vote_average > 7
, popularity > 20
, and sort them by descending order.¶
In [63]:
df_long_movies = df.loc[(df['runtime'] > 150) & (df['vote_average'] > 7) & (df['popularity'] > 20)].sort_values('vote_average',ascending=False)
21) Find movies in the Action, Adventure, or Science Fiction genres and sort them by descending order¶
In [65]:
df_action_sci_adv = df.loc[df['genres'].str.contains('Action|Adventure|Science Fiction')].sort_values('runtime',ascending=False)
22) Find movies with a budget under 20M and revenue over 100M and sort them by descending order.¶
In [67]:
df_budget_revenue = df.loc[(df['budget'] < 20000000) & (df['revenue'] > 100000000)].sort_values('revenue',ascending=False)
In [ ]: