Statement of Completion#7fde81d0
Intro to Pandas for Data Analysis
medium
Explore, Filter, Sort and Analyse Cricket Sports Data Using Pandas
Resolution
Activities
Pitch Perfect Analysis: Cricket Stats with Pandas¶
Setting the Stage¶
In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("preprocessed_data.csv")
In [2]:
# Take a peek at the data
df.head()
Out[2]:
Year | Player_Name | Matches_Batted | Not_Outs | Runs_Scored | Highest_Score | Batting_Average | Balls_Faced | Batting_Strike_Rate | Centuries | ... | Matches_Bowled | Balls_Bowled | Runs_Conceded | Wickets_Taken | Best_Bowling_Match | Bowling_Average | Economy_Rate | Bowling_Strike_Rate | Four_Wicket_Hauls | Five_Wicket_Hauls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2024 | Ruturaj Gaikwad | 2 | 0 | 61 | 46.0 | 30.50 | 51 | 119.61 | 0 | ... | 2 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
1 | 2023 | Ruturaj Gaikwad | 16 | 1 | 590 | 92.0 | 42.14 | 400 | 147.50 | 0 | ... | 16 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
2 | 2022 | Ruturaj Gaikwad | 14 | 0 | 368 | 99.0 | 26.29 | 291 | 126.46 | 0 | ... | 14 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
3 | 2021 | Ruturaj Gaikwad | 16 | 2 | 635 | NaN | 45.35 | 466 | 136.26 | 1 | ... | 16 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
4 | 2020 | Ruturaj Gaikwad | 6 | 2 | 204 | 72.0 | 51.00 | 169 | 120.71 | 0 | ... | 6 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
5 rows × 25 columns
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1130 entries, 0 to 1129 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 1130 non-null int64 1 Player_Name 1130 non-null object 2 Matches_Batted 1130 non-null int64 3 Not_Outs 1130 non-null int64 4 Runs_Scored 1130 non-null int64 5 Highest_Score 700 non-null float64 6 Batting_Average 1130 non-null float64 7 Balls_Faced 1130 non-null int64 8 Batting_Strike_Rate 1130 non-null float64 9 Centuries 1130 non-null int64 10 Half_Centuries 1130 non-null int64 11 Fours 1130 non-null int64 12 Sixes 1130 non-null int64 13 Catches_Taken 1130 non-null int64 14 Stumpings 1130 non-null int64 15 Matches_Bowled 1130 non-null int64 16 Balls_Bowled 1130 non-null int64 17 Runs_Conceded 1130 non-null int64 18 Wickets_Taken 1130 non-null int64 19 Best_Bowling_Match 419 non-null float64 20 Bowling_Average 1130 non-null float64 21 Economy_Rate 1130 non-null float64 22 Bowling_Strike_Rate 1130 non-null float64 23 Four_Wicket_Hauls 1130 non-null int64 24 Five_Wicket_Hauls 1130 non-null int64 dtypes: float64(7), int64(17), object(1) memory usage: 220.8+ KB
In [4]:
#df.describe()
Warm Up Activities¶
In [6]:
# Run the cell below and find out which players we'll be expecting to see 🙌🏻
In [7]:
len(df['Player_Name'].unique())
Out[7]:
206
1. Know the players!¶
In [ ]:
# Write your code here
2. Find the Maximum Runs Scored¶
In [8]:
df['Runs_Scored'].max()
Out[8]:
973
3. How old is the dataset?¶
In [11]:
len(df['Year'].unique())
Out[11]:
17
4. What is the maximum Batting Strike Rate value?¶
In [12]:
df['Batting_Strike_Rate'].max()
Out[12]:
400.0
Basic Filtering with loc[]
and Sorting with sort_values()
¶
5. Let's filter out players who played in the 2023!¶
In [13]:
players_2023 = df.loc[df['Year']==2023]
6. Top 5 Players with Max Not Outs¶
In [17]:
top_not_outs = df.sort_values('Not_Outs',ascending=False).head(5)
Just for exploration!¶
Analyzing High Scorers of the Season¶
Before we dive into the details of players scoring the most runs in 2023, let’s take a broader look at the run distribution over the years.
A histogram will help us understand in which years players scored more freely, potentially reflecting changes in playing conditions or rules.
In [ ]:
# Run the below cell to see the high run scorers of the season!
In [19]:
# Histogram of runs scored by year
plt.figure(figsize=(10, 6))
df.groupby('Year')['Runs_Scored'].sum().plot(kind='bar', color='lightcoral')
plt.title('Total Runs Scored Per Year')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
7. Top 5 Run Scorers of 2023¶
In [26]:
top_scorers_2023 = players_2023.sort_values('Runs_Scored',ascending=False).head(5)
8. Selecting Players with High Batting Average¶
In [28]:
high_average_players = df.loc[df['Batting_Average'] > 50]
Just for Exploration¶
The year 2023 was a real treat for cricket fans – it saw the highest number of centuries! Was it the pitches, the bats, or just plain skill?
Let’s check out a histogram of centuries made over the years and see how 2023 stood out
In [ ]:
# Run the below cell to find out the maximum number of centuries hit each season!
In [30]:
# Histogram of centuries by year
import matplotlib.pyplot as plt
centuries_per_year = df.groupby('Year')['Centuries'].sum()
plt.figure(figsize=(10, 6))
centuries_per_year.plot(kind='bar', color='orange')
plt.title('Centuries Recorded Per Year')
plt.xlabel('Year')
plt.ylabel('Total Centuries')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Multi Condition Filtering¶
9. Compound Filtering for Centuries and Wickets¶
In [32]:
century_and_wickets = df.loc[(df['Centuries'] >= 1) & (df['Wickets_Taken'] >= 3)]
10. Recent High Scores¶
In [34]:
high_scores_2021 = df.loc[(df['Year'] == 2021) & (df['Runs_Scored'] > 50)]
11. Filter out the Efficient Strikers from the dataset¶
The distribution of batting strike rates is heavily concentrated around the median of approximately 125. However, there are exceptional instances where the batting strike rates far exceed the average, reaching as high as 350 and beyond. These outliers represent the top performers, indicating a highly aggressive batting style.
Here is the distribution of batting strike rates, with a reference line at the value 250, highlighting these extraordinary performances
In [ ]:
# Run the below cell for distribution
In [36]:
# Boxplot of batting strike rates
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Batting_Strike_Rate'])
plt.title('Boxplot of Batting Strike Rates')
plt.axvline(x=250, color='red', linestyle='--', label='Reference Line')
plt.xlabel('Batting Strike Rate')
plt.show()
In [37]:
efficient_strikers = df.loc[(df['Batting_Strike_Rate'] > 150) & (df['Runs_Scored'] > 200)]
12. Filtering with Multiple Conditions¶
In [39]:
specific_players = df.loc[(df['Runs_Scored'] > 300) & (df['Wickets_Taken'] == 0)]
13. Who was the First Player to Achieve a Five Wicket Haul?!¶
In [48]:
first_five_wicket_haul = df.loc[df['Five_Wicket_Hauls'] == 1].nsmallest(1,'Year')
Query Activities¶
14. Query for Five Wicket Hauls¶
In [49]:
five_wicket_hauls = df.query('Five_Wicket_Hauls == 1')
By the way , comment down below which is your fav IPL team this season!¶
In [ ]:
15. Filtering with Query¶
In [51]:
active_high_scorers = df.query('Matches_Batted > 15 & Runs_Scored > 500')
Multi Condition Sorting¶
16. Who is the Titan ?¶
In [55]:
sorted_by_runs_and_score = df.sort_values(['Runs_Scored','Highest_Score'],ascending=False)
17. How well do you understand the impact of economy rate changes over the years?¶
In [ ]:
Understanding Economy Rate¶
The economy rate is a key bowling metric that calculates the average runs conceded per over bowled—the lower the economy rate, the better the bowling!
From the below boxplot, it's clear that most bowlers consistently maintained a moderate economy rate over the years, with recent seasons like 2023 and 2024 showing a few instances of unusually high rates, hinting at some challenging games for the bowlers.
In [ ]:
# Run the below cell for understanding the 'Economy rate' over the seasons
In [57]:
df_clean = df.dropna(subset=['Economy_Rate'])
# Create a boxplot of economy rates across years
plt.figure(figsize=(14, 8))
sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
plt.title('Boxplot of Economy Rates by Year')
plt.xlabel('Year')
plt.ylabel('Economy Rate')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
/tmp/ipykernel_18/1520780003.py:5: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
18. Sorting by Economy Rate for 2022¶
In [61]:
sorted_economy_2022 = df.loc[df['Year']==2022].sort_values('Economy_Rate')
Filtering using : isin()
, nsmallest()
, and nlargest()
¶
19. Filtering Players by Specific Match Conditions¶
In [63]:
years = [2021, 2022]
players_recent_years = df.loc[df['Year'].isin(years)]
Just for exploration¶
Let's plot the averages before we focus on players with extraordinary averages.
This line graph will help us see trends and anomalies in batting performances.
In [ ]:
# Run the below cell for understanding the 'Batting Averages' over the years
In [65]:
# Line plot of batting averages over the years
average_by_year = df.groupby('Year')['Batting_Average'].mean()
plt.figure(figsize=(10, 6))
average_by_year.plot(kind='line', marker='o', linestyle='-', color='navy')
plt.title('Average Batting Averages Over the Years')
plt.xlabel('Year')
plt.ylabel('Batting Average')
plt.grid(True)
plt.show()
20. Analyze Rohit Sharma's Matches Across Different Years¶
In [ ]:
In [81]:
rohit_matches = df.loc[df['Player_Name']=='Rohit Sharma',('Year','Matches_Batted')].sort_values('Year')
rohit_matches
Out[81]:
Year | Matches_Batted | |
---|---|---|
577 | 2008 | 13 |
576 | 2009 | 16 |
575 | 2010 | 16 |
574 | 2011 | 16 |
573 | 2012 | 17 |
572 | 2013 | 19 |
571 | 2014 | 15 |
570 | 2015 | 16 |
569 | 2016 | 14 |
568 | 2017 | 17 |
567 | 2018 | 14 |
566 | 2019 | 15 |
565 | 2020 | 12 |
564 | 2021 | 13 |
563 | 2022 | 14 |
562 | 2023 | 16 |
561 | 2024 | 1 |
Just for Exploration¶
Let’s use a violin plot to visualize the distribution of wickets taken across all matches, emphasizing performances in games where bowlers took more than five wickets.
The widest part of the plot indicates that most wickets taken fall around the 5 to 10 range, suggesting this is the most common performance for bowlers in significant matches.
The plot's long, thin tails extending towards the higher wicket counts, such as 20 and above, indicate these are rarer, exceptional performances.
In [ ]:
# Run the below cell
In [83]:
# Filter for high-performance games
high_wicket_games = df[df['Wickets_Taken'] >= 5]
# Violin plot for distribution of wickets in high-performance games
plt.figure(figsize=(12, 6))
sns.violinplot(x='Wickets_Taken', data=high_wicket_games)
plt.title('Distribution of Wickets Taken in High-Performance Games')
plt.xlabel('Wickets Taken')
plt.grid(True)
plt.show()
21. Who is the Purple Cap Winner of 2022?!¶
In [88]:
max_wickets_2022 = df.loc[df['Year'] == 2022].sort_values('Wickets_Taken',ascending=False).head(1)
22. Identify the bowler with the lowest economy rate¶
In [92]:
lowest_economy_rate = df.loc[(df['Wickets_Taken'] > 0) & (df['Economy_Rate'] > 0)].nsmallest(1,'Economy_Rate')
23. Best Bowling Performance of that Player¶
In [94]:
best_bowling_performance = df.loc[df['Player_Name']=='Ravichandran Ashwin'].nlargest(1,'Best_Bowling_Match')
Just for Exploration !¶
Every year we see a great number of new talents!
Let's plot a pie chart to understand the percentage of number of new debutants every year!
In [ ]:
# Run the below cell for understanding the % of debutants every season!
In [ ]:
df['Debut_Year'] = df.sort_values('Year').drop_duplicates('Player_Name')['Year']
debutants_count = df[df['Debut_Year'].notna()]['Debut_Year'].value_counts().sort_index()
plt.figure(figsize=(10, 8))
plt.pie(debutants_count, labels=debutants_count.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(debutants_count))))
plt.title('Percentage of Debutants by Year')
plt.axis('equal')
plt.show()