Pitch Perfect Analysis: Cricket Stats with Pandas¶

Setting the Stage¶

In [1]:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("preprocessed_data.csv")

In [2]:

# Take a peek at the data
df.head()

Out[2]:

	Year	Player_Name	Matches_Batted	Not_Outs	Runs_Scored	Highest_Score	Batting_Average	Balls_Faced	Batting_Strike_Rate	Centuries	...	Matches_Bowled
0	2024	Ruturaj Gaikwad	2	0	61	46.0	30.50	51	119.61	0	...	2
1	2023	Ruturaj Gaikwad	16	1	590	92.0	42.14	400	147.50	0	...	16
2	2022	Ruturaj Gaikwad	14	0	368	99.0	26.29	291	126.46	0	...	14
3	2021	Ruturaj Gaikwad	16	2	635	NaN	45.35	466	136.26	1	...	16
4	2020	Ruturaj Gaikwad	6	2	204	72.0	51.00	169	120.71	0	...	6

5 rows × 25 columns

In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1130 entries, 0 to 1129
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year                 1130 non-null   int64  
 1   Player_Name          1130 non-null   object 
 2   Matches_Batted       1130 non-null   int64  
 3   Not_Outs             1130 non-null   int64  
 4   Runs_Scored          1130 non-null   int64  
 5   Highest_Score        700 non-null    float64
 6   Batting_Average      1130 non-null   float64
 7   Balls_Faced          1130 non-null   int64  
 8   Batting_Strike_Rate  1130 non-null   float64
 9   Centuries            1130 non-null   int64  
 10  Half_Centuries       1130 non-null   int64  
 11  Fours                1130 non-null   int64  
 12  Sixes                1130 non-null   int64  
 13  Catches_Taken        1130 non-null   int64  
 14  Stumpings            1130 non-null   int64  
 15  Matches_Bowled       1130 non-null   int64  
 16  Balls_Bowled         1130 non-null   int64  
 17  Runs_Conceded        1130 non-null   int64  
 18  Wickets_Taken        1130 non-null   int64  
 19  Best_Bowling_Match   419 non-null    float64
 20  Bowling_Average      1130 non-null   float64
 21  Economy_Rate         1130 non-null   float64
 22  Bowling_Strike_Rate  1130 non-null   float64
 23  Four_Wicket_Hauls    1130 non-null   int64  
 24  Five_Wicket_Hauls    1130 non-null   int64  
dtypes: float64(7), int64(17), object(1)
memory usage: 220.8+ KB

In [4]:

#df.describe()

Warm Up Activities¶

In [6]:

# Run the cell below and find out which players we'll be expecting to see 🙌🏻

In [7]:

len(df['Player_Name'].unique())

Out[7]:

1. Know the players!¶

In [ ]:

# Write your code here

2. Find the Maximum Runs Scored¶

In [8]:

df['Runs_Scored'].max()

Out[8]:

3. How old is the dataset?¶

In [11]:

len(df['Year'].unique())

Out[11]:

4. What is the maximum Batting Strike Rate value?¶

In [12]:

df['Batting_Strike_Rate'].max()

Out[12]:

400.0

Basic Filtering with `loc[]` and Sorting with `sort_values()`¶

5. Let's filter out players who played in the 2023!¶

In [13]:

players_2023 = df.loc[df['Year']==2023]

6. Top 5 Players with Max Not Outs¶

In [17]:

top_not_outs = df.sort_values('Not_Outs',ascending=False).head(5)

Just for exploration!¶

Analyzing High Scorers of the Season¶

Before we dive into the details of players scoring the most runs in 2023, let’s take a broader look at the run distribution over the years.
A histogram will help us understand in which years players scored more freely, potentially reflecting changes in playing conditions or rules.

In [ ]:

# Run the below cell to see the high run scorers of the season!

In [19]:

# Histogram of runs scored by year

plt.figure(figsize=(10, 6))
df.groupby('Year')['Runs_Scored'].sum().plot(kind='bar', color='lightcoral')
plt.title('Total Runs Scored Per Year')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

No description has been provided for this image

7. Top 5 Run Scorers of 2023¶

In [26]:

top_scorers_2023 = players_2023.sort_values('Runs_Scored',ascending=False).head(5)

8. Selecting Players with High Batting Average¶

In [28]:

high_average_players = df.loc[df['Batting_Average'] > 50]

Just for Exploration¶

The year 2023 was a real treat for cricket fans – it saw the highest number of centuries! Was it the pitches, the bats, or just plain skill?
Let’s check out a histogram of centuries made over the years and see how 2023 stood out

In [ ]:

# Run the below cell to find out the maximum number of centuries hit each season!

In [30]:

# Histogram of centuries by year
import matplotlib.pyplot as plt

centuries_per_year = df.groupby('Year')['Centuries'].sum()

plt.figure(figsize=(10, 6))
centuries_per_year.plot(kind='bar', color='orange')
plt.title('Centuries Recorded Per Year')
plt.xlabel('Year')
plt.ylabel('Total Centuries')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Multi Condition Filtering¶

9. Compound Filtering for Centuries and Wickets¶

In [32]:

century_and_wickets = df.loc[(df['Centuries'] >= 1) & (df['Wickets_Taken'] >= 3)]

10. Recent High Scores¶

In [34]:

high_scores_2021 = df.loc[(df['Year'] == 2021) & (df['Runs_Scored'] > 50)]

11. Filter out the Efficient Strikers from the dataset¶

The distribution of batting strike rates is heavily concentrated around the median of approximately 125. However, there are exceptional instances where the batting strike rates far exceed the average, reaching as high as 350 and beyond. These outliers represent the top performers, indicating a highly aggressive batting style.

Here is the distribution of batting strike rates, with a reference line at the value 250, highlighting these extraordinary performances

In [ ]:

# Run the below cell for distribution

In [36]:

# Boxplot of batting strike rates
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Batting_Strike_Rate'])
plt.title('Boxplot of Batting Strike Rates')
plt.axvline(x=250, color='red', linestyle='--', label='Reference Line')
plt.xlabel('Batting Strike Rate')
plt.show()

In [37]:

efficient_strikers = df.loc[(df['Batting_Strike_Rate'] > 150) & (df['Runs_Scored'] > 200)]

12. Filtering with Multiple Conditions¶

In [39]:

specific_players = df.loc[(df['Runs_Scored'] > 300) & (df['Wickets_Taken'] == 0)]

13. Who was the First Player to Achieve a Five Wicket Haul?!¶

In [48]:

first_five_wicket_haul = df.loc[df['Five_Wicket_Hauls'] == 1].nsmallest(1,'Year')

Query Activities¶

14. Query for Five Wicket Hauls¶

In [49]:

five_wicket_hauls = df.query('Five_Wicket_Hauls == 1')

By the way , comment down below which is your fav IPL team this season!¶

In [ ]:

15. Filtering with Query¶

In [51]:

active_high_scorers = df.query('Matches_Batted > 15 & Runs_Scored > 500')

Multi Condition Sorting¶

16. Who is the Titan ?¶

In [55]:

sorted_by_runs_and_score = df.sort_values(['Runs_Scored','Highest_Score'],ascending=False)

17. How well do you understand the impact of economy rate changes over the years?¶

In [ ]:

Understanding Economy Rate¶

The economy rate is a key bowling metric that calculates the average runs conceded per over bowled—the lower the economy rate, the better the bowling!

From the below boxplot, it's clear that most bowlers consistently maintained a moderate economy rate over the years, with recent seasons like 2023 and 2024 showing a few instances of unusually high rates, hinting at some challenging games for the bowlers.

In [ ]:

# Run the below cell for understanding the 'Economy rate' over the seasons

In [57]:

df_clean = df.dropna(subset=['Economy_Rate'])

# Create a boxplot of economy rates across years
plt.figure(figsize=(14, 8))
sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
plt.title('Boxplot of Economy Rates by Year')
plt.xlabel('Year')
plt.ylabel('Economy Rate')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

/tmp/ipykernel_18/1520780003.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')

18. Sorting by Economy Rate for 2022¶

In [61]:

sorted_economy_2022 = df.loc[df['Year']==2022].sort_values('Economy_Rate')

Filtering using : `isin()`, `nsmallest()`, and `nlargest()`¶

19. Filtering Players by Specific Match Conditions¶

In [63]:

years = [2021, 2022]
players_recent_years = df.loc[df['Year'].isin(years)]

Just for exploration¶

Let's plot the averages before we focus on players with extraordinary averages.
This line graph will help us see trends and anomalies in batting performances.

In [ ]:

# Run the below cell for understanding the  'Batting Averages' over the years

In [65]:

# Line plot of batting averages over the years
average_by_year = df.groupby('Year')['Batting_Average'].mean()

plt.figure(figsize=(10, 6))
average_by_year.plot(kind='line', marker='o', linestyle='-', color='navy')
plt.title('Average Batting Averages Over the Years')
plt.xlabel('Year')
plt.ylabel('Batting Average')
plt.grid(True)
plt.show()

20. Analyze Rohit Sharma's Matches Across Different Years¶

In [ ]:

In [81]:

rohit_matches = df.loc[df['Player_Name']=='Rohit Sharma',('Year','Matches_Batted')].sort_values('Year')
rohit_matches

Out[81]:

	Year	Matches_Batted
577	2008	13
576	2009	16
575	2010	16
574	2011	16
573	2012	17
572	2013	19
571	2014	15
570	2015	16
569	2016	14
568	2017	17
567	2018	14
566	2019	15
565	2020	12
564	2021	13
563	2022	14
562	2023	16
561	2024	1

Just for Exploration¶

Let’s use a violin plot to visualize the distribution of wickets taken across all matches, emphasizing performances in games where bowlers took more than five wickets.
The widest part of the plot indicates that most wickets taken fall around the 5 to 10 range, suggesting this is the most common performance for bowlers in significant matches.
The plot's long, thin tails extending towards the higher wicket counts, such as 20 and above, indicate these are rarer, exceptional performances.

In [ ]:

# Run the below cell

In [83]:

# Filter for high-performance games
high_wicket_games = df[df['Wickets_Taken'] >= 5]

# Violin plot for distribution of wickets in high-performance games
plt.figure(figsize=(12, 6))
sns.violinplot(x='Wickets_Taken', data=high_wicket_games)
plt.title('Distribution of Wickets Taken in High-Performance Games')
plt.xlabel('Wickets Taken')
plt.grid(True)
plt.show()

21. Who is the Purple Cap Winner of 2022?!¶

In [88]:

max_wickets_2022 = df.loc[df['Year'] == 2022].sort_values('Wickets_Taken',ascending=False).head(1)

22. Identify the bowler with the lowest economy rate¶

In [92]:

lowest_economy_rate = df.loc[(df['Wickets_Taken'] > 0) & (df['Economy_Rate'] > 0)].nsmallest(1,'Economy_Rate')

23. Best Bowling Performance of that Player¶

In [94]:

best_bowling_performance = df.loc[df['Player_Name']=='Ravichandran Ashwin'].nlargest(1,'Best_Bowling_Match')

Just for Exploration !¶

Every year we see a great number of new talents!
Let's plot a pie chart to understand the percentage of number of new debutants every year!

In [ ]:

# Run the below cell for understanding the % of debutants every season!

In [ ]:

df['Debut_Year'] = df.sort_values('Year').drop_duplicates('Player_Name')['Year']
debutants_count = df[df['Debut_Year'].notna()]['Debut_Year'].value_counts().sort_index()

plt.figure(figsize=(10, 8))
plt.pie(debutants_count, labels=debutants_count.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(debutants_count))))
plt.title('Percentage of Debutants by Year')
plt.axis('equal')  
plt.show()

Statement of Completion#7fde81d0

Intro to Pandas for Data Analysis

Explore, Filter, Sort and Analyse Cricket Sports Data Using Pandas

Pitch Perfect Analysis: Cricket Stats with Pandas¶

Setting the Stage¶

Warm Up Activities¶

1. Know the players!¶

2. Find the Maximum Runs Scored¶

3. How old is the dataset?¶

4. What is the maximum Batting Strike Rate value?¶

Basic Filtering with loc[] and Sorting with sort_values()¶

5. Let's filter out players who played in the 2023!¶

6. Top 5 Players with Max Not Outs¶

Just for exploration!¶

Analyzing High Scorers of the Season¶

7. Top 5 Run Scorers of 2023¶

8. Selecting Players with High Batting Average¶

Just for Exploration¶

Multi Condition Filtering¶

9. Compound Filtering for Centuries and Wickets¶

10. Recent High Scores¶

11. Filter out the Efficient Strikers from the dataset¶

12. Filtering with Multiple Conditions¶

13. Who was the First Player to Achieve a Five Wicket Haul?!¶

Query Activities¶

14. Query for Five Wicket Hauls¶

By the way , comment down below which is your fav IPL team this season!¶

15. Filtering with Query¶

Multi Condition Sorting¶

16. Who is the Titan ?¶

17. How well do you understand the impact of economy rate changes over the years?¶

Understanding Economy Rate¶

18. Sorting by Economy Rate for 2022¶

Filtering using : isin(), nsmallest(), and nlargest()¶

19. Filtering Players by Specific Match Conditions¶

Just for exploration¶

20. Analyze Rohit Sharma's Matches Across Different Years¶

Just for Exploration¶

21. Who is the Purple Cap Winner of 2022?!¶

22. Identify the bowler with the lowest economy rate¶

23. Best Bowling Performance of that Player¶

Just for Exploration !¶

GAME-OVER 😔¶

Basic Filtering with `loc[]` and Sorting with `sort_values()`¶

Filtering using : `isin()`, `nsmallest()`, and `nlargest()`¶