Mars :D has successfully completed this project.

Intro to Pandas for Data Analysis

medium

4.61

Explore, Filter, Sort and Analyse Cricket Sports Data Using Pandas

Finished

February 3, 2025 5:35 PM

Elapsed time (min)

Completed activities

Resolution

Activities

Pitch Perfect Analysis: Cricket Stats with Pandas¶

Setting the Stage¶

In [1]:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("preprocessed_data.csv")

In [2]:

# Take a peek at the data
df.head()

Out[2]:

	Year	Player_Name	Matches_Batted	Not_Outs	Runs_Scored	Highest_Score	Batting_Average	Balls_Faced	Batting_Strike_Rate	Centuries	...	Matches_Bowled
0	2024	Ruturaj Gaikwad	2	0	61	46.0	30.50	51	119.61	0	...	2
1	2023	Ruturaj Gaikwad	16	1	590	92.0	42.14	400	147.50	0	...	16
2	2022	Ruturaj Gaikwad	14	0	368	99.0	26.29	291	126.46	0	...	14
3	2021	Ruturaj Gaikwad	16	2	635	NaN	45.35	466	136.26	1	...	16
4	2020	Ruturaj Gaikwad	6	2	204	72.0	51.00	169	120.71	0	...	6

5 rows × 25 columns

In [3]:

#df.info()

In [4]:

#df.describe()

Warm Up Activities¶

In [5]:

# Run the cell below and find out which players we'll be expecting to see 🙌🏻

In [6]:

df['Player_Name'].unique()

Out[6]:

array(['Ruturaj Gaikwad', 'Devon Conway', 'MS Dhoni', 'Ajinkya Rahane',
       'Sameer Rizvi', 'Ravindra Jadeja', 'Mitchell Santner', 'Moeen Ali',
       'Shivam Dube', 'Rachin Ravindra', 'Shardul Thakur',
       'Daryl Mitchell', 'Rajvardhan Hangargekar', 'Deepak Chahar',
       'Maheesh Theekshana', 'Mukesh Choudhary', 'Mustafizur Rahman',
       'Prashant Solanki', 'Simarjeet Singh', 'Tushar Deshpande',
       'Matheesha Pathirana', 'Rishabh Pant', 'David Warner',
       'Prithvi Shaw', 'Yash Dhull', 'Shai Hope', 'Abishek Porel',
       'Ricky Bhui', 'Tristan Stubbs', 'Axar Patel', 'Lalit Yadav',
       'Mitchell Marsh', 'Sumit Kumar', 'Pravin Dubey', 'Anrich Nortje',
       'Kuldeep Yadav', 'Khaleel Ahmed', 'Ishant Sharma',
       'Jhye Richardson', 'Mukesh Kumar', 'Rasikh Dar', 'Shubman Gill',
       'David Miller', 'Matthew Wade', 'Wriddhiman Saha',
       'Kane Williamson', 'Abhinav Manohar', 'B. Sai Sudharsan',
       'Darshan Nalkande', 'Vijay Shankar', 'Azmatullah Omarzai',
       'Shahrukh Khan', 'Jayant Yadav', 'Rahul Tewatia', 'Mohammad Shami',
       'Kartik Tyagi', 'Spencer Johnson', 'Noor Ahmad', 'Sai Kishore',
       'Umesh Yadav', 'Rashid Khan', 'Joshua Little', 'Mohit Sharma',
       'Shreyas Iyer', 'Rinku Singh', 'Rahmanullah Gurbaz', 'Nitish Rana',
       'Sherfane Rutherford', 'Phil Salt', 'K.S Bharat', 'Manish Pandey',
       'Anukul Roy', 'Ramandeep Singh', 'Andre Russell', 'Venkatesh Iyer',
       'Suyash Sharma', 'Mujeeb Ur Rahman', 'Dushmantha Chameera',
       'Harshit Rana', 'Sunil Narine', 'Vaibhav Arora',
       'Varun Chakaravarthy', 'Mitchell Starc', 'Chetan Sakariya',
       'KL Rahul', 'Devdutt Padikkal', 'Quinton de Kock',
       'Nicholas Pooran', 'Ashton Turner', 'Ayush Badoni', 'Deepak Hooda',
       'Krishnappa Gowtham', 'Krunal Pandya', 'Kyle Mayers',
       'Marcus Stoinis', 'Mohd. Arshad Khan', 'Prerak Mankad',
       'Yudhvir Singh', 'David Willey', 'Shivam Mavi', 'Mohsin Khan',
       'Ravi Bishnoi', 'Yash Thakur', 'Amit Mishra', 'Naveen Ul Haq',
       'Rohit Sharma', 'Dewald Brevis', 'Suryakumar Yadav',
       'Ishan Kishan', 'N. Tilak Varma', 'Tim David', 'Vishnu Vinod',
       'Hardik Pandya', 'Arjun Tendulkar', 'Mohammad Nabi',
       'Romario Shepherd', 'Shams Mulani', 'Nehal Wadhera',
       'Gerald Coetzee', 'Naman Dhir', 'Jasprit Bumrah',
       'Kumar Kartikeya Singh', 'Piyush Chawla', 'Akash Madhwal',
       'Luke Wood', 'Shreyas Gopal', 'Shikhar Dhawan', 'Jitesh Sharma',
       'Jonny Bairstow', 'Prabhsimran Singh', 'Liam Livingstone',
       'Harpreet Bhatia', 'Rilee Rossouw', 'Shashank Singh',
       'Chris Woakes', 'Atharva Taide', 'Rishi Dhawan', 'Sam Curran',
       'Sikandar Raza', 'Harpreet Brar', 'Arshdeep Singh',
       'Kagiso Rabada', 'Nathan Ellis', 'Rahul Chahar', 'Harshal Patel',
       'Sanju Samson', 'Jos Buttler', 'Shimron Hetmyer',
       'Yashasvi Jaiswal', 'Dhruv Jurel', 'Riyan Parag', 'Rovman Powell',
       'Ravichandran Ashwin', 'Avesh Khan', 'Kuldeep Sen',
       'Navdeep Saini', 'Prasidh Krishna', 'Sandeep Sharma',
       'Trent Boult', 'Yuzvendra Chahal', 'Adam Zampa', 'Nandre Burger',
       'Faf du Plessis', 'Rajat Patidar', 'Virat Kohli', 'Anuj Rawat',
       'Dinesh Karthik', 'Suyash S Prabhudessai', 'Glenn Maxwell',
       'Mahipal Lomror', 'Karn Sharma', 'Cameron Green', 'Swapnil Singh',
       'Mayank Dagar', 'Akash Deep', 'Alzarri Joseph', 'Lockie Ferguson',
       'Mohammed Siraj', 'Yash Dayal', 'Tom Curran', 'Reece Topley',
       'Himanshu Sharma', 'Vyshak Vijay Kumar', 'Abdul Samad',
       'Aiden Markram', 'Rahul Tripathi', 'Glenn Phillips',
       'Mayank Agarwal', 'Anmolpreet Singh', 'Heinrich Klaasen',
       'Nitish Kumar Reddy', 'Travis Head', 'Abhishek Sharma',
       'Marco Jansen', 'Washington Sundar', 'Sanvir Singh',
       'Wanindu Hasaranga', 'Pat Cummins', 'Akash Singh',
       'Shahbaz Ahamad', 'Bhuvneshwar Kumar', 'Fazalhaq Farooqi',
       'Jaydev Unadkat', 'T Natarajan', 'Umran Malik', 'Mayank Markande'],
      dtype=object)

1. Know the players!¶

In [7]:

# Write your code here
len(df['Player_Name'].unique())

Out[7]:

2. Find the Maximum Runs Scored¶

In [8]:

# Write your code here
df['Runs_Scored'].max()

Out[8]:

3. How old is the dataset?¶

In [9]:

# Write your code here
len(df['Year'].unique())

Out[9]:

4. What is the maximum Batting Strike Rate value?¶

In [10]:

# Write your code here
df['Batting_Strike_Rate'].max()

Out[10]:

400.0

Basic Filtering with `loc[]` and Sorting with `sort_values()`¶

5. Let's filter out players who played in the 2023!¶

In [11]:

players_2023 = df.loc[df['Year'] == 2023]

6. Top 5 Players with Max Not Outs¶

In [15]:

top_not_outs = df.sort_values(by='Not_Outs', ascending=False).head(5)

Just for exploration!¶

Analyzing High Scorers of the Season¶

Before we dive into the details of players scoring the most runs in 2023, let’s take a broader look at the run distribution over the years.
A histogram will help us understand in which years players scored more freely, potentially reflecting changes in playing conditions or rules.

In [16]:

# Run the below cell to see the high run scorers of the season!

In [17]:

# Histogram of runs scored by year

plt.figure(figsize=(10, 6))
df.groupby('Year')['Runs_Scored'].sum().plot(kind='bar', color='lightcoral')
plt.title('Total Runs Scored Per Year')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

No description has been provided for this image

7. Top 5 Run Scorers of 2023¶

In [18]:

top_scorers_2023 = df[df['Year'] == 2023].sort_values(by='Runs_Scored', ascending=False).head(5)

8. Selecting Players with High Batting Average¶

In [22]:

high_average_players = df.loc[df['Batting_Average'] > 50]

Just for Exploration¶

The year 2023 was a real treat for cricket fans – it saw the highest number of centuries! Was it the pitches, the bats, or just plain skill?
Let’s check out a histogram of centuries made over the years and see how 2023 stood out

In [23]:

# Run the below cell to find out the maximum number of centuries hit each season!

In [24]:

# Histogram of centuries by year
import matplotlib.pyplot as plt

centuries_per_year = df.groupby('Year')['Centuries'].sum()

plt.figure(figsize=(10, 6))
centuries_per_year.plot(kind='bar', color='orange')
plt.title('Centuries Recorded Per Year')
plt.xlabel('Year')
plt.ylabel('Total Centuries')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Multi Condition Filtering¶

9. Compound Filtering for Centuries and Wickets¶

In [25]:

century_and_wickets = df.loc[(df['Centuries'] > 0) & (df['Wickets_Taken'] >= 3)]

10. Recent High Scores¶

In [29]:

high_scores_2021 = df[(df['Year'] == 2021) & (df['Highest_Score'] > 50)]['Player_Name']
high_scores_2021

Out[29]:

63            Moeen Ali
123        David Warner
137        Prithvi Shaw
216        Shubman Gill
223        David Miller
366         Nitish Rana
402       Andre Russell
414      Venkatesh Iyer
492        Deepak Hooda
564        Rohit Sharma
582    Suryakumar Yadav
594        Ishan Kishan
673      Shikhar Dhawan
775        Sanju Samson
787         Jos Buttler
945       Glenn Maxwell
Name: Player_Name, dtype: object

11. Filter out the Efficient Strikers from the dataset¶

The distribution of batting strike rates is heavily concentrated around the median of approximately 125. However, there are exceptional instances where the batting strike rates far exceed the average, reaching as high as 350 and beyond. These outliers represent the top performers, indicating a highly aggressive batting style.

Here is the distribution of batting strike rates, with a reference line at the value 250, highlighting these extraordinary performances

In [30]:

# Run the below cell for distribution

In [31]:

# Boxplot of batting strike rates
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Batting_Strike_Rate'])
plt.title('Boxplot of Batting Strike Rates')
plt.axvline(x=250, color='red', linestyle='--', label='Reference Line')
plt.xlabel('Batting Strike Rate')
plt.show()

In [33]:

efficient_strikers = df[(df['Batting_Strike_Rate'] > 150) & (df['Runs_Scored'] > 200)]

12. Filtering with Multiple Conditions¶

In [35]:

specific_players = df.loc[(df['Runs_Scored'] > 300) & (df['Wickets_Taken'] == 0)]

13. Who was the First Player to Achieve a Five Wicket Haul?!¶

In [38]:

first_five_wicket_haul = df[df['Five_Wicket_Hauls'] > 0].sort_values(by='Year').nsmallest(1, 'Five_Wicket_Hauls')
first_five_wicket_haul

Out[38]:

	Year	Player_Name	Matches_Batted	Not_Outs	Runs_Scored	Highest_Score	Batting_Average	Balls_Faced	Batting_Strike_Rate	Centuries	...	Matches_Bowled	Balls_Bowled	Runs_Conceded	Wickets_Taken	Best_Bowling_Match	Bowling_Average	Economy_Rate	Bowling_Strike_Rate	Four_Wicket_Hauls	Five_Wicket_Hauls
556	2008	Amit Mishra	6	0	37	31.0	12.33	42	88.09	0	...	6	119	142	11	NaN	12.9	7.15	10.81	0	1

1 rows × 25 columns

Query Activities¶

14. Query for Five Wicket Hauls¶

In [40]:

five_wicket_hauls = df.query('Five_Wicket_Hauls >= 1')

By the way , comment down below which is your fav IPL team this season!¶

In [ ]:

15. Filtering with Query¶

In [42]:

active_high_scorers = df.query('Matches_Batted > 15 & Runs_Scored > 500')

Multi Condition Sorting¶

16. Who is the Titan ?¶

In [44]:

sorted_by_runs_and_score = df.sort_values(by=['Runs_Scored', 'Highest_Score'], ascending=[False, False])

17. How well do you understand the impact of economy rate changes over the years?¶

In [ ]:

Understanding Economy Rate¶

The economy rate is a key bowling metric that calculates the average runs conceded per over bowled—the lower the economy rate, the better the bowling!

From the below boxplot, it's clear that most bowlers consistently maintained a moderate economy rate over the years, with recent seasons like 2023 and 2024 showing a few instances of unusually high rates, hinting at some challenging games for the bowlers.

In [46]:

# Run the below cell for understanding the 'Economy rate' over the seasons

In [48]:

df_clean = df.dropna(subset=['Economy_Rate'])

# Create a boxplot of economy rates across years
plt.figure(figsize=(14, 8))
sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
plt.title('Boxplot of Economy Rates by Year')
plt.xlabel('Year')
plt.ylabel('Economy Rate')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

/tmp/ipykernel_16/1520780003.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')

In [ ]:

18. Sorting by Economy Rate for 2022¶

In [49]:

sorted_economy_2022 = df.loc[df['Year'] == 2022].sort_values(by='Economy_Rate', ascending=True)

Filtering using : `isin()`, `nsmallest()`, and `nlargest()`¶

19. Filtering Players by Specific Match Conditions¶

In [51]:

years = [2021, 2022]
players_recent_years = df[df['Year'].isin(years)]

In [ ]:

Just for exploration¶

Let's plot the averages before we focus on players with extraordinary averages.
This line graph will help us see trends and anomalies in batting performances.

In [53]:

# Run the below cell for understanding the  'Batting Averages' over the years

In [54]:

# Line plot of batting averages over the years
average_by_year = df.groupby('Year')['Batting_Average'].mean()

plt.figure(figsize=(10, 6))
average_by_year.plot(kind='line', marker='o', linestyle='-', color='navy')
plt.title('Average Batting Averages Over the Years')
plt.xlabel('Year')
plt.ylabel('Batting Average')
plt.grid(True)
plt.show()

20. Analyze Rohit Sharma's Matches Across Different Years¶

In [57]:

rohit_matches = df.loc[df['Player_Name'] == 'Rohit Sharma', ['Year', 'Matches_Batted']].sort_values(by='Year')
rohit_matches

Out[57]:

	Year	Matches_Batted
577	2008	13
576	2009	16
575	2010	16
574	2011	16
573	2012	17
572	2013	19
571	2014	15
570	2015	16
569	2016	14
568	2017	17
567	2018	14
566	2019	15
565	2020	12
564	2021	13
563	2022	14
562	2023	16
561	2024	1

Just for Exploration¶

Let’s use a violin plot to visualize the distribution of wickets taken across all matches, emphasizing performances in games where bowlers took more than five wickets.
The widest part of the plot indicates that most wickets taken fall around the 5 to 10 range, suggesting this is the most common performance for bowlers in significant matches.
The plot's long, thin tails extending towards the higher wicket counts, such as 20 and above, indicate these are rarer, exceptional performances.

In [58]:

# Run the below cell

In [59]:

# Filter for high-performance games
high_wicket_games = df[df['Wickets_Taken'] >= 5]

# Violin plot for distribution of wickets in high-performance games
plt.figure(figsize=(12, 6))
sns.violinplot(x='Wickets_Taken', data=high_wicket_games)
plt.title('Distribution of Wickets Taken in High-Performance Games')
plt.xlabel('Wickets Taken')
plt.grid(True)
plt.show()

21. Who is the Purple Cap Winner of 2022?!¶

In [61]:

max_wickets_2022 = df[df['Year'] == 2022].nlargest(1, 'Wickets_Taken')

22. Identify the bowler with the lowest economy rate¶

In [63]:

lowest_economy_rate = df[(df['Economy_Rate'] > 0) & (df['Wickets_Taken'] > 0)].nsmallest(1, 'Economy_Rate')

23. Best Bowling Performance of that Player¶

In [66]:

best_bowling_performance = df[df['Player_Name'] == lowest_economy_rate.iloc[0]['Player_Name']].nlargest(1, 'Best_Bowling_Match')

Just for Exploration !¶

Every year we see a great number of new talents!
Let's plot a pie chart to understand the percentage of number of new debutants every year!

In [67]:

# Run the below cell for understanding the % of debutants every season!

In [68]:

df['Debut_Year'] = df.sort_values('Year').drop_duplicates('Player_Name')['Year']
debutants_count = df[df['Debut_Year'].notna()]['Debut_Year'].value_counts().sort_index()

plt.figure(figsize=(10, 8))
plt.pie(debutants_count, labels=debutants_count.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(debutants_count))))
plt.title('Percentage of Debutants by Year')
plt.axis('equal')  
plt.show()

Statement of Completion#5b6a009d

Intro to Pandas for Data Analysis

Explore, Filter, Sort and Analyse Cricket Sports Data Using Pandas

Pitch Perfect Analysis: Cricket Stats with Pandas¶

Setting the Stage¶

Warm Up Activities¶

1. Know the players!¶

2. Find the Maximum Runs Scored¶

3. How old is the dataset?¶

4. What is the maximum Batting Strike Rate value?¶

Basic Filtering with loc[] and Sorting with sort_values()¶

5. Let's filter out players who played in the 2023!¶

6. Top 5 Players with Max Not Outs¶

Just for exploration!¶

Analyzing High Scorers of the Season¶

7. Top 5 Run Scorers of 2023¶

8. Selecting Players with High Batting Average¶

Just for Exploration¶

Multi Condition Filtering¶

9. Compound Filtering for Centuries and Wickets¶

10. Recent High Scores¶

11. Filter out the Efficient Strikers from the dataset¶

12. Filtering with Multiple Conditions¶

13. Who was the First Player to Achieve a Five Wicket Haul?!¶

Query Activities¶

14. Query for Five Wicket Hauls¶

By the way , comment down below which is your fav IPL team this season!¶

15. Filtering with Query¶

Multi Condition Sorting¶

16. Who is the Titan ?¶

17. How well do you understand the impact of economy rate changes over the years?¶

Understanding Economy Rate¶

18. Sorting by Economy Rate for 2022¶

Filtering using : isin(), nsmallest(), and nlargest()¶

19. Filtering Players by Specific Match Conditions¶

Just for exploration¶

20. Analyze Rohit Sharma's Matches Across Different Years¶

Just for Exploration¶

21. Who is the Purple Cap Winner of 2022?!¶

22. Identify the bowler with the lowest economy rate¶

23. Best Bowling Performance of that Player¶

Just for Exploration !¶

GAME-OVER 😔¶

Basic Filtering with `loc[]` and Sorting with `sort_values()`¶

Filtering using : `isin()`, `nsmallest()`, and `nlargest()`¶