Statement of Completion#5b6a009d
Intro to Pandas for Data Analysis
medium
Explore, Filter, Sort and Analyse Cricket Sports Data Using Pandas
Resolution
Activities
Pitch Perfect Analysis: Cricket Stats with Pandas¶
Setting the Stage¶
In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv("preprocessed_data.csv")
In [2]:
# Take a peek at the data
df.head()
Out[2]:
Year | Player_Name | Matches_Batted | Not_Outs | Runs_Scored | Highest_Score | Batting_Average | Balls_Faced | Batting_Strike_Rate | Centuries | ... | Matches_Bowled | Balls_Bowled | Runs_Conceded | Wickets_Taken | Best_Bowling_Match | Bowling_Average | Economy_Rate | Bowling_Strike_Rate | Four_Wicket_Hauls | Five_Wicket_Hauls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2024 | Ruturaj Gaikwad | 2 | 0 | 61 | 46.0 | 30.50 | 51 | 119.61 | 0 | ... | 2 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
1 | 2023 | Ruturaj Gaikwad | 16 | 1 | 590 | 92.0 | 42.14 | 400 | 147.50 | 0 | ... | 16 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
2 | 2022 | Ruturaj Gaikwad | 14 | 0 | 368 | 99.0 | 26.29 | 291 | 126.46 | 0 | ... | 14 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
3 | 2021 | Ruturaj Gaikwad | 16 | 2 | 635 | NaN | 45.35 | 466 | 136.26 | 1 | ... | 16 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
4 | 2020 | Ruturaj Gaikwad | 6 | 2 | 204 | 72.0 | 51.00 | 169 | 120.71 | 0 | ... | 6 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
5 rows × 25 columns
In [3]:
#df.info()
In [4]:
#df.describe()
Warm Up Activities¶
In [5]:
# Run the cell below and find out which players we'll be expecting to see 🙌🏻
In [6]:
df['Player_Name'].unique()
Out[6]:
array(['Ruturaj Gaikwad', 'Devon Conway', 'MS Dhoni', 'Ajinkya Rahane', 'Sameer Rizvi', 'Ravindra Jadeja', 'Mitchell Santner', 'Moeen Ali', 'Shivam Dube', 'Rachin Ravindra', 'Shardul Thakur', 'Daryl Mitchell', 'Rajvardhan Hangargekar', 'Deepak Chahar', 'Maheesh Theekshana', 'Mukesh Choudhary', 'Mustafizur Rahman', 'Prashant Solanki', 'Simarjeet Singh', 'Tushar Deshpande', 'Matheesha Pathirana', 'Rishabh Pant', 'David Warner', 'Prithvi Shaw', 'Yash Dhull', 'Shai Hope', 'Abishek Porel', 'Ricky Bhui', 'Tristan Stubbs', 'Axar Patel', 'Lalit Yadav', 'Mitchell Marsh', 'Sumit Kumar', 'Pravin Dubey', 'Anrich Nortje', 'Kuldeep Yadav', 'Khaleel Ahmed', 'Ishant Sharma', 'Jhye Richardson', 'Mukesh Kumar', 'Rasikh Dar', 'Shubman Gill', 'David Miller', 'Matthew Wade', 'Wriddhiman Saha', 'Kane Williamson', 'Abhinav Manohar', 'B. Sai Sudharsan', 'Darshan Nalkande', 'Vijay Shankar', 'Azmatullah Omarzai', 'Shahrukh Khan', 'Jayant Yadav', 'Rahul Tewatia', 'Mohammad Shami', 'Kartik Tyagi', 'Spencer Johnson', 'Noor Ahmad', 'Sai Kishore', 'Umesh Yadav', 'Rashid Khan', 'Joshua Little', 'Mohit Sharma', 'Shreyas Iyer', 'Rinku Singh', 'Rahmanullah Gurbaz', 'Nitish Rana', 'Sherfane Rutherford', 'Phil Salt', 'K.S Bharat', 'Manish Pandey', 'Anukul Roy', 'Ramandeep Singh', 'Andre Russell', 'Venkatesh Iyer', 'Suyash Sharma', 'Mujeeb Ur Rahman', 'Dushmantha Chameera', 'Harshit Rana', 'Sunil Narine', 'Vaibhav Arora', 'Varun Chakaravarthy', 'Mitchell Starc', 'Chetan Sakariya', 'KL Rahul', 'Devdutt Padikkal', 'Quinton de Kock', 'Nicholas Pooran', 'Ashton Turner', 'Ayush Badoni', 'Deepak Hooda', 'Krishnappa Gowtham', 'Krunal Pandya', 'Kyle Mayers', 'Marcus Stoinis', 'Mohd. Arshad Khan', 'Prerak Mankad', 'Yudhvir Singh', 'David Willey', 'Shivam Mavi', 'Mohsin Khan', 'Ravi Bishnoi', 'Yash Thakur', 'Amit Mishra', 'Naveen Ul Haq', 'Rohit Sharma', 'Dewald Brevis', 'Suryakumar Yadav', 'Ishan Kishan', 'N. Tilak Varma', 'Tim David', 'Vishnu Vinod', 'Hardik Pandya', 'Arjun Tendulkar', 'Mohammad Nabi', 'Romario Shepherd', 'Shams Mulani', 'Nehal Wadhera', 'Gerald Coetzee', 'Naman Dhir', 'Jasprit Bumrah', 'Kumar Kartikeya Singh', 'Piyush Chawla', 'Akash Madhwal', 'Luke Wood', 'Shreyas Gopal', 'Shikhar Dhawan', 'Jitesh Sharma', 'Jonny Bairstow', 'Prabhsimran Singh', 'Liam Livingstone', 'Harpreet Bhatia', 'Rilee Rossouw', 'Shashank Singh', 'Chris Woakes', 'Atharva Taide', 'Rishi Dhawan', 'Sam Curran', 'Sikandar Raza', 'Harpreet Brar', 'Arshdeep Singh', 'Kagiso Rabada', 'Nathan Ellis', 'Rahul Chahar', 'Harshal Patel', 'Sanju Samson', 'Jos Buttler', 'Shimron Hetmyer', 'Yashasvi Jaiswal', 'Dhruv Jurel', 'Riyan Parag', 'Rovman Powell', 'Ravichandran Ashwin', 'Avesh Khan', 'Kuldeep Sen', 'Navdeep Saini', 'Prasidh Krishna', 'Sandeep Sharma', 'Trent Boult', 'Yuzvendra Chahal', 'Adam Zampa', 'Nandre Burger', 'Faf du Plessis', 'Rajat Patidar', 'Virat Kohli', 'Anuj Rawat', 'Dinesh Karthik', 'Suyash S Prabhudessai', 'Glenn Maxwell', 'Mahipal Lomror', 'Karn Sharma', 'Cameron Green', 'Swapnil Singh', 'Mayank Dagar', 'Akash Deep', 'Alzarri Joseph', 'Lockie Ferguson', 'Mohammed Siraj', 'Yash Dayal', 'Tom Curran', 'Reece Topley', 'Himanshu Sharma', 'Vyshak Vijay Kumar', 'Abdul Samad', 'Aiden Markram', 'Rahul Tripathi', 'Glenn Phillips', 'Mayank Agarwal', 'Anmolpreet Singh', 'Heinrich Klaasen', 'Nitish Kumar Reddy', 'Travis Head', 'Abhishek Sharma', 'Marco Jansen', 'Washington Sundar', 'Sanvir Singh', 'Wanindu Hasaranga', 'Pat Cummins', 'Akash Singh', 'Shahbaz Ahamad', 'Bhuvneshwar Kumar', 'Fazalhaq Farooqi', 'Jaydev Unadkat', 'T Natarajan', 'Umran Malik', 'Mayank Markande'], dtype=object)
1. Know the players!¶
In [7]:
# Write your code here
len(df['Player_Name'].unique())
Out[7]:
206
2. Find the Maximum Runs Scored¶
In [8]:
# Write your code here
df['Runs_Scored'].max()
Out[8]:
973
3. How old is the dataset?¶
In [9]:
# Write your code here
len(df['Year'].unique())
Out[9]:
17
4. What is the maximum Batting Strike Rate value?¶
In [10]:
# Write your code here
df['Batting_Strike_Rate'].max()
Out[10]:
400.0
Basic Filtering with loc[]
and Sorting with sort_values()
¶
5. Let's filter out players who played in the 2023!¶
In [11]:
players_2023 = df.loc[df['Year'] == 2023]
6. Top 5 Players with Max Not Outs¶
In [15]:
top_not_outs = df.sort_values(by='Not_Outs', ascending=False).head(5)
Just for exploration!¶
Analyzing High Scorers of the Season¶
Before we dive into the details of players scoring the most runs in 2023, let’s take a broader look at the run distribution over the years.
A histogram will help us understand in which years players scored more freely, potentially reflecting changes in playing conditions or rules.
In [16]:
# Run the below cell to see the high run scorers of the season!
In [17]:
# Histogram of runs scored by year
plt.figure(figsize=(10, 6))
df.groupby('Year')['Runs_Scored'].sum().plot(kind='bar', color='lightcoral')
plt.title('Total Runs Scored Per Year')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
7. Top 5 Run Scorers of 2023¶
In [18]:
top_scorers_2023 = df[df['Year'] == 2023].sort_values(by='Runs_Scored', ascending=False).head(5)
8. Selecting Players with High Batting Average¶
In [22]:
high_average_players = df.loc[df['Batting_Average'] > 50]
Just for Exploration¶
The year 2023 was a real treat for cricket fans – it saw the highest number of centuries! Was it the pitches, the bats, or just plain skill?
Let’s check out a histogram of centuries made over the years and see how 2023 stood out
In [23]:
# Run the below cell to find out the maximum number of centuries hit each season!
In [24]:
# Histogram of centuries by year
import matplotlib.pyplot as plt
centuries_per_year = df.groupby('Year')['Centuries'].sum()
plt.figure(figsize=(10, 6))
centuries_per_year.plot(kind='bar', color='orange')
plt.title('Centuries Recorded Per Year')
plt.xlabel('Year')
plt.ylabel('Total Centuries')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Multi Condition Filtering¶
9. Compound Filtering for Centuries and Wickets¶
In [25]:
century_and_wickets = df.loc[(df['Centuries'] > 0) & (df['Wickets_Taken'] >= 3)]
10. Recent High Scores¶
In [29]:
high_scores_2021 = df[(df['Year'] == 2021) & (df['Highest_Score'] > 50)]['Player_Name']
high_scores_2021
Out[29]:
63 Moeen Ali 123 David Warner 137 Prithvi Shaw 216 Shubman Gill 223 David Miller 366 Nitish Rana 402 Andre Russell 414 Venkatesh Iyer 492 Deepak Hooda 564 Rohit Sharma 582 Suryakumar Yadav 594 Ishan Kishan 673 Shikhar Dhawan 775 Sanju Samson 787 Jos Buttler 945 Glenn Maxwell Name: Player_Name, dtype: object
11. Filter out the Efficient Strikers from the dataset¶
The distribution of batting strike rates is heavily concentrated around the median of approximately 125. However, there are exceptional instances where the batting strike rates far exceed the average, reaching as high as 350 and beyond. These outliers represent the top performers, indicating a highly aggressive batting style.
Here is the distribution of batting strike rates, with a reference line at the value 250, highlighting these extraordinary performances
In [30]:
# Run the below cell for distribution
In [31]:
# Boxplot of batting strike rates
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Batting_Strike_Rate'])
plt.title('Boxplot of Batting Strike Rates')
plt.axvline(x=250, color='red', linestyle='--', label='Reference Line')
plt.xlabel('Batting Strike Rate')
plt.show()
In [33]:
efficient_strikers = df[(df['Batting_Strike_Rate'] > 150) & (df['Runs_Scored'] > 200)]
12. Filtering with Multiple Conditions¶
In [35]:
specific_players = df.loc[(df['Runs_Scored'] > 300) & (df['Wickets_Taken'] == 0)]
13. Who was the First Player to Achieve a Five Wicket Haul?!¶
In [38]:
first_five_wicket_haul = df[df['Five_Wicket_Hauls'] > 0].sort_values(by='Year').nsmallest(1, 'Five_Wicket_Hauls')
first_five_wicket_haul
Out[38]:
Year | Player_Name | Matches_Batted | Not_Outs | Runs_Scored | Highest_Score | Batting_Average | Balls_Faced | Batting_Strike_Rate | Centuries | ... | Matches_Bowled | Balls_Bowled | Runs_Conceded | Wickets_Taken | Best_Bowling_Match | Bowling_Average | Economy_Rate | Bowling_Strike_Rate | Four_Wicket_Hauls | Five_Wicket_Hauls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
556 | 2008 | Amit Mishra | 6 | 0 | 37 | 31.0 | 12.33 | 42 | 88.09 | 0 | ... | 6 | 119 | 142 | 11 | NaN | 12.9 | 7.15 | 10.81 | 0 | 1 |
1 rows × 25 columns
Query Activities¶
14. Query for Five Wicket Hauls¶
In [40]:
five_wicket_hauls = df.query('Five_Wicket_Hauls >= 1')
By the way , comment down below which is your fav IPL team this season!¶
In [ ]:
In [ ]:
15. Filtering with Query¶
In [42]:
active_high_scorers = df.query('Matches_Batted > 15 & Runs_Scored > 500')
Multi Condition Sorting¶
16. Who is the Titan ?¶
In [44]:
sorted_by_runs_and_score = df.sort_values(by=['Runs_Scored', 'Highest_Score'], ascending=[False, False])
17. How well do you understand the impact of economy rate changes over the years?¶
In [ ]:
Understanding Economy Rate¶
The economy rate is a key bowling metric that calculates the average runs conceded per over bowled—the lower the economy rate, the better the bowling!
From the below boxplot, it's clear that most bowlers consistently maintained a moderate economy rate over the years, with recent seasons like 2023 and 2024 showing a few instances of unusually high rates, hinting at some challenging games for the bowlers.
In [46]:
# Run the below cell for understanding the 'Economy rate' over the seasons
In [48]:
df_clean = df.dropna(subset=['Economy_Rate'])
# Create a boxplot of economy rates across years
plt.figure(figsize=(14, 8))
sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
plt.title('Boxplot of Economy Rates by Year')
plt.xlabel('Year')
plt.ylabel('Economy Rate')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
/tmp/ipykernel_16/1520780003.py:5: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(x='Year', y='Economy_Rate', data=df_clean, palette='coolwarm')
In [ ]:
18. Sorting by Economy Rate for 2022¶
In [49]:
sorted_economy_2022 = df.loc[df['Year'] == 2022].sort_values(by='Economy_Rate', ascending=True)
Filtering using : isin()
, nsmallest()
, and nlargest()
¶
19. Filtering Players by Specific Match Conditions¶
In [51]:
years = [2021, 2022]
players_recent_years = df[df['Year'].isin(years)]
In [ ]:
Just for exploration¶
Let's plot the averages before we focus on players with extraordinary averages.
This line graph will help us see trends and anomalies in batting performances.
In [53]:
# Run the below cell for understanding the 'Batting Averages' over the years
In [54]:
# Line plot of batting averages over the years
average_by_year = df.groupby('Year')['Batting_Average'].mean()
plt.figure(figsize=(10, 6))
average_by_year.plot(kind='line', marker='o', linestyle='-', color='navy')
plt.title('Average Batting Averages Over the Years')
plt.xlabel('Year')
plt.ylabel('Batting Average')
plt.grid(True)
plt.show()
20. Analyze Rohit Sharma's Matches Across Different Years¶
In [57]:
rohit_matches = df.loc[df['Player_Name'] == 'Rohit Sharma', ['Year', 'Matches_Batted']].sort_values(by='Year')
rohit_matches
Out[57]:
Year | Matches_Batted | |
---|---|---|
577 | 2008 | 13 |
576 | 2009 | 16 |
575 | 2010 | 16 |
574 | 2011 | 16 |
573 | 2012 | 17 |
572 | 2013 | 19 |
571 | 2014 | 15 |
570 | 2015 | 16 |
569 | 2016 | 14 |
568 | 2017 | 17 |
567 | 2018 | 14 |
566 | 2019 | 15 |
565 | 2020 | 12 |
564 | 2021 | 13 |
563 | 2022 | 14 |
562 | 2023 | 16 |
561 | 2024 | 1 |
Just for Exploration¶
Let’s use a violin plot to visualize the distribution of wickets taken across all matches, emphasizing performances in games where bowlers took more than five wickets.
The widest part of the plot indicates that most wickets taken fall around the 5 to 10 range, suggesting this is the most common performance for bowlers in significant matches.
The plot's long, thin tails extending towards the higher wicket counts, such as 20 and above, indicate these are rarer, exceptional performances.
In [58]:
# Run the below cell
In [59]:
# Filter for high-performance games
high_wicket_games = df[df['Wickets_Taken'] >= 5]
# Violin plot for distribution of wickets in high-performance games
plt.figure(figsize=(12, 6))
sns.violinplot(x='Wickets_Taken', data=high_wicket_games)
plt.title('Distribution of Wickets Taken in High-Performance Games')
plt.xlabel('Wickets Taken')
plt.grid(True)
plt.show()
21. Who is the Purple Cap Winner of 2022?!¶
In [61]:
max_wickets_2022 = df[df['Year'] == 2022].nlargest(1, 'Wickets_Taken')
22. Identify the bowler with the lowest economy rate¶
In [63]:
lowest_economy_rate = df[(df['Economy_Rate'] > 0) & (df['Wickets_Taken'] > 0)].nsmallest(1, 'Economy_Rate')
23. Best Bowling Performance of that Player¶
In [66]:
best_bowling_performance = df[df['Player_Name'] == lowest_economy_rate.iloc[0]['Player_Name']].nlargest(1, 'Best_Bowling_Match')
Just for Exploration !¶
Every year we see a great number of new talents!
Let's plot a pie chart to understand the percentage of number of new debutants every year!
In [67]:
# Run the below cell for understanding the % of debutants every season!
In [68]:
df['Debut_Year'] = df.sort_values('Year').drop_duplicates('Player_Name')['Year']
debutants_count = df[df['Debut_Year'].notna()]['Debut_Year'].value_counts().sort_index()
plt.figure(figsize=(10, 8))
plt.pie(debutants_count, labels=debutants_count.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(debutants_count))))
plt.title('Percentage of Debutants by Year')
plt.axis('equal')
plt.show()