Cocoa Curations: Series Filtering with Chocolate Ratings¶
Let's Start¶
In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
# Loading the dataset
data = pd.read_csv('flavors_of_cocoa.csv')
In [4]:
data.head()
Out[4]:
| Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
---|---|---|---|---|---|---|---|---|---|
0 | A. Morin | Agua Grande | 1876 | 2016 | 63% | France | 3.75 | | Sao Tome |
1 | A. Morin | Kpime | 1676 | 2015 | 70% | France | 2.75 | | Togo |
2 | A. Morin | Atsane | 1676 | 2015 | 70% | France | 3.00 | | Togo |
3 | A. Morin | Akata | 1680 | 2015 | 70% | France | 3.50 | | Togo |
4 | A. Morin | Quilla | 1704 | 2015 | 70% | France | 3.50 | | Peru |
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Company (Maker-if known)          1795 non-null   object
 1   Specific Bean Origin or Bar Name  1795 non-null   object
 2   REF                               1795 non-null   int64
 3   Review Date                       1795 non-null   int64
 4   Cocoa Percent                     1795 non-null   object
 5   Company Location                  1795 non-null   object
 6   Rating                            1795 non-null   float64
 7   Bean Type                         1794 non-null   object
 8   Broad Bean Origin                 1794 non-null   object
dtypes: float64(1), int64(2), object(6)
memory usage: 126.3+ KB
In [6]:
# Note: the column names contain `\n`, so run the following line to see the exact names
data.columns
Out[6]:
Index(['Company \n(Maker-if known)', 'Specific Bean Origin\nor Bar Name', 'REF', 'Review\nDate', 'Cocoa\nPercent', 'Company\nLocation', 'Rating', 'Bean\nType', 'Broad Bean\nOrigin'], dtype='object')
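The embedded newlines make every column lookup awkward to type. As an optional sketch (not part of the original activity, which keeps the raw names throughout), the headers could be normalized once by collapsing whitespace. The two-row frame `toy` below is a hypothetical stand-in for the real dataset and only reuses three of its header names.

```python
import pandas as pd

# Hypothetical two-row stand-in that reuses three of the raw header names
toy = pd.DataFrame({
    'Company \n(Maker-if known)': ['A. Morin', 'Amedei'],
    'Cocoa\nPercent': ['63%', '70%'],
    'Rating': [3.75, 5.0],
})

# Collapse every run of whitespace (including '\n') into a single space
toy.columns = toy.columns.str.replace(r'\s+', ' ', regex=True).str.strip()

print(list(toy.columns))
```

The remaining cells in this notebook still use the original names such as 'Cocoa\nPercent', so this renaming would only apply if every later lookup were updated to match.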
In [7]:
data.describe()
Out[7]:
| REF | Review\nDate | Rating |
---|---|---|---|
count | 1795.000000 | 1795.000000 | 1795.000000 |
mean | 1035.904735 | 2012.325348 | 3.185933 |
std | 552.886365 | 2.927210 | 0.478062 |
min | 5.000000 | 2006.000000 | 1.000000 |
25% | 576.000000 | 2010.000000 | 2.875000 |
50% | 1069.000000 | 2013.000000 | 3.250000 |
75% | 1502.000000 | 2015.000000 | 3.500000 |
max | 1952.000000 | 2017.000000 | 5.000000 |
1. Which of the following methods is used to find the top 5 chocolates based on their Rating?¶
In [13]:
# Identify the top 5 chocolates based on their rating
data.nlargest(5, 'Rating')
Out[13]:
| Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
---|---|---|---|---|---|---|---|---|---|
78 | Amedei | Chuao | 111 | 2007 | 70% | Italy | 5.0 | Trinitario | Venezuela |
86 | Amedei | Toscano Black | 40 | 2006 | 70% | Italy | 5.0 | Blend | |
9 | A. Morin | Pablino | 1319 | 2014 | 70% | France | 4.0 | | Peru |
17 | A. Morin | Chuao | 1015 | 2013 | 70% | France | 4.0 | Trinitario | Venezuela |
20 | A. Morin | Chanchamayo Province | 1019 | 2013 | 63% | France | 4.0 | | Peru |
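By default `nlargest` breaks ties by order of appearance (`keep='first'`), which is why the 4.0-rated bars in the output above appear in index order. A minimal sketch on hypothetical ratings (the frame `toy` is made up for illustration) shows how the `keep` parameter changes what is returned when the cutoff falls inside a tie:

```python
import pandas as pd

# Hypothetical ratings; rows 2 and 4 tie at 4.0
toy = pd.DataFrame({'Rating': [3.75, 5.0, 4.0, 5.0, 4.0, 2.75]})

# keep='first' (the default): only the first tied 4.0 row makes the cut
top3_first = toy.nlargest(3, 'Rating')

# keep='all': every row tied with the last selected value is included
top3_all = toy.nlargest(3, 'Rating', keep='all')

print(sorted(top3_first.index))
print(sorted(top3_all.index))
```

With `keep='all'` the result can contain more than the requested number of rows, which is worth knowing before treating the output length as fixed.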
2. Identify Low-Rated Chocolate Bars¶
In [29]:
# Count the bars rated below 2 (count() returns the non-null count for each column)
low_rated_count = data[data['Rating'] < 2].count()
In [36]:
data['Cocoa\nPercent'] = data['Cocoa\nPercent'].str.rstrip('%').astype(float)
In [38]:
high_cocoa_chocolates = data[data['Cocoa\nPercent'] > 70]
In [39]:
high_cocoa_chocolates
Out[39]:
| Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
---|---|---|---|---|---|---|---|---|---|
26 | Adi | Vanua Levu, Toto-A | 705 | 2011 | 80.0 | Fiji | 3.25 | Trinitario | Fiji |
27 | Adi | Vanua Levu | 705 | 2011 | 88.0 | Fiji | 3.50 | Trinitario | Fiji |
28 | Adi | Vanua Levu, Ami-Ami-CA | 705 | 2011 | 72.0 | Fiji | 3.50 | Trinitario | Fiji |
32 | Akesson's (Pralus) | Bali (west), Sukrama Family, Melaya area | 636 | 2011 | 75.0 | Switzerland | 3.75 | Trinitario | Indonesia |
33 | Akesson's (Pralus) | Madagascar, Ambolikapiky P. | 502 | 2010 | 75.0 | Switzerland | 2.75 | Criollo | Madagascar |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1778 | Zotter | Raw | 1205 | 2014 | 80.0 | Austria | 2.75 | | |
1779 | Zotter | Bocas del Toro, Cocabo Co-op | 801 | 2012 | 72.0 | Austria | 3.50 | | Panama |
1784 | Zotter | El Oro | 879 | 2012 | 75.0 | Austria | 3.00 | Forastero (Nacional) | Ecuador |
1785 | Zotter | Huiwani Coop | 879 | 2012 | 75.0 | Austria | 3.00 | Criollo, Trinitario | Papua New Guinea |
1786 | Zotter | El Ceibo Coop | 879 | 2012 | 90.0 | Austria | 3.25 | | Bolivia |
795 rows × 9 columns
3. High Cocoa Percent Chocolates¶
In [ ]:
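# Convert the 'Cocoa\nPercent' column from strings like '70%' to floats for filtering
# (astype(str) keeps this cell safe to re-run once the column is already float)
data['Cocoa\nPercent'] = data['Cocoa\nPercent'].astype(str).str.rstrip('%').astype(float)
# Keep only the bars with more than 70% cocoa
high_cocoa_chocolates = data[data['Cocoa\nPercent'] > 70]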
4. Count Chocolates Above Average Rating¶
In [41]:
# calculate mean rating
mean_rating = data['Rating'].mean()
In [54]:
# Create a Series storing the per-column count of chocolate bars whose rating is above the mean rating calculated above
above_avg_chocolates = data[data['Rating'] > mean_rating].count().sort_values(ascending=True)
In [46]:
above_avg_chocolates
Out[46]:
Company \n(Maker-if known)           1005
Specific Bean Origin\nor Bar Name    1005
REF                                  1005
Review\nDate                         1005
Cocoa\nPercent                       1005
Company\nLocation                    1005
Rating                               1005
Bean\nType                           1004
Broad Bean\nOrigin                   1005
dtype: int64
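The `Bean\nType` entry above is 1004 rather than 1005 because `count()` tallies non-null values column by column. When a single number of matching rows is wanted, summing the boolean mask is a common alternative. The sketch below uses a small hypothetical frame (`toy`) with one missing bean type to show the difference; it is an illustration, not the activity's required answer format.

```python
import pandas as pd

# Hypothetical stand-in: four bars, one with a missing bean type
toy = pd.DataFrame({
    'Rating': [2.5, 3.5, 3.75, 4.0],
    'Bean\nType': ['Criollo', None, 'Blend', 'Trinitario'],
})

mean_rating = toy['Rating'].mean()        # 3.4375
mask = toy['Rating'] > mean_rating        # True for the last three rows

per_column = toy[mask].count()            # Series: non-null count per column
n_rows = int(mask.sum())                  # scalar: number of matching rows

print(per_column)
print(n_rows)
```

Here `per_column` reports 3 for `Rating` but 2 for `Bean\nType`, while `n_rows` is 3 regardless of missing values.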
Just for Exploration¶
Do you enjoy a bit of a bitter kick in your chocolate? I've heard the more cocoa solids, the more intense the bitterness. But, does a higher cocoa percentage translate into a better rating, or do our taste buds crave something different? Let’s dive in and see how the cocoa percentage really influences expert ratings. You might be surprised by what makes a chocolate bar truly elite!
In [58]:
# Create scatter plot for cocoa percentage vs rating
plt.figure(figsize=(16, 12))
sns.scatterplot(data=data, x='Cocoa\nPercent', y='Rating', hue='Cocoa\nPercent', palette='coolwarm')
plt.title('Cocoa Percentage vs. Chocolate Rating')
plt.xlabel('Cocoa Percent')
plt.ylabel('Rating')
plt.show()
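A scatter plot shows the shape of the relationship; a correlation coefficient puts a single number on its direction and strength. As a rough sketch, `Series.corr` computes the Pearson correlation between two columns. The four data points below are hypothetical and say nothing about the real dataset; on the notebook's `data`, the same call would be applied to `data['Cocoa\nPercent']` and `data['Rating']` after the percent column has been converted to float.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the call
toy = pd.DataFrame({
    'Cocoa\nPercent': [60.0, 70.0, 80.0, 90.0],
    'Rating': [3.5, 3.75, 3.0, 2.5],
})

# Pearson correlation: -1 (perfectly inverse) to +1 (perfectly aligned)
r = toy['Cocoa\nPercent'].corr(toy['Rating'])

print(round(r, 2))
```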
5. Identify Beans with High Cocoa and High Rating!¶
In [137]:
filtered_chocolates_series = data[(data['Cocoa\nPercent'] > 60) & (data['Rating'] >= 4.0)]['Specific Bean Origin\nor Bar Name']
In [70]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Company (Maker-if known)          1795 non-null   object
 1   Specific Bean Origin or Bar Name  1795 non-null   object
 2   REF                               1795 non-null   int64
 3   Review Date                       1795 non-null   int64
 4   Cocoa Percent                     1795 non-null   float64
 5   Company Location                  1795 non-null   object
 6   Rating                            1795 non-null   float64
 7   Bean Type                         1794 non-null   object
 8   Broad Bean Origin                 1794 non-null   object
dtypes: float64(2), int64(2), object(5)
memory usage: 126.3+ KB
In [74]:
data[(data['Cocoa\nPercent'] > 60) & (data['Rating'] >= 4.0)]['Specific Bean Origin\nor Bar Name']
Out[74]:
9       Pablino
17      Chuao
20      Chanchamayo Province
54      Morobe
56      Guayas
        ...
1687    Porcelana, Pedegral
1693    Manjari
1699    Guanaja
1739    Los Llanos
1756    Ocumare
Name: Specific Bean Origin\nor Bar Name, Length: 99, dtype: object
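The filter above combines two conditions with `&`, and each comparison needs its own parentheses because `&` binds more tightly than `>` or `>=`. One way to keep this readable is to name each mask and select the column with `.loc` in a single step. The sketch below uses a hypothetical four-row frame (`toy`); the mask names are illustrative, not from the notebook.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as the notebook
toy = pd.DataFrame({
    'Specific Bean Origin\nor Bar Name': ['Pablino', 'Kpime', 'Chuao', 'Raw'],
    'Cocoa\nPercent': [70.0, 70.0, 55.0, 80.0],
    'Rating': [4.0, 2.75, 4.0, 4.25],
})

# Name each condition so the combined filter reads like a sentence
high_cocoa = toy['Cocoa\nPercent'] > 60
high_rating = toy['Rating'] >= 4.0

# .loc[row_mask, column] filters rows and picks the column in one step
names = toy.loc[high_cocoa & high_rating, 'Specific Bean Origin\nor Bar Name']

print(names.tolist())
```

The result is still a Series that keeps the original row labels, just like the chained `data[...][...]` form used in the cell above.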
In [63]:
data
Out[63]:
| Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
---|---|---|---|---|---|---|---|---|---|
0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | | Sao Tome |
1 | A. Morin | Kpime | 1676 | 2015 | 70.0 | France | 2.75 | | Togo |
2 | A. Morin | Atsane | 1676 | 2015 | 70.0 | France | 3.00 | | Togo |
3 | A. Morin | Akata | 1680 | 2015 | 70.0 | France | 3.50 | | Togo |
4 | A. Morin | Quilla | 1704 | 2015 | 70.0 | France | 3.50 | | Peru |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1790 | Zotter | Peru | 647 | 2011 | 70.0 | Austria | 3.75 | | Peru |
1791 | Zotter | Congo | 749 | 2011 | 65.0 | Austria | 3.00 | Forastero | Congo |
1792 | Zotter | Kerala State | 749 | 2011 | 65.0 | Austria | 3.50 | Forastero | India |
1793 | Zotter | Kerala State | 781 | 2011 | 62.0 | Austria | 3.25 | | India |
1794 | Zotter | Brazil, Mitzi Blue | 486 | 2010 | 65.0 | Austria | 3.00 | | Brazil |
1795 rows × 9 columns
6. Count Extreme Chocolates¶
In [81]:
extreme_chocolates = data[(data['Rating'] < 2) | (data['Cocoa\nPercent'] > 90)].count()
In [80]:
Out[80]:
Company \n(Maker-if known)           34
Specific Bean Origin\nor Bar Name    34
REF                                  34
Review\nDate                         34
Cocoa\nPercent                       34
Company\nLocation                    34
Rating                               34
Bean\nType                           34
Broad Bean\nOrigin                   34
dtype: int64
7. What is the correct syntax to filter chocolates with a rating greater than 4.5 and a cocoa percentage less than 70%?¶
In [90]:
# Provide the correct syntax to filter chocolates with a rating greater than `4.5` and a cocoa percentage less than `70%`.
data[(data['Rating'] > 4.5) & (data['Cocoa\nPercent'] < 70)].count()
Out[90]:
Company \n(Maker-if known)           0
Specific Bean Origin\nor Bar Name    0
REF                                  0
Review\nDate                         0
Cocoa\nPercent                       0
Company\nLocation                    0
Rating                               0
Bean\nType                           0
Broad Bean\nOrigin                   0
dtype: int64
8. Count High-Rated Venezuelan Chocolates¶
In [98]:
venezuela_chocolates = data[(data['Broad Bean\nOrigin'] == 'Venezuela') & (data['Rating'] > 3.5)].count()
In [99]:
venezuela_chocolates
Out[99]:
Company \n(Maker-if known)           54
Specific Bean Origin\nor Bar Name    54
REF                                  54
Review\nDate                         54
Cocoa\nPercent                       54
Company\nLocation                    54
Rating                               54
Bean\nType                           54
Broad Bean\nOrigin                   54
dtype: int64
Just for Exploration¶
Ever wondered which country produces the finest chocolate? Is it the lush rainforests of South America, or perhaps the exotic plantations of Africa? This visualization uncovers the countries that consistently produce the highest-rated chocolate bars. You might find some surprising contenders that produce top-quality bars recognized by chocolate connoisseurs worldwide!
In [103]:
# Group by 'Company Location' and calculate the mean rating
top_countries = data.groupby('Company\nLocation')['Rating'].mean().sort_values(ascending=False).head(10)
# Plot the top 10 countries with highest-rated chocolates
plt.figure(figsize=(10, 6))
# Assign the y variable to `hue` and disable the legend, as recent seaborn versions require when `palette` is used
sns.barplot(x=top_countries.values, y=top_countries.index, hue=top_countries.index, palette='viridis', legend=False)
plt.title('Top 10 Countries Producing Highest-Rated Chocolate Bars')
plt.xlabel('Average Rating')
plt.ylabel('Company Location')
plt.show()
9. Recent High-Rated Bars¶
In [110]:
recent_high_rated_count = ...
In [116]:
recent_high_rated_count = data[(data['Review\nDate'] > 2015) & (data['Rating'] >= 4)].count()
In [109]:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[109], line 1
----> 1 data[(data['Review\nDate' >2015]) & (data['Rating'] >=4)].count()

TypeError: '>' not supported between instances of 'str' and 'int'
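The traceback above comes from a misplaced closing bracket: in `data['Review\nDate' >2015]` the comparison sits inside the square brackets, so Python tries to compare the string 'Review\nDate' with the integer 2015 before any column lookup happens. The sketch below reproduces the error on a hypothetical three-row frame (`toy`) and then shows the bracket placement that compares the column instead.

```python
import pandas as pd

# Hypothetical stand-in for the review years and ratings
toy = pd.DataFrame({
    'Review\nDate': [2014, 2016, 2017],
    'Rating': [4.0, 3.5, 4.0],
})

try:
    # Bracket closes too late: the str > int comparison is evaluated first
    toy['Review\nDate' > 2015]
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Close the column lookup first, then compare the resulting Series
recent_high = toy[(toy['Review\nDate'] > 2015) & (toy['Rating'] >= 4)]

print(error_name)
print(recent_high.index.tolist())
```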
10. Most Common Bean Origin for Highly Rated Chocolates¶
In [118]:
top_rated_common_origin = ...
In [122]:
data['Broad Bean\nOrigin'].mode()
Out[122]:
0    Venezuela
Name: Broad Bean\nOrigin, dtype: object
In [126]:
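# Most common bean origin among chocolates rated above 3.5
top_rated_common_origin = data[data['Rating'] > 3.5]['Broad Bean\nOrigin'].mode().iloc[0]
top_rated_common_origin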
Out[126]:
'Venezuela'
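`mode()` returns a Series rather than a single value because several values can tie for most frequent; the modes come back in sorted order, so `.iloc[0]` picks the alphabetically first one in a tie. A small sketch on hypothetical origins (the Series `origins` is made up for illustration) shows the tie case and how `value_counts()` exposes the underlying frequencies:

```python
import pandas as pd

# Hypothetical origins where two values tie for most frequent
origins = pd.Series(['Venezuela', 'Peru', 'Venezuela', 'Ecuador', 'Peru'])

modes = origins.mode()          # all tied modes, in sorted order
first_mode = modes.iloc[0]      # what .mode().iloc[0] would report

counts = origins.value_counts() # frequency of each value

print(modes.tolist())
print(counts.to_dict())
```

Checking `len(modes)` before taking `.iloc[0]` is one way to notice when the "most common" value is not unique.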
11. Average Rating by Company Location¶
In [130]:
# Get unique company locations
unique_locations = data['Company\nLocation'].unique()
In [131]:
# Initialize a dictionary to store results
location_stats = {}
# Calculate stats for each location
for location in unique_locations:
    location_data = data[data['Company\nLocation'] == location]
    count = len(location_data)
    if count >= 10:  # Only consider locations with at least 10 reviews
        mean_rating = location_data['Rating'].mean()  # calculate mean rating
        location_stats[location] = mean_rating
In [133]:
# Convert results to a Series and sort
avg_rating_by_location = pd.Series(location_stats).sort_values(ascending=False)
avg_rating_by_location
Out[133]:
Vietnam        3.409091
Brazil         3.397059
Australia      3.357143
Guatemala      3.350000
Switzerland    3.342105
Italy          3.325397
Scotland       3.325000
Canada         3.324000
Denmark        3.283333
Spain          3.270000
France         3.251603
Austria        3.240385
Hungary        3.204545
New Zealand    3.191176
Germany        3.178571
Venezuela      3.175000
Colombia       3.173913
U.S.A.         3.154123
Madagascar     3.147059
Belgium        3.093750
Japan          3.088235
U.K.           3.054688
Ecuador        3.009259
Peru           2.897059
dtype: float64
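The loop above filters, counts, and averages one location at a time. As an alternative sketch (not the approach the activity uses), `groupby` with `agg` can compute the count and mean for every location in one pass, after which the minimum-review filter is a plain boolean mask. The frame `toy` and the threshold `min_reviews = 2` are hypothetical; the notebook's threshold is 10.

```python
import pandas as pd

# Hypothetical stand-in: three locations with 3, 2, and 1 reviews
toy = pd.DataFrame({
    'Company\nLocation': ['France', 'France', 'France', 'Italy', 'Italy', 'Fiji'],
    'Rating': [3.0, 3.5, 4.0, 4.0, 3.5, 5.0],
})
min_reviews = 2  # the notebook uses 10; lowered here to suit the toy data

# One pass: review count and mean rating per location
stats = toy.groupby('Company\nLocation')['Rating'].agg(['count', 'mean'])

# Keep locations with enough reviews, then sort by mean rating
avg_by_location = (
    stats.loc[stats['count'] >= min_reviews, 'mean']
    .sort_values(ascending=False)
)

print(avg_by_location)
```

Fiji has the highest single rating but drops out because it has only one review, which mirrors the purpose of the `count >= 10` check in the loop.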