Intro to Pandas for Data Analysis
Cocoa Curations: Series Filtering with Chocolate Ratings
Let's Start
In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Loading the dataset
data = pd.read_csv('flavors_of_cocoa.csv')
In [3]:
data.head()
Out[3]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63% | France | 3.75 | | Sao Tome |
| 1 | A. Morin | Kpime | 1676 | 2015 | 70% | France | 2.75 | | Togo |
| 2 | A. Morin | Atsane | 1676 | 2015 | 70% | France | 3.00 | | Togo |
| 3 | A. Morin | Akata | 1680 | 2015 | 70% | France | 3.50 | | Togo |
| 4 | A. Morin | Quilla | 1704 | 2015 | 70% | France | 3.50 | | Peru |
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Company (Maker-if known)          1795 non-null   object
 1   Specific Bean Origin or Bar Name  1795 non-null   object
 2   REF                               1795 non-null   int64
 3   Review Date                       1795 non-null   int64
 4   Cocoa Percent                     1795 non-null   object
 5   Company Location                  1795 non-null   object
 6   Rating                            1795 non-null   float64
 7   Bean Type                         1794 non-null   object
 8   Broad Bean Origin                 1794 non-null   object
dtypes: float64(1), int64(2), object(6)
memory usage: 126.3+ KB
In [5]:
# Note: several column names contain `\n`, so run the following line to see their exact spellings
data.columns
Out[5]:
Index(['Company \n(Maker-if known)', 'Specific Bean Origin\nor Bar Name', 'REF', 'Review\nDate', 'Cocoa\nPercent', 'Company\nLocation', 'Rating', 'Bean\nType', 'Broad Bean\nOrigin'], dtype='object')
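Because the names contain a literal `\n`, every lookup has to spell the newline out. The following is a minimal sketch on a made-up two-row frame (`sample`, `exact`, and `clean` are illustrative names, not part of the project). It shows two ways to cope: select with the exact name, or work on a copy whose names have the whitespace collapsed.

```python
import pandas as pd

# Hypothetical frame that mimics the awkward column names
sample = pd.DataFrame({
    "Cocoa\nPercent": ["63%", "70%"],
    "Review\nDate": [2016, 2015],
    "Rating": [3.75, 2.75],
})

# Option 1: select with the exact name, newline included
exact = sample["Cocoa\nPercent"]

# Option 2: on a copy, collapse every run of whitespace (including "\n") into one space
clean = sample.copy()
clean.columns = [" ".join(name.split()) for name in clean.columns]

print(list(clean.columns))
```

The cells in this notebook use the first option and keep the original names.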
In [6]:
data.describe()
Out[6]:
| | REF | Review\nDate | Rating |
|---|---|---|---|
| count | 1795.000000 | 1795.000000 | 1795.000000 |
| mean | 1035.904735 | 2012.325348 | 3.185933 |
| std | 552.886365 | 2.927210 | 0.478062 |
| min | 5.000000 | 2006.000000 | 1.000000 |
| 25% | 576.000000 | 2010.000000 | 2.875000 |
| 50% | 1069.000000 | 2013.000000 | 3.250000 |
| 75% | 1502.000000 | 2015.000000 | 3.500000 |
| max | 1952.000000 | 2017.000000 | 5.000000 |
1. Which of the following methods is used to find the top 5 chocolates based on their Rating?
In [16]:
# Identify the top 5 chocolates based on their rating
In [17]:
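# Caution: nsmallest returns the five LOWEST-rated bars; data.nlargest(5, "Rating") returns the top five
data.nsmallest(5, "Rating")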
Out[17]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 326 | Callebaut | Baking | 141 | 2007 | 70% | Belgium | 1.0 | | Ecuador |
| 437 | Claudio Corallo | Principe | 252 | 2008 | 100% | Sao Tome | 1.0 | Forastero | Sao Tome & Principe |
| 465 | Cote d' Or (Kraft) | Sensations Intense | 48 | 2006 | 70% | Belgium | 1.0 | | |
| 1175 | Neuhaus (Callebaut) | Dark | 135 | 2007 | 73% | Belgium | 1.0 | | |
| 245 | Bonnat | One Hundred | 81 | 2006 | 100% | France | 1.5 | | |
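The rows above are the lowest-rated bars, because `nsmallest` was called. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration) contrasts it with `nlargest`, the method that returns the top rows:

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset
toy = pd.DataFrame({
    "Bar": ["A", "B", "C", "D", "E", "F"],
    "Rating": [3.0, 4.0, 1.5, 3.75, 2.0, 3.5],
})

top_two = toy.nlargest(2, "Rating")      # highest ratings first
bottom_two = toy.nsmallest(2, "Rating")  # lowest ratings first

print(top_two["Bar"].tolist())     # ['B', 'D']
print(bottom_two["Bar"].tolist())  # ['C', 'E']
```

On the project data, the equivalent call would be `data.nlargest(5, "Rating")`.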
2. Identify Low-Rated Chocolate Bars
In [21]:
data.loc[data["Rating"] < 2].count()
Out[21]:
Company \n(Maker-if known)           17
Specific Bean Origin\nor Bar Name    17
REF                                  17
Review\nDate                         17
Cocoa\nPercent                       17
Company\nLocation                    17
Rating                               17
Bean\nType                           17
Broad Bean\nOrigin                   17
dtype: int64
In [22]:
low_rated_count = data.loc[data["Rating"] < 2].count()
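Note that `DataFrame.count()` returns one non-null count per column, so `low_rated_count` is a Series rather than a single number. If a scalar is wanted, `len()` or summing the boolean mask gives one. This sketch uses a made-up frame (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical data with one missing origin
toy = pd.DataFrame({
    "Bar": ["A", "B", "C", "D"],
    "Rating": [1.0, 1.5, 3.0, 3.5],
    "Origin": ["Peru", None, "Togo", "Fiji"],
})

low = toy.loc[toy["Rating"] < 2]

per_column = low.count()                     # Series: non-null values in each column
n_rows = len(low)                            # integer: number of matching rows
n_rows_alt = int((toy["Rating"] < 2).sum())  # same integer, taken from the mask

print(per_column.to_dict())
print(n_rows, n_rows_alt)
```

The per-column counts differ from the row count whenever a matching row has a missing value, as the `Origin` column shows here.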
3. High Cocoa Percent Chocolates
In [ ]:
# Converting the 'Cocoa\nPercent' column into a float for filtering
data['Cocoa\nPercent'] = data['Cocoa\nPercent'].str.rstrip('%').astype(float)
high_cocoa_chocolates = ...
In [24]:
data['Cocoa\nPercent'] = data['Cocoa\nPercent'].str.rstrip('%').astype(float)
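The conversion strips the trailing `%` and casts the remaining text to `float`. A small sketch on made-up values (`percent_text` is an assumption) shows the result. It also shows why the conversion should run only once: after the first run the column holds floats, and the `.str` accessor no longer applies.

```python
import pandas as pd

percent_text = pd.Series(["63%", "70%", "100%"])  # hypothetical values

as_float = percent_text.str.rstrip("%").astype(float)
print(as_float.tolist())  # [63.0, 70.0, 100.0]

# Repeating the conversion on the float result fails: .str needs string values
try:
    as_float.str.rstrip("%")
    rerun_failed = False
except AttributeError:
    rerun_failed = True

print(rerun_failed)  # True
```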
In [27]:
data.loc[data["Cocoa\nPercent"] > 70]
Out[27]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 26 | Adi | Vanua Levu, Toto-A | 705 | 2011 | 80.0 | Fiji | 3.25 | Trinitario | Fiji |
| 27 | Adi | Vanua Levu | 705 | 2011 | 88.0 | Fiji | 3.50 | Trinitario | Fiji |
| 28 | Adi | Vanua Levu, Ami-Ami-CA | 705 | 2011 | 72.0 | Fiji | 3.50 | Trinitario | Fiji |
| 32 | Akesson's (Pralus) | Bali (west), Sukrama Family, Melaya area | 636 | 2011 | 75.0 | Switzerland | 3.75 | Trinitario | Indonesia |
| 33 | Akesson's (Pralus) | Madagascar, Ambolikapiky P. | 502 | 2010 | 75.0 | Switzerland | 2.75 | Criollo | Madagascar |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1778 | Zotter | Raw | 1205 | 2014 | 80.0 | Austria | 2.75 | | |
| 1779 | Zotter | Bocas del Toro, Cocabo Co-op | 801 | 2012 | 72.0 | Austria | 3.50 | | Panama |
| 1784 | Zotter | El Oro | 879 | 2012 | 75.0 | Austria | 3.00 | Forastero (Nacional) | Ecuador |
| 1785 | Zotter | Huiwani Coop | 879 | 2012 | 75.0 | Austria | 3.00 | Criollo, Trinitario | Papua New Guinea |
| 1786 | Zotter | El Ceibo Coop | 879 | 2012 | 90.0 | Austria | 3.25 | | Bolivia |
795 rows × 9 columns
4. Count Chocolates Above Average Rating
In [29]:
mean_rating = data["Rating"].mean()
mean_rating
Out[29]:
3.185933147632312
In [30]:
# calculate mean rating
mean_rating = data["Rating"].mean()
# Now create a Series storing, per column, the count of chocolate bars whose rating is above the mean rating we calculated
above_avg_chocolates = data.loc[data["Rating"] > mean_rating].count()
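The comparison against the mean is strict, so a bar rated exactly at the mean is not counted. A minimal sketch with made-up ratings (the `ratings` values are assumptions) makes that visible and reduces the result to a single integer:

```python
import pandas as pd

ratings = pd.Series([2.0, 3.0, 3.5, 4.0, 2.5])  # hypothetical ratings

mean_rating = ratings.mean()        # 3.0 for these values
above_mask = ratings > mean_rating  # strictly greater, so the 3.0 entry is excluded
n_above = int(above_mask.sum())     # True counts as 1 when summed

print(mean_rating, n_above)  # 3.0 2
```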
Just for Exploration
Do you enjoy a bit of a bitter kick in your chocolate? I've heard the more cocoa solids, the more intense the bitterness. But, does a higher cocoa percentage translate into a better rating, or do our taste buds crave something different? Let’s dive in and see how the cocoa percentage really influences expert ratings. You might be surprised by what makes a chocolate bar truly elite!
In [32]:
# Create scatter plot for cocoa percentage vs rating
plt.figure(figsize=(16, 12))
sns.scatterplot(data=data, x='Cocoa\nPercent', y='Rating', hue='Cocoa\nPercent', palette='coolwarm')
plt.title('Cocoa Percentage vs. Chocolate Rating')
plt.xlabel('Cocoa Percent')
plt.ylabel('Rating')
plt.show()
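A scatter plot shows the shape of the relationship, and a single number can complement it. The notebook does not compute one, so the following is only a hedged sketch. It assumes Pearson correlation is an acceptable summary and uses made-up values (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical values in which higher cocoa goes with lower ratings
toy = pd.DataFrame({
    "Cocoa Percent": [60.0, 70.0, 80.0, 90.0],
    "Rating": [3.5, 3.5, 3.0, 2.5],
})

# Series.corr uses the Pearson method by default and returns a value in [-1, 1]
r = toy["Cocoa Percent"].corr(toy["Rating"])
print(round(r, 3))
```

On the project data, the same call would be `data['Cocoa\nPercent'].corr(data['Rating'])`, run after the float conversion. A value near zero would suggest little linear relationship.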
5. Identify Beans with High Cocoa and High Rating!
In [41]:
data.head()
Out[41]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | | Sao Tome |
| 1 | A. Morin | Kpime | 1676 | 2015 | 70.0 | France | 2.75 | | Togo |
| 2 | A. Morin | Atsane | 1676 | 2015 | 70.0 | France | 3.00 | | Togo |
| 3 | A. Morin | Akata | 1680 | 2015 | 70.0 | France | 3.50 | | Togo |
| 4 | A. Morin | Quilla | 1704 | 2015 | 70.0 | France | 3.50 | | Peru |
In [43]:
filtered_chocolates_series = data.loc[
    (data["Cocoa\nPercent"] > 60) & (data["Rating"] >= 4),
    "Specific Bean Origin\nor Bar Name",
]
In [44]:
filtered_chocolates_series
Out[44]:
9                    Pablino
17                     Chuao
20      Chanchamayo Province
54                    Morobe
56                    Guayas
                ...
1687     Porcelana, Pedegral
1693                 Manjari
1699                 Guanaja
1739              Los Llanos
1756                 Ocumare
Name: Specific Bean Origin\nor Bar Name, Length: 99, dtype: object
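Two details of this filter are easy to get wrong. Each comparison needs its own parentheses, and the result is a Series only because a single column is selected. A minimal sketch on made-up rows (`toy` and its columns are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "Bar": ["A", "B", "C", "D"],
    "Cocoa": [55.0, 70.0, 75.0, 85.0],
    "Rating": [4.0, 4.0, 3.5, 4.25],
})

# & binds tighter than > and >=, so each comparison is wrapped in parentheses
mask = (toy["Cocoa"] > 60) & (toy["Rating"] >= 4)

# .loc[rows, column] filters rows and picks one column in a single step
names = toy.loc[mask, "Bar"]

print(type(names).__name__, names.tolist())  # Series ['B', 'D']
```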
6. Count Extreme Chocolates
In [48]:
extreme_chocolates = data.loc[
(data["Rating"] < 2) | (data["Cocoa\nPercent"] >90)
].count()
In [50]:
extreme_chocolates
Out[50]:
Company \n(Maker-if known)           34
Specific Bean Origin\nor Bar Name    34
REF                                  34
Review\nDate                         34
Cocoa\nPercent                       34
Company\nLocation                    34
Rating                               34
Bean\nType                           34
Broad Bean\nOrigin                   34
dtype: int64
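With `|`, a row that satisfies both conditions is counted once, so the combined count can be smaller than the sum of the two separate counts. A sketch on made-up rows (`toy` and its values are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": [1.0, 1.5, 3.0, 3.5],
    "Cocoa": [100.0, 70.0, 95.0, 70.0],
})

low_rating = toy["Rating"] < 2   # rows 0 and 1
high_cocoa = toy["Cocoa"] > 90   # rows 0 and 2

n_low = int(low_rating.sum())
n_high = int(high_cocoa.sum())
n_both = int((low_rating & high_cocoa).sum())    # row 0 only
n_either = int((low_rating | high_cocoa).sum())  # rows 0, 1 and 2

print(n_low, n_high, n_both, n_either)  # 2 2 1 3
```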
7. What is the correct syntax to filter chocolates with a rating greater than 4.5 and a cocoa percentage less than 70%?
In [ ]:
# Provide the correct syntax to filter chocolates with a rating greater than `4.5` and a cocoa percentage less than `70%`.
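One form that satisfies the question is sketched below on a made-up frame (`toy` and its values are assumptions). On the project data the same expression would use `data` in place of `toy`, and it should run after the cocoa column has been converted to float.

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": [5.0, 4.5, 4.75, 3.0],
    "Cocoa\nPercent": [65.0, 60.0, 70.0, 55.0],
})

# Both conditions must hold, so they are joined with & and parenthesized separately
result = toy[(toy["Rating"] > 4.5) & (toy["Cocoa\nPercent"] < 70)]

print(result.index.tolist())  # [0]
```

Only the first row passes: 4.5 is not greater than 4.5, and 70 is not less than 70.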
8. Count High-Rated Venezuelan Chocolates
In [51]:
data.head(1)
Out[51]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | | Sao Tome |
In [52]:
venezuela_chocolates = data.loc[
(data["Broad Bean\nOrigin"]=="Venezuela") & (data["Rating"]>3.5)
].count()
Just for Exploration
Ever wondered which country produces the finest chocolate? Is it the lush rainforests of South America, or perhaps the exotic plantations of Africa? This visualization uncovers the countries that consistently produce the highest-rated chocolate bars. You might find some surprising contenders that produce top-quality bars recognized by chocolate connoisseurs worldwide!
In [56]:
# Group by 'Company Location' and calculate the mean rating
top_countries = data.groupby('Company\nLocation')['Rating'].mean().sort_values(ascending=False).head(10)
# Plot the top 10 countries with highest-rated chocolates
plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, hue=top_countries.index, palette='viridis', legend=False)
plt.title('Top 10 Countries Producing Highest-Rated Chocolate Bars')
plt.xlabel('Average Rating')
plt.ylabel('Company Location')
plt.show()
9. Recent High-Rated Bars
In [57]:
data.head(1)
Out[57]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | | Sao Tome |
In [62]:
recent_high_rated_count = data.loc[
(data["Review\nDate"]>2015) & (data["Rating"]>=4)
].count()
10. Most Common Bean Origin for Highly Rated Chocolates
In [76]:
top_rated_common_origin = data.loc[
    data["Rating"] > 3.5, 'Broad Bean\nOrigin'
].mode().iloc[0]
In [77]:
top_rated_common_origin
Out[77]:
'Venezuela'
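`Series.mode()` returns a Series rather than a scalar because several values can tie for most frequent, which is why `.iloc[0]` is needed. A sketch on made-up origins (`origins` is an assumption) shows a tie:

```python
import pandas as pd

origins = pd.Series(["Peru", "Togo", "Peru", "Fiji", "Togo"])  # hypothetical values

modes = origins.mode()  # every value tied for the highest frequency, in sorted order
first = modes.iloc[0]   # picks the first of the tied values

print(modes.tolist(), first)  # ['Peru', 'Togo'] Peru
```

When there is a tie, `.iloc[0]` silently picks the first value in sorted order, so it can be worth checking `len(modes)` before relying on the answer.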
11. Average Rating by Company Location
In [85]:
# Get unique company locations
unique_locations = data["Company\nLocation"].unique()
In [78]:
data.head(1)
Out[78]:
| | Company \n(Maker-if known) | Specific Bean Origin\nor Bar Name | REF | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | Bean\nType | Broad Bean\nOrigin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | | Sao Tome |
In [84]:
data["Company\nLocation"].value_counts()
Out[84]:
Company\nLocation
U.S.A.               764
France               156
Canada               125
U.K.                  96
Italy                 63
Ecuador               54
Australia             49
Belgium               40
Switzerland           38
Germany               35
Austria               26
Spain                 25
Colombia              23
Hungary               22
Venezuela             20
Peru                  17
New Zealand           17
Madagascar            17
Japan                 17
Brazil                17
Denmark               15
Vietnam               11
Guatemala             10
Scotland              10
Argentina              9
Israel                 9
Costa Rica             9
Poland                 8
Honduras               6
Lithuania              6
South Korea            5
Nicaragua              5
Sweden                 5
Domincan Republic      5
Netherlands            4
Mexico                 4
Puerto Rico            4
Fiji                   4
Sao Tome               4
Amsterdam              4
Ireland                4
South Africa           3
Singapore              3
Iceland                3
Portugal               3
Grenada                3
Finland                2
St. Lucia              2
Chile                  2
Bolivia                2
Wales                  1
Russia                 1
Martinique             1
Czech Republic         1
India                  1
Philippines            1
Ghana                  1
Niacragua              1
Eucador                1
Suriname               1
Name: count, dtype: int64
In [86]:
# Initialize a dictionary to store results
location_stats = {}
# Calculate stats for each location
for location in unique_locations:
    location_data = data[data['Company\nLocation'] == location]
    count = len(location_data)
    if count >= 10:  # Only consider locations with at least 10 reviews
        mean_rating = location_data["Rating"].mean()  # calculate mean rating
        location_stats[location] = mean_rating
In [87]:
location_stats
Out[87]:
{'France': 3.2516025641025643, 'U.S.A.': 3.1541230366492146, 'Ecuador': 3.009259259259259, 'Switzerland': 3.3421052631578947, 'Spain': 3.27, 'Peru': 2.8970588235294117, 'Canada': 3.324, 'Italy': 3.3253968253968256, 'Brazil': 3.3970588235294117, 'U.K.': 3.0546875, 'Australia': 3.357142857142857, 'Belgium': 3.09375, 'Germany': 3.1785714285714284, 'Venezuela': 3.175, 'Colombia': 3.1739130434782608, 'Japan': 3.088235294117647, 'New Zealand': 3.1911764705882355, 'Scotland': 3.325, 'Guatemala': 3.35, 'Denmark': 3.283333333333333, 'Vietnam': 3.409090909090909, 'Madagascar': 3.1470588235294117, 'Austria': 3.2403846153846154, 'Hungary': 3.2045454545454546}
In [110]:
# Convert results to a Series and sort
avg_rating_by_location = pd.Series(location_stats).sort_values(ascending=False)
In [109]:
avg_rating_by_location
Out[109]:
Vietnam        3.409091
Brazil         3.397059
Australia      3.357143
Guatemala      3.350000
Switzerland    3.342105
dtype: float64
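The loop above filters locations by review count and then averages their ratings. As an alternative sketch, not the approach the notebook takes, the same idea can be written with `groupby` and `agg`. The frame `toy`, the column `Location`, and the threshold `min_reviews` are assumptions, with the threshold lowered to fit the tiny example:

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["X", "X", "X", "Y", "Y", "Z"],
    "Rating": [3.0, 3.5, 4.0, 2.0, 3.0, 5.0],
})

min_reviews = 2  # stand-in for the threshold of 10 used in the loop

# One pass computes both the mean rating and the review count per location
stats = toy.groupby("Location")["Rating"].agg(["mean", "count"])

# Keep locations with enough reviews, then sort their means from high to low
avg_by_location = (
    stats.loc[stats["count"] >= min_reviews, "mean"]
    .sort_values(ascending=False)
)

print(avg_by_location.to_dict())  # {'X': 3.5, 'Y': 2.5}
```

Location `Z` has the highest mean but only one review, so the threshold removes it. The loop's `count >= 10` check guards against the same effect.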