Statement of Completion#27d4dde8
Intro to Pandas for Data Analysis
easy
Mastering DataFrame Mutations with Wine Quality data
Resolution
Activities
Project.ipynb
In [56]:
import pandas as pd
import numpy as np
Basic Analysis¶
In [57]:
wine_quality_df = pd.read_csv('winequality-red.csv', sep= ';')
In [58]:
wine_quality_df.head()
Out[58]:
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
In [59]:
wine_quality_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1599 entries, 0 to 1598 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fixed acidity 1599 non-null float64 1 volatile acidity 1599 non-null float64 2 citric acid 1599 non-null float64 3 residual sugar 1599 non-null float64 4 chlorides 1599 non-null float64 5 free sulfur dioxide 1599 non-null float64 6 total sulfur dioxide 1599 non-null float64 7 density 1599 non-null float64 8 pH 1599 non-null float64 9 sulphates 1599 non-null float64 10 alcohol 1599 non-null float64 11 quality 1599 non-null int64 dtypes: float64(11), int64(1) memory usage: 150.0 KB
In [60]:
wine_quality_df.describe()
Out[60]:
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
To ensure the integrity of our original data set, it's a best practice to work with a copy of the data frame when performing data manipulation. By creating a copy, we can freely experiment with various techniques and make modifications without affecting the original data. This way, we can have peace of mind knowing that the original data set remains untouched.
In [61]:
df = wine_quality_df.copy()
1. What is maximum amount of citric acid in the wine? Enter the answer to 1 decimal point.
In [62]:
df['quality'].median()
Out[62]:
6.0
2. How many missing values are in the dataset?
In [63]:
df.isnull().sum().sum()
Out[63]:
0
3. What is the median wine quality?
In [64]:
df['quality'].median()
Out[64]:
6.0
Row and Column modification¶
4. Rename the columns to have underscore instead of space. For example old name: fixed acidity to new name: fixed_acidity
In [65]:
df.rename(columns = {"fixed acidity": "fixed_acidity",
"volatile acidity": "volatile_acidity",
"citric acid": "citric_acid",
"residual sugar": "residual_sugar",
"fixed acidity": "fixed_acidity",
"free sulfur dioxide": "free_sulfur_dioxide",
"total sulfur dioxide": "total_sulfur_dioxide"}, inplace=True)
In [ ]:
5. Drop the first and last row
In [66]:
df_first_last = df.drop([df.index[0],df.index[-1]])
6. The dataset contains an outlier. Remove the row where that contains the maximum total sulfur dioxide.
In [73]:
df_drop = df.drop(df[''])
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[73], line 1 ----> 1 df_drop = df.drop(df.loc[df['total_sulfur_dioxide']].idxmax()) File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key) 1189 maybe_callable = com.apply_if_callable(key, self.obj) 1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable) -> 1191 return self._getitem_axis(maybe_callable, axis=axis) File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis) 1417 if hasattr(key, "ndim") and key.ndim > 1: 1418 raise ValueError("Cannot index with multidimensional key") -> 1420 return self._getitem_iterable(key, axis=axis) 1422 # nested tuple slicing 1423 if is_nested_tuple(key, labels): File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis) 1357 self._validate_key(key, axis) 1359 # A collection of keys -> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis) 1361 return self.obj._reindex_with_indexers( 1362 {axis: [keyarr, indexer]}, copy=True, allow_dups=True 1363 ) File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis) 1555 ax = self.obj._get_axis(axis) 1556 axis_name = self.obj._get_axis_name(axis) -> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name) 1560 return keyarr, indexer File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name) 6197 else: 6198 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr) -> 6200 self._raise_if_missing(keyarr, indexer, axis_name) 6202 keyarr = self.take(indexer) 6203 if isinstance(key, Index): 6204 # GH 42790 - Preserve name from an Index File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name) 6249 raise KeyError(f"None of [{key}] are in the [{axis_name}]") 6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique()) -> 6252 raise KeyError(f"{not_found} not in index") KeyError: '[77.5] not in index'
7. We notice that all the datatypes are float besides the quality column. Convert the column to float datatype
In [77]:
df["quality_float"] = df['quality'].astype('float64')
8. Remove these columns from the dataset
In [81]:
df_drop_three = df.drop([ 'density', 'residual_sugar', 'chlorides' ],axis=1)
Basic Column operations¶
9. Create a new column that calculates the alcohol content in terms of percentage (%)
In [85]:
df['alcohol_perc'] = df['alcohol']/df['alcohol'].max() *100
10. Create a new column in the data frame that contains the sum of sulfates and citric_acid for the red wine.
In [87]:
df['sulphate_citric_acid']= df['citric_acid'] + df['sulphates']
11. Create a new column that where the alcohol content is less than its mean.
In [88]:
df['deviation_alcohol'] = df['alcohol'] < df['alcohol'].mean()
12. Convert the wine quality scores into categorical labels: "low", "medium", "high".
In [90]:
df['quality_label'] = ['low' if x <= 5 else "medium" if x <= 7 else 'high' for x in df['quality']]
13. Create a new column that calculates the ratio of free sulfur dioxide to total sulfur dioxide.
In [94]:
df['free_total_ratio'] = df['free_sulfur_dioxide'] / df['total_sulfur_dioxide']
The End!¶
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]: