In [56]:

import pandas as pd
import numpy as np

Basic Analysis¶

In [57]:

wine_quality_df = pd.read_csv('winequality-red.csv', sep= ';')

In [58]:

wine_quality_df.head()

Out[58]:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

In [59]:

wine_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

In [60]:

wine_quality_df.describe()

Out[60]:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000

To ensure the integrity of our original data set, it's a best practice to work with a copy of the data frame when performing data manipulation. By creating a copy, we can freely experiment with various techniques and make modifications without affecting the original data. This way, we can have peace of mind knowing that the original data set remains untouched.

In [61]:

df = wine_quality_df.copy()

1. What is maximum amount of citric acid in the wine? Enter the answer to 1 decimal point.

In [62]:

df['quality'].median()

Out[62]:

6.0

2. How many missing values are in the dataset?

In [63]:

df.isnull().sum().sum()

Out[63]:

3. What is the median wine quality?

In [64]:

df['quality'].median()

Out[64]:

6.0

Row and Column modification¶

4. Rename the columns to have underscore instead of space. For example old name: fixed acidity to new name: fixed_acidity

In [65]:

df.rename(columns = {"fixed acidity": "fixed_acidity",
                     "volatile acidity": "volatile_acidity",
                     "citric acid": "citric_acid",
                     "residual sugar": "residual_sugar",
                     "fixed acidity": "fixed_acidity",
                     "free sulfur dioxide": "free_sulfur_dioxide",
                     "total sulfur dioxide": "total_sulfur_dioxide"}, inplace=True)

In [ ]:

5. Drop the first and last row

In [66]:

df_first_last = df.drop([df.index[0],df.index[-1]])

6. The dataset contains an outlier. Remove the row where that contains the maximum total sulfur dioxide.

In [73]:

df_drop = df.drop(df[''])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[73], line 1
----> 1 df_drop = df.drop(df.loc[df['total_sulfur_dioxide']].idxmax())

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1420, in _LocIndexer._getitem_axis(self, key, axis)
   1417     if hasattr(key, "ndim") and key.ndim > 1:
   1418         raise ValueError("Cannot index with multidimensional key")
-> 1420     return self._getitem_iterable(key, axis=axis)
   1422 # nested tuple slicing
   1423 if is_nested_tuple(key, labels):

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1360, in _LocIndexer._getitem_iterable(self, key, axis)
   1357 self._validate_key(key, axis)
   1359 # A collection of keys
-> 1360 keyarr, indexer = self._get_listlike_indexer(key, axis)
   1361 return self.obj._reindex_with_indexers(
   1362     {axis: [keyarr, indexer]}, copy=True, allow_dups=True
   1363 )

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1558, in _LocIndexer._get_listlike_indexer(self, key, axis)
   1555 ax = self.obj._get_axis(axis)
   1556 axis_name = self.obj._get_axis_name(axis)
-> 1558 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
   1560 return keyarr, indexer

File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:6200, in Index._get_indexer_strict(self, key, axis_name)
   6197 else:
   6198     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6200 self._raise_if_missing(keyarr, indexer, axis_name)
   6202 keyarr = self.take(indexer)
   6203 if isinstance(key, Index):
   6204     # GH 42790 - Preserve name from an Index

File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:6252, in Index._raise_if_missing(self, key, indexer, axis_name)
   6249     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6251 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6252 raise KeyError(f"{not_found} not in index")

KeyError: '[77.5] not in index'

7. We notice that all the datatypes are float besides the quality column. Convert the column to float datatype

In [77]:

df["quality_float"] = df['quality'].astype('float64')

8. Remove these columns from the dataset

In [81]:

df_drop_three = df.drop([ 'density', 'residual_sugar', 'chlorides' ],axis=1)

Basic Column operations¶

9. Create a new column that calculates the alcohol content in terms of percentage (%)

In [85]:

df['alcohol_perc'] = df['alcohol']/df['alcohol'].max() *100

10. Create a new column in the data frame that contains the sum of sulfates and citric_acid for the red wine.

In [87]:

df['sulphate_citric_acid']= df['citric_acid'] + df['sulphates']

11. Create a new column that where the alcohol content is less than its mean.

In [88]:

df['deviation_alcohol'] = df['alcohol'] < df['alcohol'].mean()

12. Convert the wine quality scores into categorical labels: "low", "medium", "high".

In [90]:

df['quality_label'] = ['low' if x <= 5 else "medium" if x <= 7 else 'high' for x in df['quality']]

13. Create a new column that calculates the ratio of free sulfur dioxide to total sulfur dioxide.

In [94]:

df['free_total_ratio'] = df['free_sulfur_dioxide'] / df['total_sulfur_dioxide']

The End!¶

In [ ]:

Statement of Completion#27d4dde8

Intro to Pandas for Data Analysis

Mastering DataFrame Mutations with Wine Quality data

Basic Analysis¶

Row and Column modification¶

Basic Column operations¶

The End!¶