Statement of Completion#01b8a03b
Intro to Pandas for Data Analysis
medium
Data at Sea: Series Operations on the Titanic Dataset
Resolution
Activities
Introduction¶
In [7]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# loading the dataset
df = pd.read_csv('titanic.csv')
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 887 entries, 0 to 886 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Survived 887 non-null int64 1 Pclass 887 non-null int64 2 Name 887 non-null object 3 Sex 887 non-null object 4 Age 887 non-null float64 5 Siblings/Spouses Aboard 887 non-null int64 6 Parents/Children Aboard 887 non-null int64 7 Fare 887 non-null float64 dtypes: float64(2), int64(4), object(2) memory usage: 55.6+ KB
In [3]:
df.head()
Out[3]:
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
Warm Up Activities¶
1. What is the primary advantage of using vectorized operations in pandas compared to traditional loops?¶
In [ ]:
2. In the Titanic dataset, which of the following operations would NOT
be considered a vectorized operation?¶
In [4]:
df['Fare'] > 50
Out[4]:
0 False 1 True 2 False 3 True 4 False ... 882 False 883 False 884 False 885 False 886 False Name: Fare, Length: 887, dtype: bool
3. What is average age of passengers?¶
In [5]:
average_age = df['Age'].mean()
average_age
Out[5]:
29.471443066516347
In [10]:
Out[10]:
29.471443066516347
4. Who were the survivors ?¶
In [6]:
df['Age']=df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Age']
/tmp/ipykernel_602/2126654435.py:1: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. df['Age']=df['Age'].fillna(df['Age'].mean(), inplace=True)
Out[6]:
0 None 1 None 2 None 3 None 4 None ... 882 None 883 None 884 None 885 None 886 None Name: Age, Length: 887, dtype: object
In [16]:
df[]
Out[16]:
Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | NewCol | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 | 14.5000 |
1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 | 142.5666 |
2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 | 15.8500 |
3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 | 106.2000 |
4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 | 16.1000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 | 26.0000 |
883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 | 60.0000 |
884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 | 46.9000 |
885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 | 60.0000 |
886 | 0 | 3 | Mr. Patrick Dooley | male | 32.0 | 0 | 0 | 7.7500 | 15.5000 |
887 rows × 9 columns
5. What does the following code do?¶
In [ ]:
Just for Exploration¶
Does wealth determine fate in a crisis? Let’s dive into the class system aboard the Titanic by visualizing survival rates based on passenger class. This bar chart paints a vivid picture of how socio-economic status may have played a role in determining who lived and who perished during one of history's most infamous maritime disasters.
In [25]:
# Plot survival rate by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].mean().reset_index()
sns.barplot(x='Pclass', y='Survived', data=survival_by_class)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.xlabel('Passenger Class')
plt.show()
First-class passengers were significantly more likely to survive, with over 60% survival rate, whereas third-class passengers faced the greatest risk. This stark contrast highlights the impact of social class on survival during the tragedy.
Activities¶
6. Calculating the age difference¶
In [9]:
mean_age = df['Age'].mean()
age_difference = df['Age'] - mean_age
age_difference
Out[9]:
0 -7.471443 1 8.528557 2 -3.471443 3 5.528557 4 5.528557 ... 882 -2.471443 883 -10.471443 884 -22.471443 885 -3.471443 886 2.528557 Name: Age, Length: 887, dtype: float64
7. Normalize the Fare
column¶
In [13]:
fare_min = df['Fare'].min() # min fare
fare_max = df['Fare'].max() # max fare
normalized_fare = (df['Fare'] - fare_min) / (fare_max - fare_min) # normalized fare
normalized_fare
Out[13]:
0 0.014151 1 0.139136 2 0.015469 3 0.103644 4 0.015713 ... 882 0.025374 883 0.058556 884 0.045771 885 0.058556 886 0.015127 Name: Fare, Length: 887, dtype: float64
8. Calculate family size¶
In [19]:
family_size = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] +1
family_size
Out[19]:
0 2 1 2 2 1 3 2 4 1 .. 882 1 883 1 884 4 885 1 886 1 Length: 887, dtype: int64
In [21]:
Out[21]:
0 3.62500 1 35.64165 2 7.92500 3 26.55000 4 8.05000 ... 882 13.00000 883 30.00000 884 5.86250 885 30.00000 886 7.75000 Length: 887, dtype: float64
9. Calculate Fare Per Family Member¶
In [22]:
fare_per_family_member = df['Fare'] / family_size
fare_per_family_member
Out[22]:
0 3.62500 1 35.64165 2 7.92500 3 26.55000 4 8.05000 ... 882 13.00000 883 30.00000 884 5.86250 885 30.00000 886 7.75000 Length: 887, dtype: float64
Just for Exploration¶
Curious about the age makeup of passengers in each class aboard the Titanic? This histogram will take you through the age distribution across the different passenger classes, revealing the dynamics between age and class structure during this fateful journey.
In [24]:
plt.figure(figsize=(10,6))
sns.histplot(df, x='Age', hue='Pclass', multiple='stack', bins=20, palette='coolwarm')
plt.title('Age Distribution by Passenger Class')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
10. Calculate Weighted Age Using Fare Weight¶
In [26]:
fare_weight = df['Fare'] / df['Fare'].max()
weighted_age = df['Age'] * fare_weight
weighted_age
Out[26]:
0 0.311323 1 5.287158 2 0.402183 3 3.627550 4 0.549939 ... 882 0.685106 883 1.112566 884 0.320399 885 1.522459 886 0.484064 Length: 887, dtype: float64
11. Calculate Cumulative Fare Percentage¶
In [37]:
sorted_fares = df['Fare'].sort_values()
cumulative_fare_percentage = (sorted_fares.cumsum() / sorted_fares.sum()) * 100
In [36]:
Out[36]:
799 0.001607 751 0.004170 466 0.007039 641 0.009908 827 0.013083 ... 115 98.867686 95 99.139289 490 99.410891 847 99.693969 627 100.000000 Name: Age, Length: 887, dtype: float64
In [ ]:
12. Identify Fare Outliers Using IQR¶
In [42]:
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
is_fare_outlier = (df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))
is_fare_outlier
Out[42]:
0 False 1 True 2 False 3 False 4 False ... 882 False 883 False 884 False 885 False 886 False Name: Fare, Length: 887, dtype: bool
In [46]:
sorted_df = df.sort_index()
rolling_average_fare = sorted_df['Fare'].rolling(window=10, min_periods=1).mean()
rolling_average_fare
Out[46]:
0 7.250000 1 39.266650 2 28.819433 3 34.889575 4 29.521660 ... 882 20.303740 883 22.514160 884 24.069580 885 18.753750 886 16.928750 Name: Fare, Length: 887, dtype: float64
13. Calculate Rolling Average of Fare¶
In [ ]:
sorted_df = ...
rolling_average_fare = ...
Just for Exploration¶
What role does age play in the odds of survival? Let’s take a closer look at how age influenced survival during the Titanic disaster. This distribution plot will reveal whether certain age groups were more likely to make it through the tragedy, uncovering patterns that history may have hidden beneath the waves.
In [44]:
import seaborn as sns
# Create a KDE plot for the age distribution of survivors vs non-survivors
plt.figure(figsize=(10, 6))
sns.kdeplot(df[df['Survived'] == 1]['Age'], label='Survived', color='green')
sns.kdeplot(df[df['Survived'] == 0]['Age'], label='Did Not Survive', color='red')
# Add labels and title
plt.title('Age Distribution for Survivors vs Non-Survivors', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(title='Survival Status')
plt.show()
The age distribution reveals that younger passengers had a slightly higher survival rate. Interestingly, a noticeable drop in survival is observed around mid-adulthood, showing a subtle shift in survival patterns across different age groups.