Statement of Completion#01b8a03b
Intro to Pandas for Data Analysis
medium
Data at Sea: Series Operations on the Titanic Dataset
Introduction¶
In [7]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# loading the dataset
df = pd.read_csv('titanic.csv')
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Survived                 887 non-null    int64
 1   Pclass                   887 non-null    int64
 2   Name                     887 non-null    object
 3   Sex                      887 non-null    object
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64
 6   Parents/Children Aboard  887 non-null    int64
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB
In [3]:
df.head()
Out[3]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
Warm Up Activities¶
1. What is the primary advantage of using vectorized operations in pandas compared to traditional loops?¶
In [ ]:
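As a rough illustration of the question, the sketch below applies the same comparison twice: once with a Python-level loop and once as a single vectorized expression. The `fares` Series holds the first five fares from the `df.head()` output; the names `loop_result` and `vectorized_result` are illustrative only. Both approaches give the same answer, but the vectorized form avoids per-element Python overhead and is shorter to write.

```python
import pandas as pd

# First five fares copied from the df.head() output (stand-in for df['Fare'])
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Loop version: one Python-level comparison per element
loop_result = pd.Series([fare > 50 for fare in fares], index=fares.index)

# Vectorized version: one expression applied to the whole Series at once
vectorized_result = fares > 50

# Both approaches produce the same boolean Series
print(vectorized_result.equals(loop_result))
```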
2. In the Titanic dataset, which of the following operations would NOT be considered a vectorized operation?¶
In [4]:
df['Fare'] > 50
Out[4]:
0 False
1 True
2 False
3 True
4 False
...
882 False
883 False
884 False
885 False
886 False
Name: Fare, Length: 887, dtype: bool
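The comparison returns a boolean Series rather than a single value. A minimal sketch, using the first five fares from the table above as a stand-in for `df['Fare']`, shows two common follow-ups: counting the `True` entries and using the Series as a filter. The name `mask` is illustrative.

```python
import pandas as pd

# First five fares copied from the df.head() output (stand-in for df['Fare'])
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Element-wise comparison yields one True/False per row
mask = fares > 50

# True counts as 1, so the sum is the number of fares above 50
print(mask.sum())  # 2

# Boolean indexing keeps only the rows where the mask is True
print(fares[mask].tolist())  # [71.2833, 53.1]
```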
3. What is the average age of passengers?¶
In [5]:
average_age = df['Age'].mean()
average_age
Out[5]:
29.471443066516347
4. Who were the survivors?¶
In [6]:
# fillna returns a new Series; with inplace=True it returns None,
# so assign the result back and leave inplace out
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age']
Out[6]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
882 27.0
883 19.0
884 7.0
885 26.0
886 32.0
Name: Age, Length: 887, dtype: float64
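The `Age` column in this dataset has no missing values (887 non-null in `df.info()`), so filling with the mean has no visible effect here. A minimal sketch on a hypothetical column with one gap shows the assign-back pattern. `fillna` returns `None` when called with `inplace=True`, so that return value should never be assigned to the column. The `ages` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (not taken from the Titanic data)
ages = pd.DataFrame({'Age': [22.0, np.nan, 26.0]})

# mean() skips NaN by default, so the fill value is (22 + 26) / 2 = 24
ages['Age'] = ages['Age'].fillna(ages['Age'].mean())

print(ages['Age'].tolist())  # [22.0, 24.0, 26.0]
```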
In [16]:
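# Assumed reconstruction of an incomplete cell: in the output below, NewCol is twice Fare
df['NewCol'] = df['Fare'] * 2
df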
Out[16]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | NewCol |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 | 14.5000 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 | 142.5666 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 | 15.8500 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 | 106.2000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 | 16.1000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 | 26.0000 |
| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 | 60.0000 |
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 | 46.9000 |
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 | 60.0000 |
| 886 | 0 | 3 | Mr. Patrick Dooley | male | 32.0 | 0 | 0 | 7.7500 | 15.5000 |
887 rows × 9 columns
5. What does the following code do?¶
In [ ]:
Just for Exploration¶
Does wealth determine fate in a crisis? Let’s dive into the class system aboard the Titanic by visualizing survival rates based on passenger class. This bar chart paints a vivid picture of how socio-economic status may have played a role in determining who lived and who perished during one of history's most infamous maritime disasters.
In [25]:
# Plot survival rate by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].mean().reset_index()
sns.barplot(x='Pclass', y='Survived', data=survival_by_class)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.xlabel('Passenger Class')
plt.show()
First-class passengers were the most likely to survive, with a survival rate above 60%, whereas third-class passengers faced the greatest risk. This stark contrast highlights the impact of social class on survival during the tragedy.
Activities¶
6. Calculating the age difference¶
In [9]:
mean_age = df['Age'].mean()
age_difference = df['Age'] - mean_age
age_difference
Out[9]:
0 -7.471443
1 8.528557
2 -3.471443
3 5.528557
4 5.528557
...
882 -2.471443
883 -10.471443
884 -22.471443
885 -3.471443
886 2.528557
Name: Age, Length: 887, dtype: float64
7. Normalize the Fare column¶
In [13]:
fare_min = df['Fare'].min() # min fare
fare_max = df['Fare'].max() # max fare
normalized_fare = (df['Fare'] - fare_min) / (fare_max - fare_min) # normalized fare
normalized_fare
Out[13]:
0 0.014151
1 0.139136
2 0.015469
3 0.103644
4 0.015713
...
882 0.025374
883 0.058556
884 0.045771
885 0.058556
886 0.015127
Name: Fare, Length: 887, dtype: float64
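Min-max normalization maps the smallest value to 0 and the largest to 1. A quick sanity check, sketched here on the first five fares from the `df.head()` output rather than the full column, is to confirm those two endpoints. Note that the formula divides by zero if every value in the column is identical; that case is not handled in this sketch.

```python
import pandas as pd

# First five fares copied from the df.head() output (stand-in for df['Fare'])
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Min-max normalization: shift by the minimum, scale by the range
normalized = (fares - fares.min()) / (fares.max() - fares.min())

# The smallest fare maps to 0 and the largest to 1
print(normalized.min(), normalized.max())  # 0.0 1.0
```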
8. Calculate family size¶
In [19]:
family_size = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1
family_size
Out[19]:
0 2
1 2
2 1
3 2
4 1
..
882 1
883 1
884 4
885 1
886 1
Length: 887, dtype: int64
9. Calculate Fare Per Family Member¶
In [22]:
fare_per_family_member = df['Fare'] / family_size
fare_per_family_member
Out[22]:
0 3.62500
1 35.64165
2 7.92500
3 26.55000
4 8.05000
...
882 13.00000
883 30.00000
884 5.86250
885 30.00000
886 7.75000
Length: 887, dtype: float64
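The first five results can be reproduced by hand from the `df.head()` output. The sketch below rebuilds only those five rows in a small frame named `sample` (an illustrative name) and repeats the two calculations. The arithmetic is element-wise and aligned on the index, so each fare is divided by the family size of the same row.

```python
import pandas as pd

# First five rows of the relevant columns, copied from the df.head() output
sample = pd.DataFrame({
    'Siblings/Spouses Aboard': [1, 1, 0, 1, 0],
    'Parents/Children Aboard': [0, 0, 0, 0, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Family size counts the passenger plus any relatives aboard
family_size = sample['Siblings/Spouses Aboard'] + sample['Parents/Children Aboard'] + 1

# Row-by-row division: each fare is split across that row's family size
fare_per_family_member = sample['Fare'] / family_size

print(family_size.tolist())  # [2, 2, 1, 2, 1]
print(fare_per_family_member)
```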
Just for Exploration¶
Curious about the age makeup of passengers in each class aboard the Titanic? This histogram will take you through the age distribution across the different passenger classes, revealing the dynamics between age and class structure during this fateful journey.
In [24]:
plt.figure(figsize=(10,6))
sns.histplot(df, x='Age', hue='Pclass', multiple='stack', bins=20, palette='coolwarm')
plt.title('Age Distribution by Passenger Class')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
10. Calculate Weighted Age Using Fare Weight¶
In [26]:
fare_weight = df['Fare'] / df['Fare'].max()
weighted_age = df['Age'] * fare_weight
weighted_age
Out[26]:
0 0.311323
1 5.287158
2 0.402183
3 3.627550
4 0.549939
...
882 0.685106
883 1.112566
884 0.320399
885 1.522459
886 0.484064
Length: 887, dtype: float64
11. Calculate Cumulative Fare Percentage¶
In [37]:
sorted_fares = df['Fare'].sort_values()
cumulative_fare_percentage = (sorted_fares.cumsum() / sorted_fares.sum()) * 100
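In [36]:
# Reconstructed cell (assumption): the output below is named Age and its first value
# matches the youngest age divided by the sum of all ages, so it appears to come from
# the same calculation applied to Age rather than Fare.
sorted_ages = df['Age'].sort_values()
(sorted_ages.cumsum() / sorted_ages.sum()) * 100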
Out[36]:
799 0.001607
751 0.004170
466 0.007039
641 0.009908
827 0.013083
...
115 98.867686
95 99.139289
490 99.410891
847 99.693969
627 100.000000
Name: Age, Length: 887, dtype: float64
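The sort-then-cumsum pattern is easy to verify by hand on a small example. The sketch below uses a hypothetical three-value Series (not Titanic data). After sorting, the running total is divided by the grand total, so the last entry is always 100, and the original index labels travel with the sorted values.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
values = pd.Series([4.0, 1.0, 3.0])

# Sort ascending; the original index labels stay attached to their values
sorted_values = values.sort_values()

# Running total as a percentage of the grand total (1, 4, 8 out of 8)
cumulative_pct = sorted_values.cumsum() / sorted_values.sum() * 100

print(cumulative_pct.index.tolist())  # [1, 2, 0]
print(cumulative_pct.tolist())        # [12.5, 50.0, 100.0]
```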
In [ ]:
12. Identify Fare Outliers Using IQR¶
In [42]:
Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
is_fare_outlier = (df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))
is_fare_outlier
Out[42]:
0 False
1 True
2 False
3 False
4 False
...
882 False
883 False
884 False
885 False
886 False
Name: Fare, Length: 887, dtype: bool
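To see the IQR rule at work on numbers small enough to check by hand, the sketch below applies the same steps to a hypothetical five-value Series (not Titanic data). The names `lower` and `upper` for the two fences are illustrative.

```python
import pandas as pd

# Hypothetical fares with one obviously extreme value
fares = pd.Series([10.0, 12.0, 14.0, 16.0, 100.0])

q1 = fares.quantile(0.25)  # 12.0
q3 = fares.quantile(0.75)  # 16.0
iqr = q3 - q1              # 4.0

# Fences sit 1.5 * IQR beyond the quartiles: 6.0 and 22.0 here
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A value is flagged if it falls outside either fence
is_outlier = (fares < lower) | (fares > upper)

print(fares[is_outlier].tolist())  # [100.0]
```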
13. Calculate Rolling Average of Fare¶
In [46]:
sorted_df = df.sort_index()
rolling_average_fare = sorted_df['Fare'].rolling(window=10, min_periods=1).mean()
rolling_average_fare
Out[46]:
0 7.250000
1 39.266650
2 28.819433
3 34.889575
4 29.521660
...
882 20.303740
883 22.514160
884 24.069580
885 18.753750
886 16.928750
Name: Fare, Length: 887, dtype: float64
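With `min_periods=1`, the earliest rows are averaged over however many values are available so far, which is why the first rolling value above equals the first fare (7.25). The sketch below contrasts this with the default behaviour on a hypothetical four-value Series and a window of 3; without `min_periods`, the first `window - 1` results are `NaN`.

```python
import pandas as pd

# Hypothetical fares (not Titanic data), window shortened to 3 for readability
fares = pd.Series([10.0, 20.0, 30.0, 40.0])

# min_periods=1: partial windows at the start are averaged over available values
partial_ok = fares.rolling(window=3, min_periods=1).mean()

# Default: a result is produced only once the window is full
full_only = fares.rolling(window=3).mean()

print(partial_ok.tolist())        # [10.0, 15.0, 20.0, 30.0]
print(full_only.isna().tolist())  # [True, True, False, False]
```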
Just for Exploration¶
What role does age play in the odds of survival? Let’s take a closer look at how age influenced survival during the Titanic disaster. This distribution plot will reveal whether certain age groups were more likely to make it through the tragedy, uncovering patterns that history may have hidden beneath the waves.
In [44]:
import seaborn as sns
# Create a KDE plot for the age distribution of survivors vs non-survivors
plt.figure(figsize=(10, 6))
sns.kdeplot(df[df['Survived'] == 1]['Age'], label='Survived', color='green')
sns.kdeplot(df[df['Survived'] == 0]['Age'], label='Did Not Survive', color='red')
# Add labels and title
plt.title('Age Distribution for Survivors vs Non-Survivors', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(title='Survival Status')
plt.show()
The age distribution reveals that younger passengers had a slightly higher survival rate. Interestingly, a noticeable drop in survival is observed around mid-adulthood, showing a subtle shift in survival patterns across different age groups.