Md Harun Or Roshid has successfully completed this project.

Intro to Pandas for Data Analysis

Data at Sea: Series Operations on the Titanic Dataset

Finished

February 9, 2025 11:49 AM

Introduction¶

In [7]:

# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# loading the dataset
df = pd.read_csv('titanic.csv')

In [8]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Pclass                   887 non-null    int64  
 2   Name                     887 non-null    object 
 3   Sex                      887 non-null    object 
 4   Age                      887 non-null    float64
 5   Siblings/Spouses Aboard  887 non-null    int64  
 6   Parents/Children Aboard  887 non-null    int64  
 7   Fare                     887 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB

In [3]:

df.head()

Out[3]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	8.0500

Warm Up Activities¶

1. What is the primary advantage of using vectorized operations in pandas compared to traditional loops?¶

In [ ]:
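
As a rough illustration of the question above, the sketch below runs the same comparison twice: once with an explicit Python loop and once as a single vectorized expression. The `fares` values are made up for illustration (an assumption, not rows read from `titanic.csv`).

```python
import pandas as pd

# Hypothetical fares, invented for illustration
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05])

# Loop version: Python handles each element one at a time
loop_values = []
for fare in fares:
    loop_values.append(bool(fare > 50))
loop_result = pd.Series(loop_values, index=fares.index)

# Vectorized version: one expression applied to the whole Series at once
vectorized_result = fares > 50

print(loop_result.tolist())
print(vectorized_result.tolist())
```

Both approaches give the same boolean values. The vectorized form is shorter and hands the element-wise work to optimized NumPy code, which is typically much faster on large columns.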

2. In the Titanic dataset, which of the following operations would `NOT` be considered a vectorized operation?¶

In [4]:

df['Fare'] > 50

Out[4]:

0      False
1       True
2      False
3       True
4      False
       ...  
882    False
883    False
884    False
885    False
886    False
Name: Fare, Length: 887, dtype: bool

3. What is the average age of passengers?¶

In [5]:

average_age = df['Age'].mean()
average_age

Out[5]:

29.471443066516347

4. Who were the survivors?¶

In [6]:

# fillna returns a new Series, so assign that result back to the column.
# Combining the assignment with inplace=True stored the method's None
# return value in every row of 'Age'.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age']

Out[6]:

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
882    27.0
883    19.0
884     7.0
885    26.0
886    32.0
Name: Age, Length: 887, dtype: float64
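
A note on filling missing values: in pandas releases up to 2.x, methods called with `inplace=True` return `None`, so the return value should not be assigned back to a column. The sketch below shows the two patterns that the pandas warning text itself recommends. The `toy` frame and its values are hypothetical, invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age, invented for illustration
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0]})

# Option 1: assign the Series returned by fillna (no inplace argument)
option_1 = toy.copy()
option_1['Age'] = option_1['Age'].fillna(option_1['Age'].mean())

# Option 2: call fillna on the DataFrame with a {column: value} mapping
option_2 = toy.fillna({'Age': toy['Age'].mean()})

print(option_1['Age'].tolist())
print(option_2['Age'].tolist())
```

In both cases the missing entry is replaced by the mean of the observed ages, and the original `toy` frame is left unchanged.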

In [16]:

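# Assumption: NewCol in the output below equals Fare * 2,
# so the column was presumably created like this
df['NewCol'] = df['Fare'] * 2
df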

Out[16]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare	NewCol
0	0	3	Mr. Owen Harris Braund	male	22.0	1	0	7.2500	14.5000
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	0	71.2833	142.5666
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.9250	15.8500
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.1000	106.2000
4	0	3	Mr. William Henry Allen	male	35.0	0	0	8.0500	16.1000
...	...	...	...	...	...	...	...	...	...
882	0	2	Rev. Juozas Montvila	male	27.0	0	0	13.0000	26.0000
883	1	1	Miss. Margaret Edith Graham	female	19.0	0	0	30.0000	60.0000
884	0	3	Miss. Catherine Helen Johnston	female	7.0	1	2	23.4500	46.9000
885	1	1	Mr. Karl Howell Behr	male	26.0	0	0	30.0000	60.0000
886	0	3	Mr. Patrick Dooley	male	32.0	0	0	7.7500	15.5000

887 rows × 9 columns

5. What does the following code do?¶

In [ ]:

Just for Exploration¶

Does wealth determine fate in a crisis? Let’s dive into the class system aboard the Titanic by visualizing survival rates based on passenger class. This bar chart paints a vivid picture of how socio-economic status may have played a role in determining who lived and who perished during one of history's most infamous maritime disasters.

In [25]:

# Plot survival rate by passenger class
survival_by_class = df.groupby('Pclass')['Survived'].mean().reset_index()
sns.barplot(x='Pclass', y='Survived', data=survival_by_class)
plt.title('Survival Rate by Passenger Class')
plt.ylabel('Survival Rate')
plt.xlabel('Passenger Class')
plt.show()

[Figure: bar chart "Survival Rate by Passenger Class"; x-axis: Passenger Class, y-axis: Survival Rate]

First-class passengers were the most likely to survive, with a survival rate above 60%, whereas third-class passengers faced the greatest risk. This contrast highlights the impact of social class on survival during the tragedy.

Activities¶

6. Calculating the age difference¶

In [9]:

mean_age = df['Age'].mean()
age_difference = df['Age'] - mean_age
age_difference

Out[9]:

0      -7.471443
1       8.528557
2      -3.471443
3       5.528557
4       5.528557
         ...    
882    -2.471443
883   -10.471443
884   -22.471443
885    -3.471443
886     2.528557
Name: Age, Length: 887, dtype: float64

7. Normalize the `Fare` column¶

In [13]:

fare_min = df['Fare'].min() # min fare
fare_max = df['Fare'].max() # max fare
normalized_fare = (df['Fare'] - fare_min) / (fare_max - fare_min) # normalized fare
normalized_fare

Out[13]:

0      0.014151
1      0.139136
2      0.015469
3      0.103644
4      0.015713
         ...   
882    0.025374
883    0.058556
884    0.045771
885    0.058556
886    0.015127
Name: Fare, Length: 887, dtype: float64
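
The calculation above is min-max normalization: each value is shifted by the column minimum and divided by the column range, so the smallest value maps to 0 and the largest to 1. A minimal sketch on made-up fares follows; the helper name `min_max_normalize` and the `toy_fares` values are assumptions for illustration, not part of the exercise.

```python
import pandas as pd


def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    lowest = values.min()
    highest = values.max()
    return (values - lowest) / (highest - lowest)


# Hypothetical fares, invented for illustration
toy_fares = pd.Series([0.0, 10.0, 25.0, 50.0])
normalized = min_max_normalize(toy_fares)
print(normalized.tolist())  # [0.0, 0.2, 0.5, 1.0]
```

If every value in the column were identical, the range would be zero and the result would be NaN, so a constant column needs separate handling.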

8. Calculate family size¶

In [19]:

# Siblings/spouses plus parents/children, plus 1 for the passenger themselves
family_size = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard'] + 1
family_size

Out[19]:

0      2
1      2
2      1
3      2
4      1
      ..
882    1
883    1
884    4
885    1
886    1
Length: 887, dtype: int64

9. Calculate Fare Per Family Member¶

In [22]:

fare_per_family_member = df['Fare'] / family_size
fare_per_family_member

Out[22]:

0       3.62500
1      35.64165
2       7.92500
3      26.55000
4       8.05000
         ...   
882    13.00000
883    30.00000
884     5.86250
885    30.00000
886     7.75000
Length: 887, dtype: float64

Just for Exploration¶

Curious about the age makeup of passengers in each class aboard the Titanic? This histogram will take you through the age distribution across the different passenger classes, revealing the dynamics between age and class structure during this fateful journey.

In [24]:

plt.figure(figsize=(10,6))
sns.histplot(df, x='Age', hue='Pclass', multiple='stack', bins=20, palette='coolwarm')
plt.title('Age Distribution by Passenger Class')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

10. Calculate Weighted Age Using Fare Weight¶

In [26]:

fare_weight = df['Fare'] / df['Fare'].max()
weighted_age = df['Age'] * fare_weight
weighted_age

Out[26]:

0      0.311323
1      5.287158
2      0.402183
3      3.627550
4      0.549939
         ...   
882    0.685106
883    1.112566
884    0.320399
885    1.522459
886    0.484064
Length: 887, dtype: float64

11. Calculate Cumulative Fare Percentage¶

In [37]:

sorted_fares = df['Fare'].sort_values()
cumulative_fare_percentage = (sorted_fares.cumsum() / sorted_fares.sum()) * 100
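
To make the cumulative percentage easier to follow, here is a small sketch of the same steps on made-up values (the `toy_fares` Series is hypothetical). Sorting keeps each row's original index label, which is why the result is indexed in sorted order rather than 0, 1, 2.

```python
import pandas as pd

# Hypothetical fares, invented for illustration
toy_fares = pd.Series([4.0, 1.0, 3.0])

# Sort ascending; original index labels travel with the values
sorted_toy = toy_fares.sort_values()

# Running total divided by the grand total, expressed as a percentage
cumulative_pct = sorted_toy.cumsum() / sorted_toy.sum() * 100

print(cumulative_pct.index.tolist())  # [1, 2, 0]
print(cumulative_pct.tolist())        # [12.5, 50.0, 100.0]
```

The last entry is always 100 because the running total of all values equals the grand total.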

12. Identify Fare Outliers Using IQR¶

In [42]:

Q1 = df['Fare'].quantile(0.25)
Q3 = df['Fare'].quantile(0.75)
IQR = Q3 - Q1
is_fare_outlier = (df['Fare'] < (Q1 - 1.5 * IQR)) | (df['Fare'] > (Q3 + 1.5 * IQR))
is_fare_outlier

Out[42]:

0      False
1       True
2      False
3      False
4      False
       ...  
882    False
883    False
884    False
885    False
886    False
Name: Fare, Length: 887, dtype: bool
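
The rule above flags any value more than 1.5 times the interquartile range below Q1 or above Q3. The sketch below wraps that rule in a helper and checks it on made-up fares; the function name `iqr_outlier_mask`, its `k` parameter, and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical fares with one extreme value, invented for illustration
toy_fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 200.0])
mask = iqr_outlier_mask(toy_fares)
print(toy_fares[mask].tolist())  # [200.0]
```

Here Q1 is 8.0 and Q3 is 9.75, so the fences sit at 5.375 and 12.375 and only the fare of 200 falls outside them.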

13. Calculate Rolling Average of Fare¶

In [46]:

# Sort by index so the window follows the original row order
sorted_df = df.sort_index()
# 10-row window; min_periods=1 lets the first rows average fewer than 10 values
rolling_average_fare = sorted_df['Fare'].rolling(window=10, min_periods=1).mean()
rolling_average_fare

Out[46]:

0       7.250000
1      39.266650
2      28.819433
3      34.889575
4      29.521660
         ...    
882    20.303740
883    22.514160
884    24.069580
885    18.753750
886    16.928750
Name: Fare, Length: 887, dtype: float64
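
The effect of `min_periods` on a rolling mean is easiest to see on a short series. The sketch below uses made-up values and a window of 3 instead of 10 (both are assumptions chosen to keep the example small).

```python
import pandas as pd

# Hypothetical fares, invented for illustration
toy_fares = pd.Series([10.0, 20.0, 30.0, 40.0])

# min_periods=1: early rows average whatever values are available so far
rolling_mean = toy_fares.rolling(window=3, min_periods=1).mean()
print(rolling_mean.tolist())  # [10.0, 15.0, 20.0, 30.0]

# Default min_periods equals the window size, so the first two rows are NaN
strict_mean = toy_fares.rolling(window=3).mean()
print(strict_mean.isna().tolist())  # [True, True, False, False]
```

With `min_periods=1` the result has no missing values at the start, at the cost of the first few averages being based on fewer observations than the rest.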

Just for Exploration¶

What role does age play in the odds of survival? Let’s take a closer look at how age influenced survival during the Titanic disaster. This distribution plot will reveal whether certain age groups were more likely to make it through the tragedy, uncovering patterns that history may have hidden beneath the waves.

In [44]:

import seaborn as sns
# Create a KDE plot for the age distribution of survivors vs non-survivors
plt.figure(figsize=(10, 6))
sns.kdeplot(df[df['Survived'] == 1]['Age'], label='Survived', color='green')
sns.kdeplot(df[df['Survived'] == 0]['Age'], label='Did Not Survive', color='red')

# Add labels and title
plt.title('Age Distribution for Survivors vs Non-Survivors', fontsize=16)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(title='Survival Status')
plt.show()

The density curves suggest that younger passengers survived at a slightly higher rate, and that survival drops noticeably around mid-adulthood, a subtle shift in the pattern across age groups.

Statement of Completion#01b8a03b
