Statement of Completion#a5d5585e
Intro to Pandas for Data Analysis
medium
Data Analysis and Plotting of Hotel Ratings and Trip Type
Resolution
Activities
Data Analysis and Plotting of Hotel Ratings and Trip Types¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv('data/Data_TripAdvisor_v1 - Data.csv', low_memory=False)
In [3]:
df.head()
Out[3]:
ID_USER | USER_STATE | USER_TIMEZONE | ID_HOTEL | HOTEL_CITY | HOTEL_STATE | HOTEL_TIMEZONE | Trip Type | Rating | |
---|---|---|---|---|---|---|---|---|---|
0 | 45 | GA | Eastern | 105170 | Memphis | TN | Central | 3 | 5 |
1 | 45 | GA | Eastern | 223229 | SanAntonio | TX | Central | 3 | 5 |
2 | 45 | GA | Eastern | 258688 | Albuquerque | NM | Mountain | 3 | 5 |
3 | 45 | GA | Eastern | 98827 | ELPaso | TX | Central | 3 | 5 |
4 | 45 | GA | Eastern | 99518 | SanAntonio | TX | Central | 3 | 3 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4669 entries, 0 to 4668 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID_USER 4669 non-null int64 1 USER_STATE 4669 non-null object 2 USER_TIMEZONE 4669 non-null object 3 ID_HOTEL 4669 non-null int64 4 HOTEL_CITY 4669 non-null object 5 HOTEL_STATE 4669 non-null object 6 HOTEL_TIMEZONE 4669 non-null object 7 Trip Type 4669 non-null int64 8 Rating 4669 non-null int64 dtypes: int64(4), object(5) memory usage: 328.4+ KB
1) Analyze Trip Type Distribution¶
In [5]:
trip_type_counts = df['Trip Type'].value_counts()
trip_type_counts
Out[5]:
Trip Type 3 2062 2 1005 1 771 4 635 5 196 Name: count, dtype: int64
2) Create a pie chart to identify the most popular Trip Type¶
In [9]:
# Create a figure and axis
fig, ax = plt.subplots()
trip_type_counts.plot(kind='pie', autopct="%1.1f%%", ax=ax);
ax.set_title('Most Popular Trip Type')
Out[9]:
Text(0.5, 1.0, 'Most Popular Trip Type')
3) Analyzing Hotel City Distribution¶
The HOTEL_CITY
column contains several cities where the hotels are present. Now, use the value_counts()
method on the hotel city column and store it in the city_counts
variable.
In [11]:
city_counts = df['HOTEL_CITY'].value_counts()
city_counts
Out[11]:
HOTEL_CITY Houston 306 SanAntonio 246 Austin 237 Albuquerque 211 Columbus 210 Indianapolis 187 Phoenix 178 OklahomaCity 174 Charlotte 167 Atlanta 164 Tucson 154 Jacksonville 137 Dallas 135 Sacramento 133 Nashville 127 Louisville 123 Wichita 122 Tulsa 117 ELPaso 114 Denver 103 Memphis 102 SanDiego 92 LasVegas 89 Portland 87 Omaha 85 FortWorth 78 LosAngeles 76 Fresno 73 Milwaukee 69 Raleigh 68 Seattle 60 SanJose 56 Arlington 55 Miami 44 Minneapolis 42 Mesa 37 Oakland 36 LongBeach 28 SanFrancisco 28 VirginiaBeach 27 Baltimore 22 Chicago 14 NewYork 12 Philadelphia 11 Cleveland 10 Boston 9 Detroit 7 Kansas 7 Name: count, dtype: int64
4) Create a bar chart to determine the most popular city where the highest number of hotels are present¶
In [14]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))
city_counts.plot(kind = 'bar', ax=ax);
ax.set_title("Most Popular Hotel Cities")
ax.set_xlabel('Hotel City')
ax.set_ylabel('Count')
Out[14]:
Text(0, 0.5, 'Count')
5) Analyzing Rating Distribution¶
In [16]:
rating_counts = df['Rating'].value_counts()
rating_counts
Out[16]:
Rating 4 1771 3 1163 5 1046 2 470 1 219 Name: count, dtype: int64
6) Create a bar plot to visualize the distribution of customer ratings¶
In [18]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))
rating_counts.plot(kind='bar',ax=ax);
ax.set_title('Distribution of Customer Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Count')
Out[18]:
Text(0, 0.5, 'Count')
7) Which rating was given by the most people?¶
Look at the plot of Activity 3.
8) Count Hotel Timezones¶
In [20]:
hotel_timezone_counts = df['HOTEL_TIMEZONE'].value_counts()
hotel_timezone_counts
Out[20]:
HOTEL_TIMEZONE Central 2030 Eastern 1198 Pacific 758 Mountain 683 Name: count, dtype: int64
9) Calculate the percentage of each hotel timezone¶
In [24]:
hotel_timezone_percents = hotel_timezone_counts / df.shape[0] * 100
hotel_timezone_percents
Out[24]:
HOTEL_TIMEZONE Central 43.478261 Eastern 25.658599 Pacific 16.234740 Mountain 14.628400 Name: count, dtype: float64
In [22]:
df.shape
Out[22]:
(4669, 9)
In [27]:
hotel_timezone_percents.index
Out[27]:
Index(['Central', 'Eastern', 'Pacific', 'Mountain'], dtype='object', name='HOTEL_TIMEZONE')
10) Create a pie chart to visualize the percentage distribution of hotel timezones¶
In [29]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))
hotel_timezone_percents.plot(kind='pie', labels = hotel_timezone_percents.index,
autopct = '%1.1f%%',
ax=ax);
ax.set_title('Percentage Distribution of Hotel Timezones')
Out[29]:
Text(0.5, 1.0, 'Percentage Distribution of Hotel Timezones')
11) Which timezone most of the hotels are in?¶
- Central
- Mountain
- Eastern
- Pacific
For the next activity, we need to calculate the mean rating for each HOTEL_TIMEZONE
. To do this, we will use the pivot_table()
function to group the data by HOTEL_TIMEZONE
and calculate the mean Rating
for each group. Then, we will convert the result to a DataFrame with columns HOTEL_TIMEZONE
and mean Rating
, and store it in the timezone_ratings
variable.
Note: The
pivot_table()
method is out of scope for this project, which is why we have provided the code for you.
In [34]:
df.pivot_table(index='HOTEL_TIMEZONE', values='Rating', aggfunc='mean').reset_index()
Out[34]:
HOTEL_TIMEZONE | Rating | |
---|---|---|
0 | Central | 3.670936 |
1 | Eastern | 3.667780 |
2 | Mountain | 3.560761 |
3 | Pacific | 3.540897 |
In [33]:
df.groupby('HOTEL_TIMEZONE')['Rating'].agg('mean')
Out[33]:
HOTEL_TIMEZONE Central 3.670936 Eastern 3.667780 Mountain 3.560761 Pacific 3.540897 Name: Rating, dtype: float64
In [39]:
timezone_ratings = df.pivot_table(index='HOTEL_TIMEZONE', values='Rating', aggfunc='mean').reset_index()
timezone_ratings
Out[39]:
HOTEL_TIMEZONE | Rating | |
---|---|---|
0 | Central | 3.670936 |
1 | Eastern | 3.667780 |
2 | Mountain | 3.560761 |
3 | Pacific | 3.540897 |
In [41]:
timezone_ratings['HOTEL_TIMEZONE']
Out[41]:
0 Central 1 Eastern 2 Mountain 3 Pacific Name: HOTEL_TIMEZONE, dtype: object
12) Create a bar chart to visualize average ratings by Hotel Timezone using subplots¶
In [45]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))
# Create a bar plot using pandas
timezone_ratings.plot(kind='bar',
x = 'HOTEL_TIMEZONE',
y = 'Rating',
ax=ax);
ax.set_title('Average Rating by Hotel Timezone')
ax.set_xlabel('Hotel Timezone')
ax.set_ylabel('Average Rating')
Out[45]:
Text(0, 0.5, 'Average Rating')
13) What are the two time zones with the highest average ratings?¶
- Central
- Eastern
- Mountain
- Pacific
For the next activity, we need to calculate the mean Rating
for each USER_STATE
using the pivot_table()
function. We will store the result in the user_state_ratings
DataFrame. Then, we will sort the user_state_ratings
DataFrame by the Rating
column in descending order.
Note: The
pivot_table()
method is out of scope for this project, which is why we have provided the code for you.
In [47]:
user_state_ratings = df.pivot_table(index='USER_STATE', values='Rating', aggfunc='mean').reset_index()
user_state_ratings = user_state_ratings.sort_values(by='Rating', ascending=False)
14) Create a bar chart to visualize average ratings by User State using subplots¶
In [49]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))
user_state_ratings.plot(kind = 'bar',
x = 'USER_STATE',
y = 'Rating',
ax=ax);
ax.set_title('Average Ratings by User State')
ax.set_xlabel('User State')
ax.set_ylabel('Average Rating')
Out[49]:
Text(0, 0.5, 'Average Rating')
15) Which two user states have given the highest rating?¶
- ND
- DC
- GA
- WV
- ME
16) Filter top-rated hotels¶
In [51]:
top_rated_hotels = df[df['Rating'] >= 4]
In [53]:
df.columns
Out[53]:
Index(['ID_USER', 'USER_STATE', 'USER_TIMEZONE', 'ID_HOTEL', 'HOTEL_CITY', 'HOTEL_STATE', 'HOTEL_TIMEZONE', 'Trip Type', 'Rating'], dtype='object')
17) Select data for a specific hotel timezone¶
In [54]:
timezone = 'Eastern'
timezone_data = top_rated_hotels[top_rated_hotels['HOTEL_TIMEZONE'] == timezone]
In [56]:
timezone_data
Out[56]:
ID_USER | USER_STATE | USER_TIMEZONE | ID_HOTEL | HOTEL_CITY | HOTEL_STATE | HOTEL_TIMEZONE | Trip Type | Rating | |
---|---|---|---|---|---|---|---|---|---|
5 | 64 | TX | Central | 224427 | Cleveland | OH | Eastern | 1 | 4 |
8 | 100 | NY | Eastern | 120111 | Jacksonville | FL | Eastern | 2 | 5 |
9 | 100 | NY | Eastern | 91428 | Indianapolis | IN | Eastern | 2 | 4 |
17 | 159 | TX | Central | 92773 | Louisville | KY | Eastern | 3 | 4 |
27 | 221 | CT | Eastern | 222955 | Boston | MA | Eastern | 1 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4645 | 65342 | CA | Pacific | 224590 | Jacksonville | FL | Eastern | 3 | 5 |
4648 | 65342 | CA | Pacific | 88398 | Jacksonville | FL | Eastern | 3 | 5 |
4649 | 65342 | CA | Pacific | 91607 | Indianapolis | IN | Eastern | 3 | 4 |
4654 | 65406 | FL | Eastern | 91458 | Indianapolis | IN | Eastern | 3 | 5 |
4660 | 65423 | KY | Central | 89380 | Atlanta | GA | Eastern | 4 | 5 |
730 rows × 9 columns
18) Get unique trip types and their counts¶
In [72]:
trip_types = timezone_data['Trip Type'].value_counts()
trip_types
Out[72]:
Trip Type 3 305 2 177 1 118 4 94 5 36 Name: count, dtype: int64
In [81]:
sum(trip_types)
Out[81]:
730
19) Calculate the percentage of each trip type¶
In [96]:
total_trips = trip_types.sum()
trip_type_percents = pd.Series(trip_types / total_trips * 100, index = trip_types.index)
trip_type_percents
Out[96]:
Trip Type 3 41.780822 2 24.246575 1 16.164384 4 12.876712 5 4.931507 Name: count, dtype: float64
In [ ]:
In [94]:
trip_types.index
Out[94]:
Index([3, 2, 1, 4, 5], dtype='int64', name='Trip Type')
In [97]:
trip_type_percents = pd.Series([(count / total_trips) * 100 for count in trip_types.values], index=trip_types.index)
trip_type_percents
Out[97]:
Trip Type 3 41.780822 2 24.246575 1 16.164384 4 12.876712 5 4.931507 dtype: float64
20) Visualizing Trip Type Distribution with a Pie Chart¶
In [98]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))
# Create a pie chart using pandas
trip_type_percents.plot(kind ='pie',
labels = trip_types.index,
autopct = '%1.1f%%',
ax=ax,
title='Distribution of Trip Types for Top-Rated Hotels');
For the next activity, we need to find the top 5 USER_STATE
by the number of ratings and store them in top_user_states
. Then, we will filter the data to include only the top 5 user states and store it in top_user_state_data
.
Next, we will use the pivot_table()
function to find the mean Rating
for each combination of USER_STATE
and HOTEL_TIMEZONE
. We will save the result in user_state_timezone_ratings
.
Note: The
pivot_table()
method is out of scope for this project, which is why we have provided the code for you.
In [108]:
top_user_states = df['USER_STATE'].value_counts().nlargest(5).index
top_user_state_data = df[df['USER_STATE'].isin(top_user_states)]
user_state_timezone_ratings = top_user_state_data.pivot_table(
index=['USER_STATE', 'HOTEL_TIMEZONE'],
values='Rating',
aggfunc='mean'
).reset_index()
In [106]:
top_user_state_data.pivot_table(
index=['USER_STATE', 'HOTEL_TIMEZONE'],
values='Rating',
aggfunc='mean'
).reset_index()
Out[106]:
USER_STATE | HOTEL_TIMEZONE | Rating | |
---|---|---|---|
0 | AZ | Central | 3.408163 |
1 | AZ | Eastern | 3.166667 |
2 | AZ | Mountain | 3.439560 |
3 | AZ | Pacific | 3.476190 |
4 | CA | Central | 3.649254 |
5 | CA | Eastern | 3.711864 |
6 | CA | Mountain | 3.669903 |
7 | CA | Pacific | 3.512456 |
8 | FL | Central | 3.584416 |
9 | FL | Eastern | 3.838710 |
10 | FL | Mountain | 3.550000 |
11 | FL | Pacific | 4.100000 |
12 | OH | Central | 3.468085 |
13 | OH | Eastern | 3.675497 |
14 | OH | Mountain | 4.038462 |
15 | OH | Pacific | 3.714286 |
16 | TX | Central | 3.770062 |
17 | TX | Eastern | 3.769231 |
18 | TX | Mountain | 3.587302 |
19 | TX | Pacific | 3.508197 |
In [110]:
user_state_timezone_ratings.pivot(index='USER_STATE', columns='HOTEL_TIMEZONE', values='Rating')
Out[110]:
HOTEL_TIMEZONE | Central | Eastern | Mountain | Pacific |
---|---|---|---|---|
USER_STATE | ||||
AZ | 3.408163 | 3.166667 | 3.439560 | 3.476190 |
CA | 3.649254 | 3.711864 | 3.669903 | 3.512456 |
FL | 3.584416 | 3.838710 | 3.550000 | 4.100000 |
OH | 3.468085 | 3.675497 | 4.038462 | 3.714286 |
TX | 3.770062 | 3.769231 | 3.587302 | 3.508197 |
21) Create a bar chart to visualize regional rating trends for the top user states across hotel locations using pivoted data.¶
Note: The
pivot()
method is out of scope for this project, which is why we have provided the code for you.
In [112]:
# Pivot the data
pivoted_timezone = user_state_timezone_ratings.pivot(index='USER_STATE', columns='HOTEL_TIMEZONE', values='Rating')
fig, ax = plt.subplots(figsize=(10, 6))
pivoted_timezone.plot(kind='bar', legend = True, rot = 45, ax=ax);
ax.set_title('Average Ratings of Top User States across Hotel Timezones')
ax.set_xlabel('Hotel Timezone')
ax.set_ylabel('Average Rating')
Out[112]:
Text(0, 0.5, 'Average Rating')
For the next activity, we need to find the top 5 USER_STATE
by the count of ratings. Then, we will filter the data to include only the top 5 user states. Next, we will calculate the mean rating for each combination of user state and hotel state using the pivot_table()
function. Finally, we will sort the data by Rating
in descending order.
Note: The
pivot_table()
method is out of scope for this project, which is why we have provided the code for you.
In [114]:
top_states = df['USER_STATE'].value_counts().nlargest(5).index
top_data = df[df['USER_STATE'].isin(top_states)]
ratings = top_data.pivot_table(index=['USER_STATE', 'HOTEL_STATE'], values='Rating', aggfunc='mean').reset_index()
ratings = ratings.sort_values(by='Rating', ascending=False)
22) Visualizing Average Ratings with a Bar Chart¶
Note: The
pivot()
method is out of scope for this project, which is why we have provided the code for you.
In [ ]:
# Pivot the data
pivoted_state = ratings.pivot(index='USER_STATE', columns='HOTEL_STATE', values='Rating')
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 8))
# Create a bar chart using pandas
pivoted_state.plot(kind = 'bar', legend = True, ax=ax);
ax.set_title('Average Ratings of Top User States by Hotel State')
ax.set_xlabel('Hotel State')
ax.set_ylabel('Average Rating')