Data Analysis and Plotting of Hotel Ratings and Trip Types¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:

df = pd.read_csv('data/Data_TripAdvisor_v1 - Data.csv', low_memory=False)

In [3]:

df.head()

Out[3]:

	ID_USER	USER_STATE	USER_TIMEZONE	ID_HOTEL	HOTEL_CITY	HOTEL_STATE	HOTEL_TIMEZONE	Trip Type	Rating
0	45	GA	Eastern	105170	Memphis	TN	Central	3	5
1	45	GA	Eastern	223229	SanAntonio	TX	Central	3	5
2	45	GA	Eastern	258688	Albuquerque	NM	Mountain	3	5
3	45	GA	Eastern	98827	ELPaso	TX	Central	3	5
4	45	GA	Eastern	99518	SanAntonio	TX	Central	3	3

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4669 entries, 0 to 4668
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID_USER         4669 non-null   int64 
 1   USER_STATE      4669 non-null   object
 2   USER_TIMEZONE   4669 non-null   object
 3   ID_HOTEL        4669 non-null   int64 
 4   HOTEL_CITY      4669 non-null   object
 5   HOTEL_STATE     4669 non-null   object
 6   HOTEL_TIMEZONE  4669 non-null   object
 7   Trip Type       4669 non-null   int64 
 8   Rating          4669 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 328.4+ KB

1) Analyze Trip Type Distribution¶

In [5]:

trip_type_counts = df['Trip Type'].value_counts()
trip_type_counts

Out[5]:

Trip Type
3    2062
2    1005
1     771
4     635
5     196
Name: count, dtype: int64

2) Create a pie chart to identify the most popular Trip Type¶

In [9]:

# Create a figure and axis
fig, ax = plt.subplots()

trip_type_counts.plot(kind='pie', autopct="%1.1f%%", ax=ax);
ax.set_title('Most Popular Trip Type')

Out[9]:

Text(0.5, 1.0, 'Most Popular Trip Type')

3) Analyzing Hotel City Distribution¶

The HOTEL_CITY column contains several cities where the hotels are present. Now, use the value_counts() method on the hotel city column and store it in the city_counts variable.

In [11]:

city_counts = df['HOTEL_CITY'].value_counts()
city_counts

Out[11]:

HOTEL_CITY
Houston          306
SanAntonio       246
Austin           237
Albuquerque      211
Columbus         210
Indianapolis     187
Phoenix          178
OklahomaCity     174
Charlotte        167
Atlanta          164
Tucson           154
Jacksonville     137
Dallas           135
Sacramento       133
Nashville        127
Louisville       123
Wichita          122
Tulsa            117
ELPaso           114
Denver           103
Memphis          102
SanDiego          92
LasVegas          89
Portland          87
Omaha             85
FortWorth         78
LosAngeles        76
Fresno            73
Milwaukee         69
Raleigh           68
Seattle           60
SanJose           56
Arlington         55
Miami             44
Minneapolis       42
Mesa              37
Oakland           36
LongBeach         28
SanFrancisco      28
VirginiaBeach     27
Baltimore         22
Chicago           14
NewYork           12
Philadelphia      11
Cleveland         10
Boston             9
Detroit            7
Kansas             7
Name: count, dtype: int64

4) Create a bar chart to determine the most popular city where the highest number of hotels are present¶

In [14]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

city_counts.plot(kind = 'bar', ax=ax);

ax.set_title("Most Popular Hotel Cities")
ax.set_xlabel('Hotel City')
ax.set_ylabel('Count')

Out[14]:

Text(0, 0.5, 'Count')

5) Analyzing Rating Distribution¶

In [16]:

rating_counts = df['Rating'].value_counts()
rating_counts

Out[16]:

Rating
4    1771
3    1163
5    1046
2     470
1     219
Name: count, dtype: int64

6) Create a bar plot to visualize the distribution of customer ratings¶

In [18]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

rating_counts.plot(kind='bar',ax=ax);
ax.set_title('Distribution of Customer Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Count')

Out[18]:

Text(0, 0.5, 'Count')

7) Which rating was given by the most people?¶

Look at the plot of Activity 3.

8) Count Hotel Timezones¶

In [20]:

hotel_timezone_counts = df['HOTEL_TIMEZONE'].value_counts()
hotel_timezone_counts

Out[20]:

HOTEL_TIMEZONE
Central     2030
Eastern     1198
Pacific      758
Mountain     683
Name: count, dtype: int64

9) Calculate the percentage of each hotel timezone¶

In [24]:

hotel_timezone_percents = hotel_timezone_counts / df.shape[0] * 100
hotel_timezone_percents

Out[24]:

HOTEL_TIMEZONE
Central     43.478261
Eastern     25.658599
Pacific     16.234740
Mountain    14.628400
Name: count, dtype: float64

In [22]:

df.shape

Out[22]:

(4669, 9)

In [27]:

hotel_timezone_percents.index

Out[27]:

Index(['Central', 'Eastern', 'Pacific', 'Mountain'], dtype='object', name='HOTEL_TIMEZONE')

10) Create a pie chart to visualize the percentage distribution of hotel timezones¶

In [29]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

hotel_timezone_percents.plot(kind='pie', labels = hotel_timezone_percents.index, 
                             autopct = '%1.1f%%',
                             ax=ax);
ax.set_title('Percentage Distribution of Hotel Timezones')

Out[29]:

Text(0.5, 1.0, 'Percentage Distribution of Hotel Timezones')

11) Which timezone most of the hotels are in?¶

Central
Mountain
Eastern
Pacific

For the next activity, we need to calculate the mean rating for each HOTEL_TIMEZONE. To do this, we will use the pivot_table() function to group the data by HOTEL_TIMEZONE and calculate the mean Rating for each group. Then, we will convert the result to a DataFrame with columns HOTEL_TIMEZONE and mean Rating, and store it in the timezone_ratings variable.

Note: The pivot_table() method is out of scope for this project, which is why we have provided the code for you.

In [34]:

df.pivot_table(index='HOTEL_TIMEZONE', values='Rating', aggfunc='mean').reset_index()

Out[34]:

	HOTEL_TIMEZONE	Rating
0	Central	3.670936
1	Eastern	3.667780
2	Mountain	3.560761
3	Pacific	3.540897

In [33]:

df.groupby('HOTEL_TIMEZONE')['Rating'].agg('mean')

Out[33]:

HOTEL_TIMEZONE
Central     3.670936
Eastern     3.667780
Mountain    3.560761
Pacific     3.540897
Name: Rating, dtype: float64

In [39]:

timezone_ratings = df.pivot_table(index='HOTEL_TIMEZONE', values='Rating', aggfunc='mean').reset_index()
timezone_ratings

Out[39]:

	HOTEL_TIMEZONE	Rating
0	Central	3.670936
1	Eastern	3.667780
2	Mountain	3.560761
3	Pacific	3.540897

In [41]:

timezone_ratings['HOTEL_TIMEZONE']

Out[41]:

0     Central
1     Eastern
2    Mountain
3     Pacific
Name: HOTEL_TIMEZONE, dtype: object

12) Create a bar chart to visualize average ratings by Hotel Timezone using subplots¶

In [45]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Create a bar plot using pandas
timezone_ratings.plot(kind='bar', 
                      x = 'HOTEL_TIMEZONE',
                      y = 'Rating',
                      ax=ax);
ax.set_title('Average Rating by Hotel Timezone')
ax.set_xlabel('Hotel Timezone')
ax.set_ylabel('Average Rating')

Out[45]:

Text(0, 0.5, 'Average Rating')

13) What are the two time zones with the highest average ratings?¶

Central
Eastern
Mountain
Pacific

For the next activity, we need to calculate the mean Rating for each USER_STATE using the pivot_table() function. We will store the result in the user_state_ratings DataFrame. Then, we will sort the user_state_ratings DataFrame by the Rating column in descending order.

Note: The pivot_table() method is out of scope for this project, which is why we have provided the code for you.

In [47]:

user_state_ratings = df.pivot_table(index='USER_STATE', values='Rating', aggfunc='mean').reset_index()

user_state_ratings = user_state_ratings.sort_values(by='Rating', ascending=False)

14) Create a bar chart to visualize average ratings by User State using subplots¶

In [49]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

user_state_ratings.plot(kind = 'bar',
                        x = 'USER_STATE',
                        y = 'Rating',
                        ax=ax);

ax.set_title('Average Ratings by User State')
ax.set_xlabel('User State')
ax.set_ylabel('Average Rating')

Out[49]:

Text(0, 0.5, 'Average Rating')

15) Which two user states have given the highest rating?¶

ND
DC
GA
WV
ME

16) Filter top-rated hotels¶

In [51]:

top_rated_hotels = df[df['Rating'] >= 4]

In [53]:

df.columns

Out[53]:

Index(['ID_USER', 'USER_STATE', 'USER_TIMEZONE', 'ID_HOTEL', 'HOTEL_CITY',
       'HOTEL_STATE', 'HOTEL_TIMEZONE', 'Trip Type', 'Rating'],
      dtype='object')

17) Select data for a specific hotel timezone¶

In [54]:

timezone = 'Eastern'
timezone_data = top_rated_hotels[top_rated_hotels['HOTEL_TIMEZONE'] == timezone]

In [56]:

timezone_data

Out[56]:

	ID_USER	USER_STATE	USER_TIMEZONE	ID_HOTEL	HOTEL_CITY	HOTEL_STATE	HOTEL_TIMEZONE	Trip Type	Rating
5	64	TX	Central	224427	Cleveland	OH	Eastern	1	4
8	100	NY	Eastern	120111	Jacksonville	FL	Eastern	2	5
9	100	NY	Eastern	91428	Indianapolis	IN	Eastern	2	4
17	159	TX	Central	92773	Louisville	KY	Eastern	3	4
27	221	CT	Eastern	222955	Boston	MA	Eastern	1	5
...	...	...	...	...	...	...	...	...	...
4645	65342	CA	Pacific	224590	Jacksonville	FL	Eastern	3	5
4648	65342	CA	Pacific	88398	Jacksonville	FL	Eastern	3	5
4649	65342	CA	Pacific	91607	Indianapolis	IN	Eastern	3	4
4654	65406	FL	Eastern	91458	Indianapolis	IN	Eastern	3	5
4660	65423	KY	Central	89380	Atlanta	GA	Eastern	4	5

730 rows × 9 columns

18) Get unique trip types and their counts¶

In [72]:

trip_types = timezone_data['Trip Type'].value_counts()
trip_types

Out[72]:

Trip Type
3    305
2    177
1    118
4     94
5     36
Name: count, dtype: int64

In [81]:

sum(trip_types)

Out[81]:

19) Calculate the percentage of each trip type¶

In [96]:

total_trips = trip_types.sum()
trip_type_percents = pd.Series(trip_types / total_trips * 100, index = trip_types.index)
trip_type_percents

Out[96]:

Trip Type
3    41.780822
2    24.246575
1    16.164384
4    12.876712
5     4.931507
Name: count, dtype: float64

In [ ]:

In [94]:

trip_types.index

Out[94]:

Index([3, 2, 1, 4, 5], dtype='int64', name='Trip Type')

In [97]:

trip_type_percents = pd.Series([(count / total_trips) * 100 for count in trip_types.values], index=trip_types.index)
trip_type_percents

Out[97]:

Trip Type
3    41.780822
2    24.246575
1    16.164384
4    12.876712
5     4.931507
dtype: float64

20) Visualizing Trip Type Distribution with a Pie Chart¶

In [98]:

# Create a figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

# Create a pie chart using pandas
trip_type_percents.plot(kind ='pie', 
                        labels = trip_types.index, 
                        autopct = '%1.1f%%', 
                        ax=ax,
                        title='Distribution of Trip Types for Top-Rated Hotels');

For the next activity, we need to find the top 5 USER_STATE by the number of ratings and store them in top_user_states. Then, we will filter the data to include only the top 5 user states and store it in top_user_state_data.

Next, we will use the pivot_table() function to find the mean Rating for each combination of USER_STATE and HOTEL_TIMEZONE. We will save the result in user_state_timezone_ratings.

Note: The pivot_table() method is out of scope for this project, which is why we have provided the code for you.

In [108]:

top_user_states = df['USER_STATE'].value_counts().nlargest(5).index

top_user_state_data = df[df['USER_STATE'].isin(top_user_states)]

user_state_timezone_ratings = top_user_state_data.pivot_table(
    index=['USER_STATE', 'HOTEL_TIMEZONE'],
    values='Rating',
    aggfunc='mean'
).reset_index()

In [106]:

top_user_state_data.pivot_table(
    index=['USER_STATE', 'HOTEL_TIMEZONE'],
    values='Rating',
    aggfunc='mean'
).reset_index()

Out[106]:

	USER_STATE	HOTEL_TIMEZONE	Rating
0	AZ	Central	3.408163
1	AZ	Eastern	3.166667
2	AZ	Mountain	3.439560
3	AZ	Pacific	3.476190
4	CA	Central	3.649254
5	CA	Eastern	3.711864
6	CA	Mountain	3.669903
7	CA	Pacific	3.512456
8	FL	Central	3.584416
9	FL	Eastern	3.838710
10	FL	Mountain	3.550000
11	FL	Pacific	4.100000
12	OH	Central	3.468085
13	OH	Eastern	3.675497
14	OH	Mountain	4.038462
15	OH	Pacific	3.714286
16	TX	Central	3.770062
17	TX	Eastern	3.769231
18	TX	Mountain	3.587302
19	TX	Pacific	3.508197

In [110]:

user_state_timezone_ratings.pivot(index='USER_STATE', columns='HOTEL_TIMEZONE', values='Rating')

Out[110]:

HOTEL_TIMEZONE	Central	Eastern	Mountain	Pacific
USER_STATE
AZ	3.408163	3.166667	3.439560	3.476190
CA	3.649254	3.711864	3.669903	3.512456
FL	3.584416	3.838710	3.550000	4.100000
OH	3.468085	3.675497	4.038462	3.714286
TX	3.770062	3.769231	3.587302	3.508197

21) Create a bar chart to visualize regional rating trends for the top user states across hotel locations using pivoted data.¶

Note: The pivot() method is out of scope for this project, which is why we have provided the code for you.

In [112]:

# Pivot the data
pivoted_timezone = user_state_timezone_ratings.pivot(index='USER_STATE', columns='HOTEL_TIMEZONE', values='Rating')

fig, ax = plt.subplots(figsize=(10, 6))

pivoted_timezone.plot(kind='bar', legend = True, rot = 45, ax=ax);
ax.set_title('Average Ratings of Top User States across Hotel Timezones')
ax.set_xlabel('Hotel Timezone')
ax.set_ylabel('Average Rating')

Out[112]:

Text(0, 0.5, 'Average Rating')

For the next activity, we need to find the top 5 USER_STATE by the count of ratings. Then, we will filter the data to include only the top 5 user states. Next, we will calculate the mean rating for each combination of user state and hotel state using the pivot_table() function. Finally, we will sort the data by Rating in descending order.

Note: The pivot_table() method is out of scope for this project, which is why we have provided the code for you.

In [114]:

top_states = df['USER_STATE'].value_counts().nlargest(5).index
top_data = df[df['USER_STATE'].isin(top_states)]
ratings = top_data.pivot_table(index=['USER_STATE', 'HOTEL_STATE'], values='Rating', aggfunc='mean').reset_index()
ratings = ratings.sort_values(by='Rating', ascending=False)

22) Visualizing Average Ratings with a Bar Chart¶

Note: The pivot() method is out of scope for this project, which is why we have provided the code for you.

In [ ]:

# Pivot the data
pivoted_state = ratings.pivot(index='USER_STATE', columns='HOTEL_STATE', values='Rating')

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 8))

# Create a bar chart using pandas
pivoted_state.plot(kind = 'bar', legend = True, ax=ax);
ax.set_title('Average Ratings of Top User States by Hotel State')
ax.set_xlabel('Hotel State')
ax.set_ylabel('Average Rating')

Statement of Completion#a5d5585e

Intro to Pandas for Data Analysis

Data Analysis and Plotting of Hotel Ratings and Trip Type

Data Analysis and Plotting of Hotel Ratings and Trip Types¶

1) Analyze Trip Type Distribution¶

2) Create a pie chart to identify the most popular Trip Type¶

3) Analyzing Hotel City Distribution¶

4) Create a bar chart to determine the most popular city where the highest number of hotels are present¶

5) Analyzing Rating Distribution¶

6) Create a bar plot to visualize the distribution of customer ratings¶

7) Which rating was given by the most people?¶

8) Count Hotel Timezones¶

9) Calculate the percentage of each hotel timezone¶

10) Create a pie chart to visualize the percentage distribution of hotel timezones¶

11) Which timezone most of the hotels are in?¶

12) Create a bar chart to visualize average ratings by Hotel Timezone using subplots¶

13) What are the two time zones with the highest average ratings?¶

14) Create a bar chart to visualize average ratings by User State using subplots¶

15) Which two user states have given the highest rating?¶

16) Filter top-rated hotels¶

17) Select data for a specific hotel timezone¶

18) Get unique trip types and their counts¶

19) Calculate the percentage of each trip type¶

20) Visualizing Trip Type Distribution with a Pie Chart¶

21) Create a bar chart to visualize regional rating trends for the top user states across hotel locations using pivoted data.¶

22) Visualizing Average Ratings with a Bar Chart¶