Statement of Completion#96308455
Intro to Pandas for Data Analysis
medium
Spotify Data Explorer: Honing DataFrame Mutation Techniques
Resolution
Activities
Honing DataFrame Mutation Techniques Using Spotify Data¶
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('data/data.csv', low_memory=False)
In [3]:
df.head()
Out[3]:
id | name | artists | duration_ms | release_date | year | acousticness | danceability | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | mode | key | popularity | explicit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6KbQ3uYMLKb5jDxLF7wYDD | Singende Bataillone 1. Teil | ['Carl Woitschach'] | 158648 | 1928 | 1928 | 0.995 | 0.708 | 0.1950 | 0.563 | 0.1510 | -12.428 | 0.0506 | 118.469 | 0.7790 | 1 | 10 | 0 | 0 |
1 | 6KuQTIu1KoTTkLXKrwlLPV | Fantasiestücke, Op. 111: Più tosto lento | ['Robert Schumann', 'Vladimir Horowitz'] | 282133 | 1928 | 1928 | 0.994 | 0.379 | 0.0135 | 0.901 | 0.0763 | -28.454 | 0.0462 | 83.972 | 0.0767 | 1 | 8 | 0 | 0 |
2 | 6L63VW0PibdM1HDSBoqnoM | Chapter 1.18 - Zamek kaniowski | ['Seweryn Goszczyński'] | 104300 | 1928 | 1928 | 0.604 | 0.749 | 0.2200 | 0.000 | 0.1190 | -19.924 | 0.9290 | 107.177 | 0.8800 | 0 | 5 | 0 | 0 |
3 | 6M94FkXd15sOAOQYRnWPN8 | Bebamos Juntos - Instrumental (Remasterizado) | ['Francisco Canaro'] | 180760 | 9/25/28 | 1928 | 0.995 | 0.781 | 0.1300 | 0.887 | 0.1110 | -14.734 | 0.0926 | 108.003 | 0.7200 | 0 | 1 | 0 | 0 |
4 | 6N6tiFZ9vLTSOIxkj8qKrd | Polonaise-Fantaisie in A-Flat Major, Op. 61 | ['Frédéric Chopin', 'Vladimir Horowitz'] | 687733 | 1928 | 1928 | 0.990 | 0.210 | 0.2040 | 0.908 | 0.0980 | -16.829 | 0.0424 | 62.149 | 0.0693 | 1 | 11 | 1 | 0 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 169909 entries, 0 to 169908 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 169909 non-null object 1 name 169909 non-null object 2 artists 169909 non-null object 3 duration_ms 169909 non-null int64 4 release_date 169909 non-null object 5 year 169909 non-null int64 6 acousticness 169909 non-null float64 7 danceability 169909 non-null float64 8 energy 169909 non-null float64 9 instrumentalness 169909 non-null float64 10 liveness 169909 non-null float64 11 loudness 169909 non-null float64 12 speechiness 169909 non-null float64 13 tempo 169909 non-null float64 14 valence 169909 non-null float64 15 mode 169909 non-null int64 16 key 169909 non-null int64 17 popularity 169909 non-null int64 18 explicit 169909 non-null int64 dtypes: float64(9), int64(6), object(4) memory usage: 24.6+ MB
1) Rename the acousticness
column to acoustic_level
¶
In [6]:
# Write your code here
df.rename(columns = {'acousticness': 'acoustic_level'}, inplace = True)
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 169909 entries, 0 to 169908 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 169909 non-null object 1 name 169909 non-null object 2 artists 169909 non-null object 3 duration_ms 169909 non-null int64 4 release_date 169909 non-null object 5 year 169909 non-null int64 6 acoustic_level 169909 non-null float64 7 danceability 169909 non-null float64 8 energy 169909 non-null float64 9 instrumentalness 169909 non-null float64 10 liveness 169909 non-null float64 11 loudness 169909 non-null float64 12 speechiness 169909 non-null float64 13 tempo 169909 non-null float64 14 valence 169909 non-null float64 15 mode 169909 non-null int64 16 key 169909 non-null int64 17 popularity 169909 non-null int64 18 explicit 169909 non-null int64 dtypes: float64(9), int64(6), object(4) memory usage: 25.9+ MB
2) Rename Multiple Columns Using the rename()
Function¶
In [8]:
# Write your code here
df.rename(columns = {'danceability': 'dance_score',
'duration_ms': 'duration_milliseconds',
'instrumentalness': 'instrumental',
'liveness': 'live_performance',
'speechiness': 'speech_presence'}, inplace = True)
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 169909 entries, 0 to 169908 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 169909 non-null object 1 name 169909 non-null object 2 artists 169909 non-null object 3 duration_milliseconds 169909 non-null int64 4 release_date 169909 non-null object 5 year 169909 non-null int64 6 acoustic_level 169909 non-null float64 7 dance_score 169909 non-null float64 8 energy 169909 non-null float64 9 instrumental 169909 non-null float64 10 live_performance 169909 non-null float64 11 loudness 169909 non-null float64 12 speech_presence 169909 non-null float64 13 tempo 169909 non-null float64 14 valence 169909 non-null float64 15 mode 169909 non-null int64 16 key 169909 non-null int64 17 popularity 169909 non-null int64 18 explicit 169909 non-null int64 dtypes: float64(9), int64(6), object(4) memory usage: 25.9+ MB
3) Add a new column called duration_seconds
that converts the duration_milliseconds
column from milliseconds to seconds¶
In [18]:
# Write your code here
df['duration_seconds'] = df['duration_milliseconds'] / 1000
In [17]:
df['duration_milliseconds'] / 1000
Out[17]:
0 158.648 1 282.133 2 104.300 3 180.760 4 687.733 ... 169904 163.800 169905 167.468 169906 180.700 169907 167.308 169908 214.787 Name: duration_milliseconds, Length: 169909, dtype: float64
4) Add a new column called popularity_score
that multiplies the popularity
column by 0.01¶
In [20]:
# Write your code here
df['popularity_score'] = df['popularity'] * 0.01
In [22]:
df.head(5)
Out[22]:
id | name | artists | duration_milliseconds | release_date | year | acoustic_level | dance_score | energy | instrumental | ... | loudness | speech_presence | tempo | valence | mode | key | popularity | explicit | duration_seconds | popularity_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6KbQ3uYMLKb5jDxLF7wYDD | Singende Bataillone 1. Teil | ['Carl Woitschach'] | 158648 | 1928 | 1928 | 0.995 | 0.708 | 0.1950 | 0.563 | ... | -12.428 | 0.0506 | 118.469 | 0.7790 | 1 | 10 | 0 | 0 | 158.648 | 0.00 |
1 | 6KuQTIu1KoTTkLXKrwlLPV | Fantasiestücke, Op. 111: Più tosto lento | ['Robert Schumann', 'Vladimir Horowitz'] | 282133 | 1928 | 1928 | 0.994 | 0.379 | 0.0135 | 0.901 | ... | -28.454 | 0.0462 | 83.972 | 0.0767 | 1 | 8 | 0 | 0 | 282.133 | 0.00 |
2 | 6L63VW0PibdM1HDSBoqnoM | Chapter 1.18 - Zamek kaniowski | ['Seweryn Goszczyński'] | 104300 | 1928 | 1928 | 0.604 | 0.749 | 0.2200 | 0.000 | ... | -19.924 | 0.9290 | 107.177 | 0.8800 | 0 | 5 | 0 | 0 | 104.300 | 0.00 |
3 | 6M94FkXd15sOAOQYRnWPN8 | Bebamos Juntos - Instrumental (Remasterizado) | ['Francisco Canaro'] | 180760 | 9/25/28 | 1928 | 0.995 | 0.781 | 0.1300 | 0.887 | ... | -14.734 | 0.0926 | 108.003 | 0.7200 | 0 | 1 | 0 | 0 | 180.760 | 0.00 |
4 | 6N6tiFZ9vLTSOIxkj8qKrd | Polonaise-Fantaisie in A-Flat Major, Op. 61 | ['Frédéric Chopin', 'Vladimir Horowitz'] | 687733 | 1928 | 1928 | 0.990 | 0.210 | 0.2040 | 0.908 | ... | -16.829 | 0.0424 | 62.149 | 0.0693 | 1 | 11 | 1 | 0 | 687.733 | 0.01 |
5 rows × 21 columns
5) Add a new column called is_popular
that assigns 1 to songs with popularity
greater than 70 and 0 otherwise¶
In [25]:
(df['popularity'] > 70).astype(int)
Out[25]:
0 0 1 0 2 0 3 0 4 0 .. 169904 1 169905 0 169906 0 169907 0 169908 0 Name: popularity, Length: 169909, dtype: int64
In [26]:
# Write your code here
df['is_popular'] = (df['popularity'] > 70).astype(int)
6) Add a new column called artist_count
that counts the number of artists in the artists
column¶
In [ ]:
In [50]:
df['artists'].str.count(',')+1
Out[50]:
0 1 1 2 2 1 3 1 4 2 .. 169904 2 169905 2 169906 2 169907 2 169908 1 Name: artists, Length: 169909, dtype: int64
In [51]:
# Write your code here
df['artist_count'] = df['artists'].str.count(',') + 1
In [ ]:
7) Add a new column called duration_minutes
that calculates the duration in minutes from the duration_seconds
column¶
In [53]:
# Write your code here
df['duration_minutes'] = df['duration_seconds'] / 60
8) Update the popularity
column by adding 10 to each value¶
In [55]:
# Write your code here
df['popularity'] = df['popularity'] + 10
9) Update the speech_presence
column by multiplying each value by 0.8¶
In [57]:
# Write your code here
df['speech_presence'] = df['speech_presence'] * 0.8
In [ ]:
In [60]:
df.head().T
Out[60]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
id | 6KbQ3uYMLKb5jDxLF7wYDD | 6KuQTIu1KoTTkLXKrwlLPV | 6L63VW0PibdM1HDSBoqnoM | 6M94FkXd15sOAOQYRnWPN8 | 6N6tiFZ9vLTSOIxkj8qKrd |
name | Singende Bataillone 1. Teil | Fantasiestücke, Op. 111: Più tosto lento | Chapter 1.18 - Zamek kaniowski | Bebamos Juntos - Instrumental (Remasterizado) | Polonaise-Fantaisie in A-Flat Major, Op. 61 |
artists | ['Carl Woitschach'] | ['Robert Schumann', 'Vladimir Horowitz'] | ['Seweryn Goszczyński'] | ['Francisco Canaro'] | ['Frédéric Chopin', 'Vladimir Horowitz'] |
duration_milliseconds | 158648 | 282133 | 104300 | 180760 | 687733 |
release_date | 1928 | 1928 | 1928 | 9/25/28 | 1928 |
year | 1928 | 1928 | 1928 | 1928 | 1928 |
acoustic_level | 0.995 | 0.994 | 0.604 | 0.995 | 0.99 |
dance_score | 0.708 | 0.379 | 0.749 | 0.781 | 0.21 |
energy | 0.195 | 0.0135 | 0.22 | 0.13 | 0.204 |
instrumental | 0.563 | 0.901 | 0.0 | 0.887 | 0.908 |
live_performance | 0.151 | 0.0763 | 0.119 | 0.111 | 0.098 |
loudness | -12.428 | -28.454 | -19.924 | -14.734 | -16.829 |
speech_presence | 0.04048 | 0.03696 | 0.7432 | 0.07408 | 0.03392 |
tempo | 118.469 | 83.972 | 107.177 | 108.003 | 62.149 |
valence | 0.779 | 0.0767 | 0.88 | 0.72 | 0.0693 |
mode | 1 | 1 | 0 | 0 | 1 |
key | 10 | 8 | 5 | 1 | 11 |
popularity | 10 | 10 | 10 | 10 | 11 |
explicit | 0 | 0 | 0 | 0 | 0 |
duration_seconds | 158.648 | 282.133 | 104.3 | 180.76 | 687.733 |
popularity_score | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
is_popular | 0 | 0 | 0 | 0 | 0 |
artist_count | 1 | 2 | 1 | 1 | 2 |
duration_minutes | 2.644133 | 4.702217 | 1.738333 | 3.012667 | 11.462217 |
10) Update the dance_score
column by subtracting 0.1 from each value¶
In [61]:
# Write your code here
df['dance_score'] = df['dance_score'] - 0.1
11) Update the mode
column by replacing 0 with 'Minor' and 1 with 'Major'¶
In [64]:
df['mode'].map(lambda x: 'Major' if x == 1 else 'Minor')
Out[64]:
0 Major 1 Major 2 Minor 3 Minor 4 Major ... 169904 Major 169905 Major 169906 Major 169907 Major 169908 Major Name: mode, Length: 169909, dtype: object
In [65]:
# Write your code here
df['mode'] = df['mode'].map(lambda x: 'Major' if x == 1 else 'Minor')
12) Update the tempo
column by setting values greater than 150 to 150¶
In [67]:
df['tempo'] = df['tempo'].clip(upper=150)
In [ ]:
# Write your code here
13) Replace the key
column values 0, 1, and 2 with 'C', 'C#', and 'D', respectively¶
In [69]:
# Write your code here
df['key'] = df['key'].replace({0: 'C', 1: 'C#', 2: 'D', 3: 'D#', 4: 'E', 5: 'F', 6: 'F#', 7: 'G', 8:'G#', 9:'A', 10: 'A#', 11: 'B'})
In [72]:
df.head().T
Out[72]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
id | 6KbQ3uYMLKb5jDxLF7wYDD | 6KuQTIu1KoTTkLXKrwlLPV | 6L63VW0PibdM1HDSBoqnoM | 6M94FkXd15sOAOQYRnWPN8 | 6N6tiFZ9vLTSOIxkj8qKrd |
name | Singende Bataillone 1. Teil | Fantasiestücke, Op. 111: Più tosto lento | Chapter 1.18 - Zamek kaniowski | Bebamos Juntos - Instrumental (Remasterizado) | Polonaise-Fantaisie in A-Flat Major, Op. 61 |
artists | ['Carl Woitschach'] | ['Robert Schumann', 'Vladimir Horowitz'] | ['Seweryn Goszczyński'] | ['Francisco Canaro'] | ['Frédéric Chopin', 'Vladimir Horowitz'] |
duration_milliseconds | 158648 | 282133 | 104300 | 180760 | 687733 |
release_date | 1928 | 1928 | 1928 | 9/25/28 | 1928 |
year | 1928 | 1928 | 1928 | 1928 | 1928 |
acoustic_level | 0.995 | 0.994 | 0.604 | 0.995 | 0.99 |
dance_score | 0.608 | 0.279 | 0.649 | 0.681 | 0.11 |
energy | 0.195 | 0.0135 | 0.22 | 0.13 | 0.204 |
instrumental | 0.563 | 0.901 | 0.0 | 0.887 | 0.908 |
live_performance | 0.151 | 0.0763 | 0.119 | 0.111 | 0.098 |
loudness | -12.428 | -28.454 | -19.924 | -14.734 | -16.829 |
speech_presence | 0.04048 | 0.03696 | 0.7432 | 0.07408 | 0.03392 |
tempo | 118.469 | 83.972 | 107.177 | 108.003 | 62.149 |
valence | 0.779 | 0.0767 | 0.88 | 0.72 | 0.0693 |
mode | Major | Major | Minor | Minor | Major |
key | A# | G# | F | C# | B |
popularity | 10 | 10 | 10 | 10 | 11 |
explicit | 0 | 0 | 0 | 0 | 0 |
duration_seconds | 158.648 | 282.133 | 104.3 | 180.76 | 687.733 |
popularity_score | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
is_popular | 0 | 0 | 0 | 0 | 0 |
artist_count | 1 | 2 | 1 | 1 | 2 |
duration_minutes | 2.644133 | 4.702217 | 1.738333 | 3.012667 | 11.462217 |
14) Replace the explicit
column values 0 and 1 with 'Not Explicit' and 'Explicit', respectively¶
In [73]:
# Write your code here
df['explicit'] = df['explicit'].replace({0: 'Not Explicit', 1: 'Explicit'})
15) Replace the year
column values before 1950 with 1950¶
In [76]:
# Write your code here
df['year'] = df['year'].clip(lower = 1950)
16) Replace the tempo
column values above 150 with 150 and values below 50 with 50¶
In [79]:
# Write your code here
df['tempo'] = df['tempo'].clip(lower = 50, upper = 150)
In [ ]: