Statement of Completion #50eb93da
Intro to Pandas for Data Analysis
medium
Exploring Data Science Salaries: A Pandas Series Analysis
Resolution
Activities
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('Data_Science_Salaries.xls')
In [3]:
df.head()
Out[3]:
| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
---|---|---|---|---|---|---|---|---|---|
0 | 2023 | SE | FT | Principal Data Scientist | 85847 | ES | 100 | ES | L |
1 | 2023 | MI | CT | ML Engineer | 30000 | US | 100 | US | S |
2 | 2023 | MI | CT | ML Engineer | 25500 | US | 100 | US | S |
3 | 2023 | SE | FT | Data Scientist | 175000 | CA | 100 | CA | M |
4 | 2023 | SE | FT | Data Scientist | 120000 | CA | 100 | CA | M |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   work_year           3755 non-null   int64
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary_in_usd       3755 non-null   int64
 5   employee_residence  3755 non-null   object
 6   remote_ratio        3755 non-null   int64
 7   company_location    3755 non-null   object
 8   company_size        3755 non-null   object
dtypes: int64(3), object(6)
memory usage: 264.1+ KB
1. Which method is used to display basic information about a pandas Series?
In [ ]:
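The notebook leaves this question unanswered. As a hedged sketch, the two usual candidates are `info()` and `describe()`. The example below uses the five salaries shown in the `df.head()` output above rather than the full file, and it assumes pandas 1.4 or newer, where `Series.info()` is available.

```python
import io

import pandas as pd

# The five salary_in_usd values visible in df.head(), not the full dataset
salaries = pd.Series([85847, 30000, 25500, 175000, 120000], name="salary_in_usd")

# info() writes the index range, non-null count, dtype, and memory usage.
# It prints to stdout by default; a buffer is used here so the text can be inspected.
buffer = io.StringIO()
salaries.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() returns summary statistics (count, mean, std, min, quartiles, max) as a Series
summary = salaries.describe()
print(summary)
```

`info()` describes the structure of the Series, while `describe()` summarises its values, so which one counts as "basic information" depends on how the question is read.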
2. Assign the employee_residence column to the employee_residence_series variable as a Series.
In [3]:
employee_residence_series = df['employee_residence']
3. Create a Series from the experience_level column and store the first 10 elements in the experience_level_series_10 variable.
In [7]:
experience_level_series_10 = df['experience_level'].iloc[:10]
4. What does the len() function return when applied to a Series?
In [ ]:
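A short sketch of what `len()` reports for a Series, using made-up values with one missing entry (the real columns in this dataset have no nulls, so the distinction would not show there). It contrasts `len()` with `count()`, which skips missing values.

```python
import pandas as pd

# Hypothetical values with one missing entry, for illustration only
levels = pd.Series(["SE", "MI", None, "SE"])

# len() returns the total number of elements, missing values included
n_total = len(levels)

# count() returns only the number of non-null elements
n_non_null = levels.count()

print(n_total, n_non_null)
```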
5. Find the unique values in company_size_series along with their counts, and store the results in the company_size_series_counts variable.
In [9]:
company_size_series = df['company_size']
# Store the counts under the variable name the task asks for
company_size_series_counts = company_size_series.value_counts()
6. Which method calculates the average value of a Series?
In [ ]:
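The cell above is empty in the notebook. A minimal sketch of the likely answer, `mean()`, again using the five salaries from `df.head()` rather than the whole column, with a manual check that it equals the sum divided by the number of elements:

```python
import pandas as pd

# The five salary_in_usd values visible in df.head()
salaries = pd.Series([85847, 30000, 25500, 175000, 120000])

# mean() returns the arithmetic average as a float
average = salaries.mean()

# Equivalent manual calculation: total divided by the element count
manual_average = salaries.sum() / len(salaries)

print(average, manual_average)
```

By default `mean()` skips missing values, so on a Series containing NaN it averages only the non-null entries.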
7. Calculate the mean, median, and standard deviation of salary_usd_series, and store these values as a Series in the salary_details variable.
In [15]:
salary_usd_series = df['salary_in_usd']
salary_details = pd.Series({
    'Mean': salary_usd_series.mean(),
    'Median': salary_usd_series.median(),
    'Standard Deviation': salary_usd_series.std()
})
salary_details
Out[15]:
Mean                  137570.389880
Median                135000.000000
Standard Deviation     63055.625278
dtype: float64
8. What method would you use to count unique values in a Series?
In [19]:
salary_usd_series.nunique()
Out[19]:
1035
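"Count unique values" can be read two ways, so the sketch below contrasts the three related methods on the `company_size` values shown in `df.head()`: `nunique()` gives the number of distinct values, `unique()` lists them, and `value_counts()` counts occurrences of each.

```python
import pandas as pd

# The five company_size values visible in df.head()
sizes = pd.Series(["L", "S", "S", "M", "M"])

# Number of distinct values
n_distinct = sizes.nunique()

# The distinct values themselves, in order of first appearance
distinct_values = list(sizes.unique())

# Occurrences of each distinct value, returned as a Series indexed by value
occurrences = sizes.value_counts()

print(n_distinct)
print(distinct_values)
print(occurrences)
```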
9. Identify the top 5 most frequent job titles and store them in the top_5_job_titles variable.
In [40]:
job_title_series = df['job_title'].value_counts()
top_5_job_titles = job_title_series.iloc[:5]
top_5_job_titles
Out[40]:
job_title
Data Engineer                1040
Data Scientist                840
Data Analyst                  612
Machine Learning Engineer     289
Analytics Engineer            103
Name: count, dtype: int64
10. Which method would you use to find the most frequent value in a Series?
In [45]:
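# job_title_series holds value_counts() output, so mode() on it returns the most common count (1).
# Calling mode() on the original column returns the most frequent job title instead.
df['job_title'].mode()
Out[45]:
0    Data Engineer
Name: job_title, dtype: object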
11. Calculate the 25th, 50th, and 75th percentiles of salary_usd_series and store these values as a Series in the salary_quartiles variable.
In [56]:
salary_quartiles = salary_usd_series.quantile([0.25,0.5,0.75])
salary_quartiles.index = ['25th Percentile','50th Percentile','75th Percentile']
salary_quartiles
Out[56]:
25th Percentile     95000.0
50th Percentile    135000.0
75th Percentile    175000.0
Name: salary_in_usd, dtype: float64
12. Which method is used to apply a function to every element in a Series?
In [ ]:
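This question is also left blank. A minimal sketch, assuming the intended answer is `Series.apply()`: it calls a function once per element and returns a new Series. The job titles come from `df.head()`; the helper `first_word` is a hypothetical function written for this example.

```python
import pandas as pd

# Three job_title values visible in df.head()
titles = pd.Series(["Principal Data Scientist", "ML Engineer", "Data Scientist"])


def first_word(title):
    """Hypothetical helper: return the first word of a job title."""
    return title.split()[0]


# apply() accepts a built-in function...
title_lengths = titles.apply(len)

# ...or any user-defined function that takes a single element
first_words = titles.apply(first_word)

print(title_lengths)
print(first_words)
```

For plain arithmetic, such as the 10% increase in the next task, operating on the whole Series at once is usually simpler and faster than `apply()`.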
13. Create a new Series, increased_salary, by applying a 10% increase to each salary in the salary_usd_series Series.
In [60]:
increased_salary = salary_usd_series + (salary_usd_series*10/100)
increased_salary
Out[60]:
0        94431.7
1        33000.0
2        28050.0
3       192500.0
4       132000.0
          ...
3750    453200.0
3751    166100.0
3752    115500.0
3753    110000.0
3754    104131.5
Name: salary_in_usd, Length: 3755, dtype: float64
14. What does the operation series1 > series2 return?
In [ ]:
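A small sketch of the comparison in question, using made-up numbers. The `>` operator compares the two Series element by element and returns a boolean Series of the same length. The two Series need identical index labels for the comparison to work.

```python
import pandas as pd

# Hypothetical values chosen to show all three cases: less than, equal, greater than
series1 = pd.Series([10, 20, 30])
series2 = pd.Series([15, 20, 25])

# Element-wise comparison: one True/False per position
result = series1 > series2

print(result)
```

The boolean result can be passed straight back as a filter, for example `series1[result]`, to keep only the elements where the condition holds.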
15. Compare the increased_salary Series with the salary_usd_series element-wise to check for equality.
In [65]:
salary_compare_series = increased_salary == salary_usd_series
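The comparison above yields one boolean per row, which is hard to read at 3755 rows. As a hedged sketch on made-up values, the result can be condensed with `any()`, `all()`, and `sum()`. A zero salary is included only to show a case where the increased and original values match.

```python
import pandas as pd

# Hypothetical salaries; the 0.0 entry is there to produce one equal pair
original = pd.Series([100.0, 0.0, 200.0])
increased = original + (original * 10 / 100)

# Element-wise equality check, as in the cell above
is_equal = increased == original

# Condense the boolean Series into single answers
any_equal = is_equal.any()    # is at least one pair equal?
all_equal = is_equal.all()    # are all pairs equal?
equal_count = is_equal.sum()  # how many pairs are equal (True counts as 1)

print(is_equal.tolist(), any_equal, all_equal, equal_count)
```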