Statement of Completion#d5442f66
Intro to Pandas for Data Analysis
medium
Exploring Data Science Salaries: A Pandas Series Analysis
Project.ipynb
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('Data_Science_Salaries.xls')
In [3]:
df.head()
Out[3]:
| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | CA | 100 | CA | M |
In [1]:
df.info()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 df.info()
      2 df.head

NameError: name 'df' is not defined
1. Which method is used to display basic information about a pandas Series?¶
In [4]:
import pandas as pd
df = pd.read_csv('Data_Science_Salaries.xls')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   work_year           3755 non-null   int64
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary_in_usd       3755 non-null   int64
 5   employee_residence  3755 non-null   object
 6   remote_ratio        3755 non-null   int64
 7   company_location    3755 non-null   object
 8   company_size        3755 non-null   object
dtypes: int64(3), object(6)
memory usage: 264.1+ KB
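The cell above calls `info()` on the whole DataFrame. As a minimal sketch of how the same kind of summary can be obtained for one column, the block below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the salaries file) and inspects a single Series through `dtype`, `shape`, and `describe()`.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the salaries file.
toy = pd.DataFrame({
    "job_title": ["Data Engineer", "Data Analyst", None],
    "salary_in_usd": [120000, 90000, 150000],
})

# DataFrame.info() prints to stdout by default; buf= redirects it to a text buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)

# For a single column (a Series), these cover similar ground.
salaries = toy["salary_in_usd"]
print(salaries.dtype)       # element type of the column
print(salaries.shape)       # (number of elements,)
print(salaries.describe())  # count, mean, std, min, quartiles, max
```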
2. Assign the employee_residence column to the employee_residence_series variable as a Series.¶
In [6]:
# Enter your code here
# Selecting one column with df[...] already returns a Series,
# so wrapping it in pd.Series(...) is not needed.
employee_residence_series = df['employee_residence']
employee_residence_series
Out[6]:
0       ES
1       US
2       US
3       CA
4       CA
        ..
3750    US
3751    US
3752    US
3753    US
3754    IN
Name: employee_residence, Length: 3755, dtype: object
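A quick way to confirm that a column selection really is a Series is to check its type. The sketch below uses a hypothetical three-row frame (`toy`) and contrasts single brackets, which give a Series, with double brackets, which give a one-column DataFrame.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "employee_residence": ["ES", "US", "CA"],
    "remote_ratio": [100, 100, 0],
})

as_series = toy["employee_residence"]    # single label -> Series
as_frame = toy[["employee_residence"]]   # list of labels -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_series.name)  # the Series keeps the column name
```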
3. Create a Series from the experience_level column and store the first 10 elements in the experience_level_series_10 variable.¶
In [7]:
# Enter your code here
import pandas as pd
experience_series = df['experience_level']
experience_level_series_10 = experience_series[:10]
experience_level_series_10
Out[7]:
0    SE
1    MI
2    MI
3    SE
4    SE
5    SE
6    SE
7    SE
8    SE
9    SE
Name: experience_level, dtype: object
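Slicing is one way to take the first 10 elements; `head(10)` and `iloc[:10]` are explicit alternatives that are positional whatever the index labels are. A small sketch on a hypothetical Series whose index does not start at 0:

```python
import pandas as pd

# Hypothetical Series with a non-default integer index (100..111).
s = pd.Series(list("abcdefghijkl"), index=range(100, 112))

first_10_head = s.head(10)   # first 10 rows by position
first_10_iloc = s.iloc[:10]  # same selection, explicit positional slicing

print(first_10_head.equals(first_10_iloc))
print(first_10_head.index[0], first_10_head.index[-1])
```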
4. What does the len() function return when applied to a Series?¶
In [ ]:
The total number of elements.
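To illustrate the answer, the sketch below uses a hypothetical Series with one missing value. `len()` counts every element, including missing ones, while `count()` counts only the non-null ones.

```python
import numpy as np
import pandas as pd

# Hypothetical values; one entry is missing on purpose.
s = pd.Series(["SE", "MI", np.nan, "EN"])

n_total = len(s)        # all elements, missing included
n_size = s.size         # same number as len(s)
n_non_null = s.count()  # missing values are excluded

print(n_total, n_size, n_non_null)
```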
5. Find the unique values in company_size_series along with their counts, and store the results in the company_size_counts_series variable.¶
In [9]:
company_size_series = df['company_size']
company_size_counts_series = company_size_series.value_counts()
company_size_counts_series
Out[9]:
company_size
M    3153
L     454
S     148
Name: count, dtype: int64
6. Which method calculates the average value of a Series?¶
In [ ]:
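The cell above was left empty. As a hedged illustration, `Series.mean()` computes the arithmetic average; the toy values below are made up, and the second call shows how missing values change the result when `skipna=False`.

```python
import pandas as pd

# Hypothetical values with one missing entry.
s = pd.Series([100.0, None, 300.0])

default_mean = s.mean()             # missing values are skipped by default
strict_mean = s.mean(skipna=False)  # NaN propagates when skipna is disabled

print(default_mean)
print(strict_mean)
```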
7. Calculate the mean, median, and standard deviation of salary_usd_series, and store these values as a Series in the salary_details variable.¶
In [14]:
import pandas as pd
salary_usd_series = df['salary_in_usd']
# Enter your code here
mean_salary = salary_usd_series.mean()
median_salary = salary_usd_series.median()
std_salary = salary_usd_series.std()
salary_details = pd.Series({
    'Mean': mean_salary,
    'Median': median_salary,
    'Standard Deviation': std_salary
})
salary_details
Out[14]:
Mean                  137570.389880
Median                135000.000000
Standard Deviation     63055.625278
dtype: float64
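An alternative to computing the three statistics one by one is `Series.agg` with a list of function names, which returns a Series directly. This is only a sketch on hypothetical numbers (`toy_salaries`); the relabelling step mirrors the labels used in the cell above.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
toy_salaries = pd.Series([1.0, 2.0, 3.0, 4.0])

# agg with a list of names returns a Series indexed by those names.
stats = toy_salaries.agg(["mean", "median", "std"])

# Relabel the index to match the labels used in the notebook.
stats = stats.rename({
    "mean": "Mean",
    "median": "Median",
    "std": "Standard Deviation",
})
print(stats)
```

Note that `std()` uses the sample standard deviation (`ddof=1`) by default.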
8. What method would you use to count unique values in a Series?¶
In [ ]:
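This cell was left empty, and the question can be read two ways. The sketch below, on hypothetical company sizes, shows both readings: `value_counts()` gives the count of each distinct value, while `nunique()` gives the number of distinct values.

```python
import pandas as pd

# Hypothetical values for illustration.
sizes = pd.Series(["M", "L", "M", "S", "M"])

counts = sizes.value_counts()  # occurrences of each distinct value
n_distinct = sizes.nunique()   # how many distinct values exist
distinct = sizes.unique()      # the distinct values, in order of appearance

print(counts)
print(n_distinct)
print(distinct)
```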
9. Identify the top 5 most frequent job titles and store them in the top_5_job_titles variable.¶
In [16]:
import pandas as pd
job_title_series = df['job_title']
top_5_job_titles = job_title_series.value_counts().head(5)
top_5_job_titles
Out[16]:
job_title
Data Engineer                1040
Data Scientist                840
Data Analyst                  612
Machine Learning Engineer     289
Analytics Engineer            103
Name: count, dtype: int64
10. Which method would you use to find the most frequent value in a Series?¶
In [18]:
import pandas as pd
job_title_series = df['job_title']
most_fq_value_job = job_title_series.mode()
most_fq_value_job
Out[18]:
0    Data Engineer
Name: job_title, dtype: object
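`mode()` returns a Series rather than a scalar because several values can tie for most frequent. A short sketch with hypothetical titles shows the tie case and how to pull out a single value when there is only one mode.

```python
import pandas as pd

# Hypothetical data with a two-way tie.
tied = pd.Series([
    "Data Engineer", "Data Analyst",
    "Data Engineer", "Data Analyst",
    "ML Engineer",
])
modes = tied.mode()  # every tied value, sorted
print(modes)

# With a single most frequent value, .iloc[0] extracts it as a scalar.
single_mode = pd.Series(["a", "b", "a"]).mode().iloc[0]
print(single_mode)
```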
11. Calculate the 25th, 50th, and 75th percentiles of salary_usd_series and store these values as a Series in the salary_quartiles variable.¶
In [19]:
# Enter your code here
import pandas as pd
salary_usd_series = df['salary_in_usd']
percentile_25 = salary_usd_series.quantile(0.25)
percentile_50 = salary_usd_series.quantile(0.50)
percentile_75 = salary_usd_series.quantile(0.75)
salary_quartiles = pd.Series({
    '25th Percentile': percentile_25,
    '50th Percentile': percentile_50,
    '75th Percentile': percentile_75
})
salary_quartiles
Out[19]:
25th Percentile     95000.0
50th Percentile    135000.0
75th Percentile    175000.0
dtype: float64
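`quantile` also accepts a list of probabilities and then returns a Series in one call. The sketch below assumes hypothetical values (`toy`) and relabels the index afterwards to match the labels used above.

```python
import pandas as pd

# Hypothetical values for illustration.
toy = pd.Series([10, 20, 30, 40, 50])

# A list of probabilities yields a Series indexed by 0.25, 0.5, 0.75.
quartiles = toy.quantile([0.25, 0.5, 0.75])

# Replace the numeric index with descriptive labels.
quartiles.index = ["25th Percentile", "50th Percentile", "75th Percentile"]
print(quartiles)
```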
12. Which method is used to apply a function to every element in a Series?¶
In [ ]:
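This cell was left empty. `Series.apply` calls a function once per element and returns a new Series. A minimal sketch, where the data, the `salary_band` helper, and its 100,000 threshold are all hypothetical:

```python
import pandas as pd

# Hypothetical data for illustration.
titles = pd.Series(["Data Engineer", "ML Engineer"])
salaries = pd.Series([120000, 30000])

# A built-in function applied to every element.
title_lengths = titles.apply(len)

def salary_band(value):
    """Hypothetical helper: label a salary using an arbitrary threshold."""
    return "high" if value >= 100_000 else "low"

# A user-defined function applied to every element.
bands = salaries.apply(salary_band)

print(title_lengths)
print(bands)
```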
13. Create a new Series, increased_salary, by applying a 10% increase to each salary in the salary_usd_series Series.¶
In [21]:
# Enter your code here
import pandas as pd
salary_usd_series = df['salary_in_usd']
# apply() already returns a Series, so no pd.Series(...) wrapper is needed.
increased_salary = salary_usd_series.apply(lambda x: x * 1.1)
increased_salary
Out[21]:
0        94431.7
1        33000.0
2        28050.0
3       192500.0
4       132000.0
          ...
3750    453200.0
3751    166100.0
3752    115500.0
3753    110000.0
3754    104131.5
Name: salary_in_usd, Length: 3755, dtype: float64
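For plain arithmetic like this, multiplying the Series directly is a common alternative to `apply` with a lambda, since the operation is applied to the whole column at once. The sketch below, on three salaries copied from the first rows shown earlier, checks that both routes agree.

```python
import numpy as np
import pandas as pd

# First three salaries from the head() output above.
toy_salaries = pd.Series([85847, 30000, 25500])

via_apply = toy_salaries.apply(lambda x: x * 1.1)  # element by element
via_vector = toy_salaries * 1.1                    # vectorised arithmetic

# Compare with a tolerance, since the results are floats.
same = np.allclose(via_apply, via_vector)
print(same)
print(via_vector.dtype)
```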
14. What does the operation series1 > series2 return?¶
In [ ]:
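This cell was left empty. Comparing two Series with `>` works element by element and returns a boolean Series with the same index. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical Series sharing the same default index.
series1 = pd.Series([1, 5, 3])
series2 = pd.Series([2, 4, 3])

greater = series1 > series2         # boolean Series, one value per position
greater_method = series1.gt(series2)  # method form of the same comparison

print(greater)
print(greater.dtype)
```

The operator form expects both Series to carry identical index labels.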
15. Compare the increased_salary Series with the salary_usd_series element-wise to check for equality.¶
In [23]:
# Enter your code here
import pandas as pd
salary_usd_series = df['salary_in_usd']
increased_salary = salary_usd_series.apply(lambda x: x * 1.1)
# == compares element by element and returns a boolean Series.
salary_compare_series = increased_salary == salary_usd_series
salary_compare_series
Out[23]:
0       False
1       False
2       False
3       False
4       False
        ...
3750    False
3751    False
3752    False
3753    False
3754    False
Name: salary_in_usd, Length: 3755, dtype: bool
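The truncated output only shows the first and last rows, so it can help to summarise the boolean result. The sketch below uses three salaries from the rows shown earlier; it also uses `np.isclose` for the float comparison against expected values, because exact `==` on floats can be fragile.

```python
import numpy as np
import pandas as pd

# First three salaries from the head() output above.
base = pd.Series([85847, 30000, 25500])
raised = base * 1.1

equal_mask = raised.eq(base)       # same result as raised == base
any_equal = bool(equal_mask.any()) # True if at least one pair matches
n_equal = int(equal_mask.sum())    # number of matching pairs

# Tolerance-based check against the values shown in Out[21].
close_to_expected = np.isclose(raised, [94431.7, 33000.0, 28050.0])

print(any_equal, n_equal)
print(close_to_expected)
```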