Statement of Completion#c93c43f8
Intro to Pandas for Data Analysis
easy
Filtering and Selection with Dataframes
Resolution
Activities
Project.ipynb
In [1]:
import pandas as pd
In [2]:
# Lists of data
data = {'Revenue': [274515,200734,182527,181945,143015,129184,92224,85965,84893,
82345,77867,73620,69864,63191],
'Employees': [147000,267937,135301,878429,163000,197000,158000,58604,
109700,350864,110600,364800,85858,243540],
'Sector': ['Consumer Electronics','Consumer Electronics','Software Services',
'Chip Manufacturing','Software Services','Consumer Electronics',
'Consumer Electronics','Software Services','Consumer Electronics',
'Consumer Electronics','Chip Manufacturing','Software Services',
'Software Services','Consumer Electronics'],
'Founding Date':['01-04-1976','13-01-1969','04-09-1998','20-02-1974',
'04-04-1975','15-09-1987','01-02-1984','04-02-2004',
'07-04-1946','01-01-1910','18-07-1968','16-06-1911',
'11-11-1998','07-03-1918'],
'Country':['USA','South Korea','USA','Taiwan','USA','China','USA','USA',
'Japan','Japan','USA','USA','China','Japan']}
index = ['Apple','Samsung','Alphabet','Foxconn','Microsoft','Huawei',
'Dell Technologies','Meta','Sony','Hitachi','Intel','IBM',
'Tencent','Panasonic']
In [3]:
df = pd.DataFrame(data, index=index)
df["Revenue per Employee"] = df["Revenue"] / df["Employees"]
In [4]:
df
Out[4]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
| Dell Technologies | 92224 | 158000 | Consumer Electronics | 01-02-1984 | USA | 0.583696 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Sony | 84893 | 109700 | Consumer Electronics | 07-04-1946 | Japan | 0.773865 |
| Hitachi | 82345 | 350864 | Consumer Electronics | 01-01-1910 | Japan | 0.234692 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
| Tencent | 69864 | 85858 | Software Services | 11-11-1998 | China | 0.813716 |
| Panasonic | 63191 | 243540 | Consumer Electronics | 07-03-1918 | Japan | 0.259469 |
Selection Recap: Boolean Arrays¶
In [5]:
df.loc[[
True, # Apple
False, # Samsung
True, # Alphabet
False, # Foxconn
True, # Microsoft
False, # Huawei
True, # Dell
True, # Meta
False, # Sony
False, # Hitachi
True, # Intel
True, # IBM
False, # Tencent
False, # Panasonic
]]
Out[5]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Dell Technologies | 92224 | 158000 | Consumer Electronics | 01-02-1984 | USA | 0.583696 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
Result of a boolean expression in a single column:
In [6]:
df['Country'] == 'USA'
Out[6]:
Apple True Samsung False Alphabet True Foxconn False Microsoft True Huawei False Dell Technologies True Meta True Sony False Hitachi False Intel True IBM True Tencent False Panasonic False Name: Country, dtype: bool
Putting it altogether:
In [7]:
df.loc[df['Country'] == 'USA']
Out[7]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Dell Technologies | 92224 | 158000 | Consumer Electronics | 01-02-1984 | USA | 0.583696 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
In [8]:
df.loc[df['Sector'] == 'Software Services']
Out[8]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
| Tencent | 69864 | 85858 | Software Services | 11-11-1998 | China | 0.813716 |
In [9]:
df.sort_values(by='Employees')
Out[9]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Tencent | 69864 | 85858 | Software Services | 11-11-1998 | China | 0.813716 |
| Sony | 84893 | 109700 | Consumer Electronics | 07-04-1946 | Japan | 0.773865 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Dell Technologies | 92224 | 158000 | Consumer Electronics | 01-02-1984 | USA | 0.583696 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
| Panasonic | 63191 | 243540 | Consumer Electronics | 07-03-1918 | Japan | 0.259469 |
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Hitachi | 82345 | 350864 | Consumer Electronics | 01-01-1910 | Japan | 0.234692 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
Activities¶
1. Select companies in the Sector 'Software Services'¶
In [11]:
df_software = df.loc[df['Sector'] == "Software Services"]
2. Select those companies with less Employees than Apple¶
In [14]:
df_low_employees = df.loc[df['Employees'] < df.loc['Apple', 'Employees']]
df_low_employees
Out[14]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Sony | 84893 | 109700 | Consumer Electronics | 07-04-1946 | Japan | 0.773865 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| Tencent | 69864 | 85858 | Software Services | 11-11-1998 | China | 0.813716 |
3. Filter companies with Revenue above the average¶
In [15]:
df_high_revenue = df.loc[df['Revenue'] > df['Revenue'].mean()]
Combining Expression with Boolean Operators¶
Companies with Revenue greater than the mean:
In [19]:
df.loc[df['Revenue'] > df['Revenue'].mean()]
Out[19]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
Companies with Revenue greater than the mean and from the USA:
In [20]:
df.loc[(df['Revenue'] > df['Revenue'].mean()) & (df['Country'] == 'USA')]
Out[20]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
Companies that are NOT American (note the ~ operator):
In [21]:
df.loc[(df['Revenue'] > df['Revenue'].mean()) & ~(df['Country'] == 'USA')]
Out[21]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
Which is equivalent to:
In [22]:
df.loc[(df['Revenue'] > df['Revenue'].mean()) & (df['Country'] != 'USA')]
Out[22]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
Important! Parentheses are needed between the expressions¶
The lack of parentheses might yield unexpected results or errors:
In [ ]:
df.loc[(df['Revenue'] > df['Revenue'].mean()) & df['Country'] != 'USA']
Activities¶
1. Select non-American companies with Employees above the mean¶
In [23]:
df_employee_non_usa = df.loc[(df['Employees'] > df['Employees'].mean()) & (df['Country'] != "USA")]
2. Find those non-american companies with Revenue per Employee below the median¶
In [30]:
df_low_rev_employee = df.loc[
(df['Revenue per Employee'] < df['Revenue per Employee'].median()) &
(df['Country'] != "USA")
]
3. American companies, from 'Software Services' sector, that have Revenue/employee above the mean¶
In [32]:
df_usa_software_high_efficiency = df_low_rev_employee = df.loc[(df['Sector'] == 'Software Services') & (df['Revenue per Employee'] > df['Revenue per Employee'].mean()) & (df['Country'] == "USA")]
In [33]:
df_usa_software_high_efficiency
Out[33]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
The .query() method¶
Companies from the USA:¶
In [35]:
df.query("Country == 'USA'")
Out[35]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Apple | 274515 | 147000 | Consumer Electronics | 01-04-1976 | USA | 1.867449 |
| Alphabet | 182527 | 135301 | Software Services | 04-09-1998 | USA | 1.349044 |
| Microsoft | 143015 | 163000 | Software Services | 04-04-1975 | USA | 0.877393 |
| Dell Technologies | 92224 | 158000 | Consumer Electronics | 01-02-1984 | USA | 0.583696 |
| Meta | 85965 | 58604 | Software Services | 04-02-2004 | USA | 1.466879 |
| Intel | 77867 | 110600 | Chip Manufacturing | 18-07-1968 | USA | 0.704042 |
| IBM | 73620 | 364800 | Software Services | 16-06-1911 | USA | 0.201809 |
Companies from countries that are NOT the USA:¶
In [36]:
df.query("Country != 'USA'")
Out[36]:
| Revenue | Employees | Sector | Founding Date | Country | Revenue per Employee | |
|---|---|---|---|---|---|---|
| Samsung | 200734 | 267937 | Consumer Electronics | 13-01-1969 | South Korea | 0.749184 |
| Foxconn | 181945 | 878429 | Chip Manufacturing | 20-02-1974 | Taiwan | 0.207125 |
| Huawei | 129184 | 197000 | Consumer Electronics | 15-09-1987 | China | 0.655756 |
| Sony | 84893 | 109700 | Consumer Electronics | 07-04-1946 | Japan | 0.773865 |
| Hitachi | 82345 | 350864 | Consumer Electronics | 01-01-1910 | Japan | 0.234692 |
| Tencent | 69864 | 85858 | Software Services | 11-11-1998 | China | 0.813716 |
| Panasonic | 63191 | 243540 | Consumer Electronics | 07-03-1918 | Japan | 0.259469 |
Boolean Operators¶
Companies NOT from the USA and from the "Consumer Electronics" sector:¶
In [ ]:
df.query("Country != 'USA' and Sector == 'Consumer Electronics'")
Same as above, but using the not operator¶
In [ ]:
df.query("Sector == 'Consumer Electronics' and not (Country == 'USA')")
Companies that are from China or Taiwan¶
In [ ]:
df.query("Country == 'China' or Country == 'Taiwan'")
Special characters: variables and columns with whitespaces¶
Querying a column with whitespaces in its name:
In [ ]:
df.query("`Founding Date` == '04-02-2004'")
Referencing extrenal variables:
In [ ]:
mean_revenue = df['Revenue'].mean()
df.query("Revenue > @mean_revenue")
Another example:
In [ ]:
median_rev_per_employee = df['Revenue per Employee'].median()
df.query("`Revenue per Employee` > @median_rev_per_employee")
Complex expressions in .loc¶
In [ ]:
df.loc[df['Revenue per Employee'] > 1]
Bye bye Revenue per Employee:
In [ ]:
del df['Revenue per Employee']
df.head()
In [ ]:
df.loc[
(df['Revenue'] / df['Employees']) > 1
]
In [ ]:
(df['Revenue'] / df['Employees'])
In [ ]:
(df['Revenue'] / df['Employees']) > 1