In [1]:

import pandas as pd

In [2]:

# Lists of data
data = {'Revenue': [274515,200734,182527,181945,143015,129184,92224,85965,84893,
                    82345,77867,73620,69864,63191],
        'Employees': [147000,267937,135301,878429,163000,197000,158000,58604,
                      109700,350864,110600,364800,85858,243540],
        'Sector': ['Consumer Electronics','Consumer Electronics','Software Services',
                   'Chip Manufacturing','Software Services','Consumer Electronics',
                   'Consumer Electronics','Software Services','Consumer Electronics',
                   'Consumer Electronics','Chip Manufacturing','Software Services',
                   'Software Services','Consumer Electronics'],
        'Founding Date':['01-04-1976','13-01-1969','04-09-1998','20-02-1974',
                         '04-04-1975','15-09-1987','01-02-1984','04-02-2004',
                         '07-04-1946','01-01-1910','18-07-1968','16-06-1911',
                         '11-11-1998','07-03-1918'],
        'Country':['USA','South Korea','USA','Taiwan','USA','China','USA','USA',
                   'Japan','Japan','USA','USA','China','Japan']} 
index = ['Apple','Samsung','Alphabet','Foxconn','Microsoft','Huawei',
         'Dell Technologies','Meta','Sony','Hitachi','Intel','IBM',
         'Tencent','Panasonic']

In [3]:

df = pd.DataFrame(data, index=index)
df["Revenue per Employee"] = df["Revenue"] / df["Employees"]

In [4]:

df

Out[4]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756
Dell Technologies	92224	158000	Consumer Electronics	01-02-1984	USA	0.583696
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Sony	84893	109700	Consumer Electronics	07-04-1946	Japan	0.773865
Hitachi	82345	350864	Consumer Electronics	01-01-1910	Japan	0.234692
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809
Tencent	69864	85858	Software Services	11-11-1998	China	0.813716
Panasonic	63191	243540	Consumer Electronics	07-03-1918	Japan	0.259469

Selection Recap: Boolean Arrays¶

In [5]:

df.loc[[
    True,      # Apple
    False,     # Samsung
    True,      # Alphabet
    False,     # Foxconn
    True,      # Microsoft
    False,     # Huawei
    True,      # Dell
    True,      # Meta
    False,     # Sony
    False,     # Hitachi
    True,      # Intel
    True,      # IBM
    False,     # Tencent
    False,     # Panasonic
]]

Out[5]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Dell Technologies	92224	158000	Consumer Electronics	01-02-1984	USA	0.583696
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809

Result of a boolean expression in a single column:

In [6]:

df['Country'] == 'USA'

Out[6]:

Apple                 True
Samsung              False
Alphabet              True
Foxconn              False
Microsoft             True
Huawei               False
Dell Technologies     True
Meta                  True
Sony                 False
Hitachi              False
Intel                 True
IBM                   True
Tencent              False
Panasonic            False
Name: Country, dtype: bool

Putting it altogether:

In [7]:

df.loc[df['Country'] == 'USA']

Out[7]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Dell Technologies	92224	158000	Consumer Electronics	01-02-1984	USA	0.583696
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809

In [8]:

df.loc[df['Sector'] == 'Software Services']

Out[8]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809
Tencent	69864	85858	Software Services	11-11-1998	China	0.813716

In [9]:

df.sort_values(by='Employees')

Out[9]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Tencent	69864	85858	Software Services	11-11-1998	China	0.813716
Sony	84893	109700	Consumer Electronics	07-04-1946	Japan	0.773865
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Dell Technologies	92224	158000	Consumer Electronics	01-02-1984	USA	0.583696
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756
Panasonic	63191	243540	Consumer Electronics	07-03-1918	Japan	0.259469
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Hitachi	82345	350864	Consumer Electronics	01-01-1910	Japan	0.234692
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125

Activities¶

1. Select companies in the Sector 'Software Services'¶

In [11]:

df_software = df.loc[df['Sector'] == "Software Services"]

2. Select those companies with less Employees than Apple¶

In [14]:

df_low_employees = df.loc[df['Employees'] < df.loc['Apple', 'Employees']]
df_low_employees

Out[14]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Sony	84893	109700	Consumer Electronics	07-04-1946	Japan	0.773865
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
Tencent	69864	85858	Software Services	11-11-1998	China	0.813716

3. Filter companies with Revenue above the average¶

In [15]:

df_high_revenue = df.loc[df['Revenue'] > df['Revenue'].mean()]

Combining Expression with Boolean Operators¶

Companies with Revenue greater than the mean:

In [19]:

df.loc[df['Revenue'] > df['Revenue'].mean()]

Out[19]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756

Companies with Revenue greater than the mean and from the USA:

In [20]:

df.loc[(df['Revenue'] > df['Revenue'].mean()) & (df['Country'] == 'USA')]

Out[20]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393

Companies that are NOT American (note the ~ operator):

In [21]:

df.loc[(df['Revenue'] > df['Revenue'].mean()) & ~(df['Country'] == 'USA')]

Out[21]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756

Which is equivalent to:

In [22]:

df.loc[(df['Revenue'] > df['Revenue'].mean()) & (df['Country'] != 'USA')]

Out[22]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756

Important! Parentheses are needed between the expressions¶

The lack of parentheses might yield unexpected results or errors:

In [ ]:

df.loc[(df['Revenue'] > df['Revenue'].mean()) & df['Country'] != 'USA']

Activities¶

1. Select non-American companies with Employees above the mean¶

In [23]:

df_employee_non_usa = df.loc[(df['Employees'] > df['Employees'].mean()) & (df['Country'] != "USA")]

2. Find those non-american companies with Revenue per Employee below the median¶

In [30]:

df_low_rev_employee = df.loc[
    (df['Revenue per Employee'] < df['Revenue per Employee'].median()) & 
    (df['Country'] != "USA")
]

3. American companies, from 'Software Services' sector, that have Revenue/employee above the mean¶

In [32]:

df_usa_software_high_efficiency = df_low_rev_employee = df.loc[(df['Sector'] == 'Software Services') & (df['Revenue per Employee'] > df['Revenue per Employee'].mean()) & (df['Country'] == "USA")]

In [33]:

df_usa_software_high_efficiency

Out[33]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879

The `.query()` method¶

Companies from the USA:¶

In [35]:

df.query("Country == 'USA'")

Out[35]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Apple	274515	147000	Consumer Electronics	01-04-1976	USA	1.867449
Alphabet	182527	135301	Software Services	04-09-1998	USA	1.349044
Microsoft	143015	163000	Software Services	04-04-1975	USA	0.877393
Dell Technologies	92224	158000	Consumer Electronics	01-02-1984	USA	0.583696
Meta	85965	58604	Software Services	04-02-2004	USA	1.466879
Intel	77867	110600	Chip Manufacturing	18-07-1968	USA	0.704042
IBM	73620	364800	Software Services	16-06-1911	USA	0.201809

Companies from countries that are NOT the USA:¶

In [36]:

df.query("Country != 'USA'")

Out[36]:

	Revenue	Employees	Sector	Founding Date	Country	Revenue per Employee
Samsung	200734	267937	Consumer Electronics	13-01-1969	South Korea	0.749184
Foxconn	181945	878429	Chip Manufacturing	20-02-1974	Taiwan	0.207125
Huawei	129184	197000	Consumer Electronics	15-09-1987	China	0.655756
Sony	84893	109700	Consumer Electronics	07-04-1946	Japan	0.773865
Hitachi	82345	350864	Consumer Electronics	01-01-1910	Japan	0.234692
Tencent	69864	85858	Software Services	11-11-1998	China	0.813716
Panasonic	63191	243540	Consumer Electronics	07-03-1918	Japan	0.259469

Boolean Operators¶

Companies NOT from the USA and from the "Consumer Electronics" sector:¶

In [ ]:

df.query("Country != 'USA' and Sector == 'Consumer Electronics'")

Same as above, but using the `not` operator¶

In [ ]:

df.query("Sector == 'Consumer Electronics' and not (Country == 'USA')")

Companies that are from China or Taiwan¶

In [ ]:

df.query("Country == 'China' or Country == 'Taiwan'")

Special characters: variables and columns with whitespaces¶

Querying a column with whitespaces in its name:

In [ ]:

df.query("`Founding Date` == '04-02-2004'")

Referencing extrenal variables:

In [ ]:

mean_revenue = df['Revenue'].mean()
df.query("Revenue > @mean_revenue")

Another example:

In [ ]:

median_rev_per_employee = df['Revenue per Employee'].median()
df.query("`Revenue per Employee` > @median_rev_per_employee")

Complex expressions in `.loc`¶

In [ ]:

df.loc[df['Revenue per Employee'] > 1]

Bye bye Revenue per Employee:

In [ ]:

del df['Revenue per Employee']
df.head()

In [ ]:

df.loc[
    (df['Revenue'] / df['Employees']) > 1
]

In [ ]:

(df['Revenue'] / df['Employees'])

In [ ]:

(df['Revenue'] / df['Employees']) > 1

Statement of Completion#c93c43f8

Intro to Pandas for Data Analysis

Filtering and Selection with Dataframes

Selection Recap: Boolean Arrays¶

Activities¶

1. Select companies in the Sector 'Software Services'¶

2. Select those companies with less Employees than Apple¶

3. Filter companies with Revenue above the average¶

Combining Expression with Boolean Operators¶

Important! Parentheses are needed between the expressions¶

Activities¶

1. Select non-American companies with Employees above the mean¶

2. Find those non-american companies with Revenue per Employee below the median¶

3. American companies, from 'Software Services' sector, that have Revenue/employee above the mean¶

The `.query()` method¶

Companies from the USA:¶

Companies from countries that are NOT the USA:¶

Boolean Operators¶

Companies NOT from the USA and from the "Consumer Electronics" sector:¶

Same as above, but using the `not` operator¶

Companies that are from China or Taiwan¶

Special characters: variables and columns with whitespaces¶

Complex expressions in `.loc`¶

Statement of Completion#c93c43f8

Intro to Pandas for Data Analysis

Filtering and Selection with Dataframes

Selection Recap: Boolean Arrays¶

Activities¶

1. Select companies in the Sector 'Software Services'¶

2. Select those companies with less Employees than Apple¶

3. Filter companies with Revenue above the average¶

Combining Expression with Boolean Operators¶

Important! Parentheses are needed between the expressions¶

Activities¶

1. Select non-American companies with Employees above the mean¶

2. Find those non-american companies with Revenue per Employee below the median¶

3. American companies, from 'Software Services' sector, that have Revenue/employee above the mean¶

The .query() method¶

Companies from the USA:¶

Companies from countries that are NOT the USA:¶

Boolean Operators¶

Companies NOT from the USA and from the "Consumer Electronics" sector:¶

Same as above, but using the not operator¶

Companies that are from China or Taiwan¶

Special characters: variables and columns with whitespaces¶

Complex expressions in .loc¶

The `.query()` method¶

Same as above, but using the `not` operator¶

Complex expressions in `.loc`¶