Intro to Pandas for Data Analysis
Intro to Pandas Series
In [1]:
import pandas as pd
Intro to Series
Take a look at the following list of companies:
We'll represent them using a Series in the following way:
In [2]:
companies = [
'Apple', 'Samsung', 'Alphabet', 'Foxconn',
'Microsoft', 'Huawei', 'Dell Technologies',
'Meta', 'Sony', 'Hitachi', 'Intel',
'IBM', 'Tencent', 'Panasonic'
]
In [3]:
s = pd.Series([
274515, 200734, 182527, 181945, 143015,
129184, 92224, 85965, 84893, 82345,
77867, 73620, 69864, 63191],
index=companies,
name="Top Technology Companies by Revenue")
In [4]:
s
Out[4]:
Apple                274515
Samsung              200734
Alphabet             182527
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technologies     92224
Meta                  85965
Sony                  84893
Hitachi               82345
Intel                 77867
IBM                   73620
Tencent               69864
Panasonic             63191
Name: Top Technology Companies by Revenue, dtype: int64
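A Series can also be built from a dictionary, which keeps each label next to its value. The sketch below is only an illustration: it uses the first three companies from the Series above, and the names `revenue_by_company` and `s_from_dict` are made up for this example.

```python
import pandas as pd

# Hypothetical names; the figures are the first three values of the Series above.
revenue_by_company = {
    "Apple": 274515,
    "Samsung": 200734,
    "Alphabet": 182527,
}

# Dictionary keys become the index, dictionary values become the data.
s_from_dict = pd.Series(
    revenue_by_company,
    name="Top Technology Companies by Revenue",
)
print(s_from_dict)
```

Both forms should produce the same kind of object; the dictionary form simply makes it harder to misalign labels and values.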
1. Check your knowledge: build a series
Create a series called my_series
In [12]:
my_series = pd.Series([9,11,-5],index=['a','b','c'], name="My First Series")
In [13]:
my_series
Out[13]:
a     9
b    11
c    -5
Name: My First Series, dtype: int64
Basic selection and location
Selecting by index:
In [6]:
s['Apple']
Out[6]:
274515
.loc is the preferred way:
In [7]:
s.loc['Apple']
Out[7]:
274515
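One reason to prefer .loc shows up when the labels are themselves integers: plain square brackets can then be read as either a label or a position, while .loc and .iloc state the intent explicitly. The sketch below assumes a small Series with integer labels; `by_year` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions.
by_year = pd.Series([100, 200, 300], index=[2021, 2022, 2023])

# .loc always looks up by label.
print(by_year.loc[2022])   # the value labelled 2022

# .iloc always looks up by position.
print(by_year.iloc[0])     # the first value, whatever its label
```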
Selection by position:
In [8]:
s.iloc[0]
Out[8]:
274515
In [9]:
s.iloc[-1]
Out[9]:
63191
Errors in selection:
In [14]:
# this code will fail
s.loc["Non existent company"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Non existent company'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[14], line 2
      1 # this code will fail
----> 2 s.loc["Non existent company"]

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis)
   1429 # fall thru to straight lookup
   1430 self._validate_key(key, axis)
-> 1431 return self._get_label(key, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis)
   1379 def _get_label(self, label, axis: AxisInt):
   1380     # GH#5567 this will fail if the label is not present in the axis.
-> 1381     return self.obj.xs(label, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level)
   4299     new_index = index[loc]
   4300 else:
-> 4301     loc = index.get_loc(key)
   4303     if isinstance(loc, np.ndarray):
   4304         if loc.dtype == np.bool_:

File /usr/local/lib/python3.11/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     # InvalidIndexError. Otherwise we fall through and re-raise
   3816     # the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 'Non existent company'
In [15]:
# This code also fails: position 132 is out of bounds
# (the Series does not have that many elements)
s.iloc[132]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[15], line 3
      1 # This code also fails: position 132 is out of bounds
      2 # (the Series does not have that many elements)
----> 3 s.iloc[132]

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1752, in _iLocIndexer._getitem_axis(self, key, axis)
   1749     raise TypeError("Cannot index by location index with a non-integer key")
   1751 # validate the location
-> 1752 self._validate_integer(key, axis)
   1754 return self.obj._ixs(key, axis=axis)

File /usr/local/lib/python3.11/site-packages/pandas/core/indexing.py:1685, in _iLocIndexer._validate_integer(self, key, axis)
   1683 len_axis = len(self.obj._get_axis(axis))
   1684 if key >= len_axis or key < -len_axis:
-> 1685     raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds
We could prevent these errors using the membership check in:
In [16]:
"Apple" in s
Out[16]:
True
In [17]:
"Snapchat" in s
Out[17]:
False
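The membership check can be wrapped into a small lookup that never raises. This is a minimal sketch, not part of the original notebook: `s_small` and `revenue_or_default` are hypothetical names, and the Series holds only two of the companies above.

```python
import pandas as pd

# Hypothetical two-company Series, for illustration only.
s_small = pd.Series([274515, 200734], index=["Apple", "Samsung"])

def revenue_or_default(series, company, default=None):
    """Return the value for `company`, or `default` when the label is absent."""
    if company in series:          # membership is checked against the index labels
        return series.loc[company]
    return default

print(revenue_or_default(s_small, "Apple"))
print(revenue_or_default(s_small, "Snapchat", default=0))

# Series.get offers the same behaviour in a single call.
print(s_small.get("Snapchat", 0))
```

Note that `in` tests the index labels, not the values, so `274515 in s_small` would be False here.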
Multiple selection:
By index:
In [18]:
s[['Apple', 'Intel', 'Sony']]
Out[18]:
Apple    274515
Intel     77867
Sony      84893
Name: Top Technology Companies by Revenue, dtype: int64
By position:
In [19]:
s.iloc[[0, 5, -1]]
Out[19]:
Apple        274515
Huawei       129184
Panasonic     63191
Name: Top Technology Companies by Revenue, dtype: int64
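Selecting several labels at once fails in the same way as a single lookup if any label is missing (a KeyError in recent pandas versions). The sketch below shows two possible ways around that, using a hypothetical three-company Series `s_small` and a hypothetical `wanted` list; neither comes from the original notebook.

```python
import pandas as pd

# Hypothetical three-company Series, for illustration only.
s_small = pd.Series(
    [274515, 77867, 84893],
    index=["Apple", "Intel", "Sony"],
)

wanted = ["Apple", "Snapchat", "Intel"]  # "Snapchat" is not in the index

# Option 1: keep only the labels that exist, preserving the order of `wanted`.
present = [name for name in wanted if name in s_small.index]
found = s_small.loc[present]
print(found)

# Option 2: reindex keeps every requested label and fills missing ones with NaN.
print(s_small.reindex(wanted))
```

Option 2 changes the dtype to float64, because NaN cannot be stored in an integer Series.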
Activities:
2. Check your knowledge: location by index
Select the revenue of Intel and store it in a variable named intel_revenue:
In [21]:
intel_revenue = s.loc['Intel']
intel_revenue
Out[21]:
77867
3. Check your knowledge: location by position
Select the revenue of the "second to last" element in our series s and store it in a variable named second_to_last:
In [24]:
second_to_last = s.iloc[-2]
second_to_last
Out[24]:
69864
4. Check your knowledge: multiple selection
Use multiple label selection to retrieve the revenues of the companies:
- Samsung
- Dell Technologies
- Panasonic
- Microsoft
In [27]:
sub_series = s.loc[['Samsung','Dell Technologies','Panasonic','Microsoft']]
sub_series
Out[27]:
Samsung              200734
Dell Technologies     92224
Panasonic             63191
Microsoft            143015
Name: Top Technology Companies by Revenue, dtype: int64
Series Attributes and Methods
In [29]:
s.head()
Out[29]:
Apple        274515
Samsung      200734
Alphabet     182527
Foxconn      181945
Microsoft    143015
Name: Top Technology Companies by Revenue, dtype: int64
In [30]:
s.tail()
Out[30]:
Hitachi      82345
Intel        77867
IBM          73620
Tencent      69864
Panasonic    63191
Name: Top Technology Companies by Revenue, dtype: int64
Main Attributes
The underlying data:
In [31]:
s.values
Out[31]:
array([274515, 200734, 182527, 181945, 143015, 129184, 92224, 85965, 84893, 82345, 77867, 73620, 69864, 63191])
The index:
In [32]:
s.index
Out[32]:
Index(['Apple', 'Samsung', 'Alphabet', 'Foxconn', 'Microsoft', 'Huawei', 'Dell Technologies', 'Meta', 'Sony', 'Hitachi', 'Intel', 'IBM', 'Tencent', 'Panasonic'], dtype='object')
The name (if any):
In [33]:
s.name
Out[33]:
'Top Technology Companies by Revenue'
The type associated with the values:
In [34]:
s.dtype
Out[34]:
dtype('int64')
The size of the Series:
In [35]:
s.size
Out[35]:
14
len also works:
In [36]:
len(s)
Out[36]:
14
Statistical methods
In [37]:
s.describe()
Out[37]:
count        14.000000
mean     124420.642857
std       63686.481231
min       63191.000000
25%       78986.500000
50%       89094.500000
75%      172212.500000
max      274515.000000
Name: Top Technology Companies by Revenue, dtype: float64
In [38]:
s.mean()
Out[38]:
124420.64285714286
In [39]:
s.median()
Out[39]:
89094.5
In [40]:
s.std()
Out[40]:
63686.48123135607
In [41]:
s.min(), s.max()
Out[41]:
(63191, 274515)
In [42]:
s.quantile(.75)
Out[42]:
172212.5
In [43]:
s.quantile(.99)
Out[43]:
264923.47
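The quantile results above are not values from the Series itself because, by default, pandas interpolates linearly between the two nearest sorted values. The sketch below reproduces that calculation by hand; `s_rev` repeats the revenue figures without labels, and `linear_quantile` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

# Same revenue figures as the Series above, without the company labels.
s_rev = pd.Series([
    274515, 200734, 182527, 181945, 143015,
    129184, 92224, 85965, 84893, 82345,
    77867, 73620, 69864, 63191,
])

def linear_quantile(series, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(series)
    position = q * (len(ordered) - 1)          # fractional position in the sorted data
    lower = int(position)                      # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)   # index of the value just above
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# For q = 0.75 the position is 0.75 * 13 = 9.75, i.e. three quarters of the
# way from the 10th sorted value (143015) to the 11th (181945).
print(linear_quantile(s_rev, 0.75))
print(s_rev.quantile(0.75))
```

The two printed numbers should agree, which is a quick way to confirm how the default interpolation works.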
Activities
In [44]:
# Run this cell to complete the activity
american_companies = s[[
'Meta', 'IBM', 'Microsoft',
'Dell Technologies', 'Apple', 'Intel', 'Alphabet'
]]
american_companies
Out[44]:
Meta                  85965
IBM                   73620
Microsoft            143015
Dell Technologies     92224
Apple                274515
Intel                 77867
Alphabet             182527
Name: Top Technology Companies by Revenue, dtype: int64
5. What's the average revenue of American Companies?
In [47]:
american_companies.mean()
Out[47]:
132819.0
6. What's the median revenue of American Companies?
In [48]:
american_companies.median()
Out[48]:
92224.0
Sorting Series
Sorting by values or Index
Sorting by values; notice that the result is in ascending order by default:
In [49]:
s.sort_values()
Out[49]:
Panasonic             63191
Tencent               69864
IBM                   73620
Intel                 77867
Hitachi               82345
Sony                  84893
Meta                  85965
Dell Technologies     92224
Huawei               129184
Microsoft            143015
Foxconn              181945
Alphabet             182527
Samsung              200734
Apple                274515
Name: Top Technology Companies by Revenue, dtype: int64
Sorting by index (lexicographically by company name); notice that this is also in ascending order:
In [50]:
s.sort_index()
Out[50]:
Alphabet             182527
Apple                274515
Dell Technologies     92224
Foxconn              181945
Hitachi               82345
Huawei               129184
IBM                   73620
Intel                 77867
Meta                  85965
Microsoft            143015
Panasonic             63191
Samsung              200734
Sony                  84893
Tencent               69864
Name: Top Technology Companies by Revenue, dtype: int64
To sort in descending order:
In [51]:
s.sort_values(ascending=False).head()
Out[51]:
Apple        274515
Samsung      200734
Alphabet     182527
Foxconn      181945
Microsoft    143015
Name: Top Technology Companies by Revenue, dtype: int64
In [52]:
s.sort_index(ascending=False).head()
Out[52]:
Tencent       69864
Sony          84893
Samsung      200734
Panasonic     63191
Microsoft    143015
Name: Top Technology Companies by Revenue, dtype: int64
Activities
7. What company has the largest revenue?
In [57]:
s.sort_values(ascending=False)
Out[57]:
Apple                274515
Samsung              200734
Alphabet             182527
Foxconn              181945
Microsoft            143015
Huawei               129184
Dell Technologies     92224
Meta                  85965
Sony                  84893
Hitachi               82345
Intel                 77867
IBM                   73620
Tencent               69864
Panasonic             63191
Name: Top Technology Companies by Revenue, dtype: int64
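Sorting answers the question by inspection: the first row is the largest. When only the label of the largest value is needed, idxmax returns it directly without sorting. A minimal sketch on a hypothetical three-company Series (`s_small`, `top_company`, and `top_revenue` are names invented here):

```python
import pandas as pd

# Hypothetical three-company Series, for illustration only.
s_small = pd.Series(
    [274515, 200734, 182527],
    index=["Apple", "Samsung", "Alphabet"],
)

top_company = s_small.idxmax()   # label of the largest value
top_revenue = s_small.max()      # the largest value itself
print(top_company, top_revenue)
```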
8. Sort company names lexicographically. Which one comes first?
In [58]:
s.sort_index(ascending=True)
Out[58]:
Alphabet             182527
Apple                274515
Dell Technologies     92224
Foxconn              181945
Hitachi               82345
Huawei               129184
IBM                   73620
Intel                 77867
Meta                  85965
Microsoft            143015
Panasonic             63191
Samsung              200734
Sony                  84893
Tencent               69864
Name: Top Technology Companies by Revenue, dtype: int64
Immutability
Run the sort methods above and check the series again; you'll see that s has NOT changed:
In [59]:
s.head()
Out[59]:
Apple        274515
Samsung      200734
Alphabet     182527
Foxconn      181945
Microsoft    143015
Name: Top Technology Companies by Revenue, dtype: int64
We will now sort the series by revenue in ascending order, this time mutating the original by passing inplace=True. Notice that the method returns None, so the cell produces no output:
In [60]:
s.sort_values(inplace=True)
But now the series is sorted by revenue in ascending order:
In [61]:
s.head()
Out[61]:
Panasonic    63191
Tencent      69864
IBM          73620
Intel        77867
Hitachi      82345
Name: Top Technology Companies by Revenue, dtype: int64
We'll now sort the series by index, mutating it again:
In [62]:
s.sort_index(inplace=True)
In [63]:
s.head()
Out[63]:
Alphabet             182527
Apple                274515
Dell Technologies     92224
Foxconn              181945
Hitachi               82345
Name: Top Technology Companies by Revenue, dtype: int64
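The difference between the two modes can be checked on a tiny Series. This is a sketch with made-up data (`s_small`, `sorted_copy`, and `result` are hypothetical names): without inplace, the sort returns a new Series; with inplace=True, the original is reordered and the call returns None.

```python
import pandas as pd

# Hypothetical three-element Series, for illustration only.
s_small = pd.Series([3, 1, 2], index=["c", "a", "b"])

# Without inplace, a new sorted Series is returned and the original is untouched.
sorted_copy = s_small.sort_values()
print(list(s_small.index))      # still ['c', 'a', 'b']

# With inplace=True the original is reordered and the call returns None.
result = s_small.sort_values(inplace=True)
print(result)                   # None
print(list(s_small.index))      # now ['a', 'b', 'c']
```

A common mistake is writing `s = s.sort_values(inplace=True)`, which replaces the Series with None.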
Activities
9. Sort American Companies by Revenue
In [66]:
american_companies_desc = american_companies.sort_values(ascending=False)
10. Sort (and mutate) international companies
In [70]:
# Run this cell to complete the activity
international_companies = s[[
"Sony", "Tencent", "Panasonic",
"Samsung", "Hitachi", "Foxconn", "Huawei"
]]
international_companies.sort_values(ascending=False, inplace=True)
In [72]:
international_companies
Out[72]:
Samsung      200734
Foxconn      181945
Huawei       129184
Sony          84893
Hitachi       82345
Tencent       69864
Panasonic     63191
Name: Top Technology Companies by Revenue, dtype: int64
Modifying series
Modifying values:
In [73]:
s['IBM'] = 0
In [74]:
s.sort_values().head()
Out[74]:
IBM              0
Panasonic    63191
Tencent      69864
Intel        77867
Hitachi      82345
Name: Top Technology Companies by Revenue, dtype: int64
Adding elements:
In [75]:
s['Tesla'] = 21450
In [76]:
s.sort_values().head()
Out[76]:
IBM              0
Tesla        21450
Panasonic    63191
Tencent      69864
Intel        77867
Name: Top Technology Companies by Revenue, dtype: int64
Removing elements:
In [77]:
del s['Tesla']
In [78]:
s.sort_values().head()
Out[78]:
IBM              0
Panasonic    63191
Tencent      69864
Intel        77867
Hitachi      82345
Name: Top Technology Companies by Revenue, dtype: int64
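The del statement mutates the Series. If the original should stay intact, the drop method is one alternative: it returns a new Series without the given label. A minimal sketch on hypothetical data (`s_small` and `without_tesla` are names invented for this example):

```python
import pandas as pd

# Hypothetical two-company Series, for illustration only.
s_small = pd.Series([274515, 21450], index=["Apple", "Tesla"])

# drop returns a new Series without the label; s_small keeps both entries.
without_tesla = s_small.drop("Tesla")
print(list(without_tesla.index))
print(list(s_small.index))
```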
Activities
11. Insert Amazon's Revenue
In [81]:
s['Amazon']=469_822
12. Delete the revenue of Meta
In [83]:
del s['Meta']
Concatenating Series
We can append one series to another using the pd.concat() function:
In [85]:
another_s = pd.Series([21_450, 4_120], index=['Tesla', 'Snapchat'])
In [86]:
another_s
Out[86]:
Tesla       21450
Snapchat     4120
dtype: int64
In [87]:
s_new = pd.concat([s, another_s])
The original series s is not modified:
In [88]:
s
Out[88]:
Alphabet             182527
Apple                274515
Dell Technologies     92224
Foxconn              181945
Hitachi               82345
Huawei               129184
IBM                       0
Intel                 77867
Microsoft            143015
Panasonic             63191
Samsung              200734
Sony                  84893
Tencent               69864
Amazon               469822
Name: Top Technology Companies by Revenue, dtype: int64
s_new is the concatenation of s and another_s:
In [89]:
s_new
Out[89]:
Alphabet             182527
Apple                274515
Dell Technologies     92224
Foxconn              181945
Hitachi               82345
Huawei               129184
IBM                       0
Intel                 77867
Microsoft            143015
Panasonic             63191
Samsung              200734
Sony                  84893
Tencent               69864
Amazon               469822
Tesla                 21450
Snapchat               4120
dtype: int64
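Notice that s_new has no Name line in its output, even though s has one. When the inputs to pd.concat carry different names (here another_s has none), the result is left unnamed. The sketch below illustrates this on made-up data; `first`, `unnamed`, and `same_name` are hypothetical Series created only for this example.

```python
import pandas as pd

# Hypothetical Series, for illustration only.
first = pd.Series([1, 2], index=["a", "b"], name="Revenue")
unnamed = pd.Series([3], index=["c"])
same_name = pd.Series([3], index=["c"], name="Revenue")

# Names differ, so the result has no name.
print(pd.concat([first, unnamed]).name)

# Names agree, so the name is kept.
print(pd.concat([first, same_name]).name)
```

To keep the name in the notebook's example, one option would be to create another_s with the same name as s, or to set the name on the result afterwards.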