Statement of Completion#4220fb0c
Data Cleaning with Pandas
medium
Matching Strings by Similarity using Levenshtein distance
Resolution
Activities
Project.ipynb
In [1]:
import itertools
import pandas as pd
# The new library!
from thefuzz import fuzz, process
In [2]:
df1 = pd.read_csv('companies_1.csv')
df2 = pd.read_csv('companies_2.csv')
Data Preprocessing¶
1. Create the df
dataframe containing the product of the two CSVs¶
In [3]:
df = pd.DataFrame(itertools.product(df1['CLIENT'].values, df2['Firm Name'].values), columns = ['CSV 1', 'CSV 2'])
In [4]:
df.head()
Out[4]:
CSV 1 | CSV 2 | |
---|---|---|
0 | Adobe Systems, Inc. | AAA Northern California, Nevada & Utah Auto Ex... |
1 | Adobe Systems, Inc. | ACCO Engineered Systems |
2 | Adobe Systems, Inc. | Adams County Retirement Plan |
3 | Adobe Systems, Inc. | Adidas America, Inc. |
4 | Adobe Systems, Inc. | Adobe Systems, Inc. |
Calculating the Levenshtein distance¶
Now, we will learn how to calculate the Levenshtein distance between two strings. Here we will user partial_ratio
function from the fuzz
module to compute the "ratio" between two strings. The result is a number between 0
and 100
, with 100
indicating a "perfect" match. Please note that partial_ratio
gives ratio of the shortest string length to the longest string length. For example, if the first string is ABC
and the second string is ABDC
, then the ratio will be 4/5 = 0.80
.
In [18]:
fuzz.partial_ratio("Apple", "Apple Inc.")
Out[18]:
100
In [19]:
fuzz.partial_ratio("Microsoft", "Apple Inc.")
Out[19]:
18
In [19]:
fuzz.partial_ratio("Microsoft", "MSFT")
Out[19]:
25
If we have list of strings, we can calculate the Levenshtein distance between each pair of strings in the list.
In [20]:
A = ["Apple", "Alphabet", "Microsoft"]
B = ["MSFT", "Alphabet/Google", "Apple inc."]
Below, we combined the two list A
and B
into a list of tuples companies
using product
function from itertools
module.
Then, we calculated the partial ratio for each pair of strings in the list companies
using partial_ratio
function from fuzz
.
In [21]:
companies = list(itertools.product(A, B))
companies
Out[21]:
[('Apple', 'MSFT'), ('Apple', 'Alphabet/Google'), ('Apple', 'Apple inc.'), ('Alphabet', 'MSFT'), ('Alphabet', 'Alphabet/Google'), ('Alphabet', 'Apple inc.'), ('Microsoft', 'MSFT'), ('Microsoft', 'Alphabet/Google'), ('Microsoft', 'Apple inc.')]
In [22]:
for c1, c2 in companies:
ratio = fuzz.partial_ratio(c1, c2)
print(f"{c1} > {c2}: {ratio}")
Apple > MSFT: 0 Apple > Alphabet/Google: 57 Apple > Apple inc.: 100 Alphabet > MSFT: 0 Alphabet > Alphabet/Google: 100 Alphabet > Apple inc.: 46 Microsoft > MSFT: 40 Microsoft > Alphabet/Google: 29 Microsoft > Apple inc.: 31
You will see the greater the ratio, the more similar the strings are.
2. Create a new column Ratio Score
that contains the distance for all the rows in df
¶
In [5]:
score = [fuzz.partial_ratio(c1, c2) for c1, c2 in df.values]
In [6]:
df['Ratio Score'] = score
In [7]:
df.head()
Out[7]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
0 | Adobe Systems, Inc. | AAA Northern California, Nevada & Utah Auto Ex... | 32 |
1 | Adobe Systems, Inc. | ACCO Engineered Systems | 64 |
2 | Adobe Systems, Inc. | Adams County Retirement Plan | 41 |
3 | Adobe Systems, Inc. | Adidas America, Inc. | 50 |
4 | Adobe Systems, Inc. | Adobe Systems, Inc. | 100 |
3. How many rows have a Ratio score of 90
or more?¶
In [11]:
df.loc[
df['Ratio Score'] >= 90
].shape
Out[11]:
(135, 3)
4. What's the corresponding company in CSV2 to AECOM
in CSV1?¶
In [15]:
pd.set_option('display.max_colwidth', None)
In [17]:
df.loc[
(df['CSV 1'] == 'AECOM') &
(df['Ratio Score'] >= 90)
].head(1)
Out[17]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
742 | AECOM | AECOM Technology Corporation | 100 |
5. What's the corresponding CSV2 company of Starbucks?¶
In [18]:
df.loc[
(df['CSV 1'] == 'Starbucks') &
(df['Ratio Score'] >= 90)
].head(1)
Out[18]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
77948 | Starbucks | Starbucks Corporation | 100 |
6. Is there a matching company for Pinnacle West Capital Corporation
?¶
In [20]:
df.loc[
(df['CSV 1'] == 'Pinnacle West Capital Corporation') &
(df['Ratio Score'] >= 90)
].head(1)
Out[20]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
61130 | Pinnacle West Capital Corporation | Ball Corporation | 93 |
7. How many matching companies are there for County of Los Angeles Deferred Compensation Program
?¶
In [22]:
df.loc[
(df['CSV 1'] == 'County of Los Angeles Deferred Compensation Program') &
(df['Ratio Score'] >= 90)
].head()
Out[22]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
26206 | County of Los Angeles Deferred Compensation Program | City of Los Angeles Deferred Compensation | 95 |
26227 | County of Los Angeles Deferred Compensation Program | County of Los Angeles Deferred Compensation Program | 100 |
8. Is there a matching company for The Queens Health Systems
?¶
In [23]:
df.loc[
(df['CSV 1'] == 'The Queens Health Systems') &
(df['Ratio Score'] >= 90)
].head()
Out[23]:
CSV 1 | CSV 2 | Ratio Score | |
---|---|---|---|
84220 | The Queens Health Systems | The Queen's Health Systems | 96 |