Saratean Tudor has successfully completed this project.

Data Cleaning with Pandas

medium

4.68

Matching Strings by Similarity using Levenshtein distance

Finished

May 24, 2025 9:43 PM

Elapsed time (min)

Completed activities

Resolution

Activities

Project.ipynb

Notebook

In [1]:

import itertools
import pandas as pd

# The new library!
from thefuzz import fuzz, process

In [2]:

df1 = pd.read_csv('companies_1.csv')
df2 = pd.read_csv('companies_2.csv')

Data Preprocessing¶

1. Create the `df` dataframe containing the product of the two CSVs¶

In [3]:

df = pd.DataFrame(itertools.product(df1['CLIENT'].values, df2['Firm Name'].values), columns = ['CSV 1', 'CSV 2'])

In [4]:

df.head()

Out[4]:

	CSV 1	CSV 2
0	Adobe Systems, Inc.	AAA Northern California, Nevada & Utah Auto Ex...
1	Adobe Systems, Inc.	ACCO Engineered Systems
2	Adobe Systems, Inc.	Adams County Retirement Plan
3	Adobe Systems, Inc.	Adidas America, Inc.
4	Adobe Systems, Inc.	Adobe Systems, Inc.

Calculating the Levenshtein distance¶

Now, we will learn how to calculate the Levenshtein distance between two strings. Here we will user partial_ratio function from the fuzz module to compute the "ratio" between two strings. The result is a number between 0 and 100, with 100 indicating a "perfect" match. Please note that partial_ratio gives ratio of the shortest string length to the longest string length. For example, if the first string is ABC and the second string is ABDC, then the ratio will be 4/5 = 0.80.

In [18]:

fuzz.partial_ratio("Apple", "Apple Inc.")

Out[18]:

In [19]:

fuzz.partial_ratio("Microsoft", "Apple Inc.")

Out[19]:

In [19]:

fuzz.partial_ratio("Microsoft", "MSFT")

Out[19]:

If we have list of strings, we can calculate the Levenshtein distance between each pair of strings in the list.

In [20]:

A = ["Apple", "Alphabet", "Microsoft"]
B = ["MSFT", "Alphabet/Google", "Apple inc."]

Below, we combined the two list A and B into a list of tuples companies using product function from itertools module.

Then, we calculated the partial ratio for each pair of strings in the list companies using partial_ratio function from fuzz.

In [21]:

companies = list(itertools.product(A, B))
companies

Out[21]:

[('Apple', 'MSFT'),
 ('Apple', 'Alphabet/Google'),
 ('Apple', 'Apple inc.'),
 ('Alphabet', 'MSFT'),
 ('Alphabet', 'Alphabet/Google'),
 ('Alphabet', 'Apple inc.'),
 ('Microsoft', 'MSFT'),
 ('Microsoft', 'Alphabet/Google'),
 ('Microsoft', 'Apple inc.')]

In [22]:

for c1, c2 in companies:
    ratio = fuzz.partial_ratio(c1, c2)
    print(f"{c1} > {c2}: {ratio}")

Apple > MSFT: 0
Apple > Alphabet/Google: 57
Apple > Apple inc.: 100
Alphabet > MSFT: 0
Alphabet > Alphabet/Google: 100
Alphabet > Apple inc.: 46
Microsoft > MSFT: 40
Microsoft > Alphabet/Google: 29
Microsoft > Apple inc.: 31

You will see the greater the ratio, the more similar the strings are.

2. Create a new column `Ratio Score` that contains the distance for all the rows in `df`¶

In [5]:

score = [fuzz.partial_ratio(c1, c2) for c1, c2 in df.values]

In [6]:

df['Ratio Score'] = score

In [7]:

df.head()

Out[7]:

	CSV 1	CSV 2	Ratio Score
0	Adobe Systems, Inc.	AAA Northern California, Nevada & Utah Auto Ex...	32
1	Adobe Systems, Inc.	ACCO Engineered Systems	64
2	Adobe Systems, Inc.	Adams County Retirement Plan	41
3	Adobe Systems, Inc.	Adidas America, Inc.	50
4	Adobe Systems, Inc.	Adobe Systems, Inc.	100

3. How many rows have a Ratio score of `90` or more?¶

In [11]:

df.loc[
    df['Ratio Score'] >= 90
].shape

Out[11]:

(135, 3)

4. What's the corresponding company in CSV2 to `AECOM` in CSV1?¶

In [15]:

pd.set_option('display.max_colwidth', None)

In [17]:

df.loc[
    (df['CSV 1'] == 'AECOM') &
    (df['Ratio Score'] >= 90)
].head(1)

Out[17]:

	CSV 1	CSV 2	Ratio Score
742	AECOM	AECOM Technology Corporation	100

5. What's the corresponding CSV2 company of Starbucks?¶

In [18]:

df.loc[
    (df['CSV 1'] == 'Starbucks') &
    (df['Ratio Score'] >= 90)
].head(1)

Out[18]:

	CSV 1	CSV 2	Ratio Score
77948	Starbucks	Starbucks Corporation	100

6. Is there a matching company for `Pinnacle West Capital Corporation`?¶

In [20]:

df.loc[
    (df['CSV 1'] == 'Pinnacle West Capital Corporation') &
    (df['Ratio Score'] >= 90)
].head(1)

Out[20]:

	CSV 1	CSV 2	Ratio Score
61130	Pinnacle West Capital Corporation	Ball Corporation	93

7. How many matching companies are there for `County of Los Angeles Deferred Compensation Program`?¶

In [22]:

df.loc[
    (df['CSV 1'] == 'County of Los Angeles Deferred Compensation Program') &
    (df['Ratio Score'] >= 90)
].head()

Out[22]:

	CSV 1	CSV 2	Ratio Score
26206	County of Los Angeles Deferred Compensation Program	City of Los Angeles Deferred Compensation	95
26227	County of Los Angeles Deferred Compensation Program	County of Los Angeles Deferred Compensation Program	100

8. Is there a matching company for `The Queens Health Systems`?¶

In [23]:

df.loc[
    (df['CSV 1'] == 'The Queens Health Systems') &
    (df['Ratio Score'] >= 90)
].head()

Out[23]:

	CSV 1	CSV 2	Ratio Score
84220	The Queens Health Systems	The Queen's Health Systems	96

Statement of Completion#4220fb0c

Data Cleaning with Pandas

Matching Strings by Similarity using Levenshtein distance

Data Preprocessing¶

1. Create the `df` dataframe containing the product of the two CSVs¶

Calculating the Levenshtein distance¶

2. Create a new column `Ratio Score` that contains the distance for all the rows in `df`¶

3. How many rows have a Ratio score of `90` or more?¶

4. What's the corresponding company in CSV2 to `AECOM` in CSV1?¶

5. What's the corresponding CSV2 company of Starbucks?¶

6. Is there a matching company for `Pinnacle West Capital Corporation`?¶

7. How many matching companies are there for `County of Los Angeles Deferred Compensation Program`?¶

8. Is there a matching company for `The Queens Health Systems`?¶

The End!¶

Statement of Completion#4220fb0c

Data Cleaning with Pandas

Matching Strings by Similarity using Levenshtein distance

Data Preprocessing¶

1. Create the df dataframe containing the product of the two CSVs¶

Calculating the Levenshtein distance¶

2. Create a new column Ratio Score that contains the distance for all the rows in df¶

3. How many rows have a Ratio score of 90 or more?¶

4. What's the corresponding company in CSV2 to AECOM in CSV1?¶

5. What's the corresponding CSV2 company of Starbucks?¶

6. Is there a matching company for Pinnacle West Capital Corporation?¶

7. How many matching companies are there for County of Los Angeles Deferred Compensation Program?¶

8. Is there a matching company for The Queens Health Systems?¶

The End!¶

1. Create the `df` dataframe containing the product of the two CSVs¶

2. Create a new column `Ratio Score` that contains the distance for all the rows in `df`¶

3. How many rows have a Ratio score of `90` or more?¶

4. What's the corresponding company in CSV2 to `AECOM` in CSV1?¶

6. Is there a matching company for `Pinnacle West Capital Corporation`?¶

7. How many matching companies are there for `County of Los Angeles Deferred Compensation Program`?¶

8. Is there a matching company for `The Queens Health Systems`?¶