Statement of Completion#4b003585
Introduction to Supervised Learning with scikit-learn
easy
Dealing with duplicated data
Resolution
Activities
Project.ipynb
In [1]:
import pandas as pd
import numpy as np
2) Here's an example of Python code that identifies and handles duplicate entries in a dataset using the pandas library.
In [2]:
import pandas as pd
# Sample dataset with duplicate entries
data = {
    'Customer_ID': [1, 2, 3, 4, 1, 5, 2, 6],
    'Product_ID': ['A', 'B', 'C', 'D', 'A', 'E', 'B', 'F'],
    'Review_Text': ['Great product', 'Good product', 'Average', 'Not satisfied', 'Great product', 'Excellent', 'Good product', 'Very satisfied'],
    'Rating': [5, 4, 3, 2, 5, 5, 4, 5]
}
df = pd.DataFrame(data)
# Identifying and handling duplicates
duplicate_rows = df[df.duplicated(['Customer_ID', 'Product_ID', 'Review_Text'], keep='first')]
df_cleaned = df.drop_duplicates(['Customer_ID', 'Product_ID', 'Review_Text'], keep='first')
# Displaying the original and cleaned datasets
print("Original Dataset:")
print(df)
print("\nDuplicate Rows:")
print(duplicate_rows)
print("\nCleaned Dataset:")
print(df_cleaned)
Original Dataset:
   Customer_ID  Product_ID     Review_Text  Rating
0            1           A   Great product       5
1            2           B    Good product       4
2            3           C         Average       3
3            4           D   Not satisfied       2
4            1           A   Great product       5
5            5           E       Excellent       5
6            2           B    Good product       4
7            6           F  Very satisfied       5

Duplicate Rows:
   Customer_ID  Product_ID    Review_Text  Rating
4            1           A  Great product       5
6            2           B   Good product       4

Cleaned Dataset:
   Customer_ID  Product_ID     Review_Text  Rating
0            1           A   Great product       5
1            2           B    Good product       4
2            3           C         Average       3
3            4           D   Not satisfied       2
5            5           E       Excellent       5
7            6           F  Very satisfied       5
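The `keep` argument also accepts `False`, which flags every member of a duplicate group rather than only the later copies. The following is a minimal sketch on a smaller toy frame; the names `reviews`, `key_columns`, `later_copies`, and `all_copies` are illustrative and not part of the original notebook.

```python
import pandas as pd

reviews = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 1, 2],
    'Product_ID': ['A', 'B', 'C', 'A', 'B'],
    'Rating': [5, 4, 3, 5, 4],
})

key_columns = ['Customer_ID', 'Product_ID']

# keep='first' flags only the later copies of each duplicated key
later_copies = reviews[reviews.duplicated(key_columns, keep='first')]

# keep=False flags every row that belongs to a duplicate group
all_copies = reviews[reviews.duplicated(key_columns, keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

Inspecting `all_copies` can help when you want to compare the copies side by side before deciding which one to keep.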
3) Let's import the dataset Calls_for_Service_2015.csv and explore it. Check the structure of the data by inspecting the number of rows, columns, and attributes of the dataset. You can use attributes and methods like .shape, .head(), .info(), and .describe() to get a quick overview of the data.
In [3]:
df = pd.read_csv('Calls_for_Service_2015.csv')
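As a rough sketch of the overview step, the snippet below runs the four inspection calls on a three-row stand-in for the CSV, so it works without the real file. The inline `csv_text` and the name `calls` are assumptions made for illustration; only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Stand-in for the real CSV: three rows using a few of the dataset's column names
csv_text = (
    "NOPD_Item,TypeText,Zip\n"
    "A0001015,DISCHARGING FIREARM,70125\n"
    "A0001115,DISCHARGING FIREARM,70119\n"
    "A0001515,FIREWORKS,\n"
)
calls = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = calls.shape   # attribute, not a method: (rows, columns)
print(n_rows, n_cols)
print(calls.head())            # first rows of the frame
calls.info()                   # dtypes and non-null counts (prints directly)
print(calls.describe())        # summary statistics for numeric columns
```

On the real file, the same calls applied to `df` give the row and column counts, the column types, and which columns contain missing values.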
4) Check whether any records in the dataset are duplicated.
In [5]:
df.duplicated().sum()
Out[5]:
9
5) Let's check for duplicates in a specific column of a given DataFrame.
In [ ]:
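One way to approach this, sketched on a toy frame rather than the real dataset: passing a column name to `duplicated` restricts the comparison to that column. The frame `calls` and its values are made up for illustration.

```python
import pandas as pd

calls = pd.DataFrame({
    'NOPD_Item': ['A1', 'A2', 'A3', 'A4'],
    'TypeText': ['FIREWORKS', 'FIREWORKS', 'COMPLAINT OTHER', 'FIREWORKS'],
})

# Only the named column is compared; other columns are ignored
dup_in_type = calls.duplicated('TypeText')
n_dup_type = int(dup_in_type.sum())
n_dup_item = int(calls.duplicated('NOPD_Item').sum())

print(dup_in_type.tolist())
print(n_dup_type, n_dup_item)
```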
6) Your task is to drop duplicate rows from a given DataFrame and retain the first occurrence of each duplicated row. Select the correct code.
In [ ]:
Strip Solutions.ipynb
In [12]:
import re
import json
from pathlib import Path
In [13]:
pattern = re.compile(r"(#*)\s*")
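It may help to see what this expression accepts. Because `#*` can match zero characters, `pattern.match` succeeds on any string, heading or not. The sketch below demonstrates that, and contrasts it with a stricter variant, `heading_only`, which is a hypothetical alternative and not something the notebook uses.

```python
import re

pattern = re.compile(r"(#*)\s*")   # same expression, written as a raw string

# '#*' may match zero characters, so match() succeeds on every input
for text in ["## Heading", "plain text", ""]:
    m = pattern.match(text)
    print(repr(text), m is not None, repr(m.group(1)))

# Hypothetical stricter variant (an assumption, not the notebook's pattern):
# requires at least one '#' followed by whitespace
heading_only = re.compile(r"#+\s+")
print(bool(heading_only.match("## Heading")), bool(heading_only.match("plain text")))
```

In the stripping loop this means that any non-empty markdown cell ends a cut region, not only cells that start with a heading.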
In [14]:
def is_control_cell(source):
    control_lines = ["solution", "assertion"]
    for control_line in control_lines:
        if source.lower().startswith(control_line):
            return True
    return False
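A quick usage sketch of the helper, restated compactly so the snippet runs on its own. The sample strings are made up; the point is that the check is case-insensitive and only looks at the very start of the text.

```python
def is_control_cell(source):
    # Equivalent restatement of the helper above
    control_lines = ["solution", "assertion"]
    return any(source.lower().startswith(prefix) for prefix in control_lines)

print(is_control_cell("Solution:"))      # starts with 'solution' (case-insensitive)
print(is_control_cell("ASSERTIONS\n"))   # starts with 'assertion'
print(is_control_cell("## Solution"))    # starts with '#', so it is not a control cell
```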
Configurations:
In [19]:
CHECK_OVERWRITE = True
SOLUTION_NOTEBOOK_NAME = 'Solution.ipynb'
NEW_NOTEBOOK_NAME = 'Project.ipynb'
In [27]:
assert Path(SOLUTION_NOTEBOOK_NAME).exists(), f"The solution notebook '{SOLUTION_NOTEBOOK_NAME}' doesn't exist"
with open(SOLUTION_NOTEBOOK_NAME) as fp:
    notebook = json.load(fp)
cells = notebook['cells']
new_cells = []
control_cut = False
for cell in cells:
    cell_type = cell['cell_type']
    if control_cut:
        if cell_type == 'markdown' and cell['source'] and pattern.match(cell['source'][0]):
            control_cut = False
        else:
            continue
    if cell_type == 'markdown' and cell['source']:
        if is_control_cell(cell['source'][0]):
            control_cut = True
            continue
    new_cells.append(cell)
notebook['cells'] = new_cells
if Path(NEW_NOTEBOOK_NAME).exists():
    answer = input(f"You're about to overwrite {NEW_NOTEBOOK_NAME}. Are you sure? (y/N)") or 'n'
    if answer.lower() != 'y':
        assert False, "Cancelled."
with open(NEW_NOTEBOOK_NAME, 'w') as fp:
    json.dump(notebook, fp, indent=2)
print(f"\nNew notebook saved in: {NEW_NOTEBOOK_NAME}")
New notebook saved in: Project.ipynb
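To see the filtering logic without touching any files, the same loop can be wrapped in a function and run on a hand-built list of cells. This is a sketch under assumptions: `strip_solutions` is a hypothetical helper name, and the toy cells only carry the two keys the loop reads (`cell_type` and `source`).

```python
import re

pattern = re.compile(r"(#*)\s*")

def is_control_cell(source):
    return source.lower().startswith(("solution", "assertion"))

def strip_solutions(cells):
    """Return the cells that survive the cut logic used in the loop above."""
    kept, control_cut = [], False
    for cell in cells:
        is_markdown = cell['cell_type'] == 'markdown' and bool(cell['source'])
        if control_cut:
            # A non-empty markdown cell ends the cut; anything else is dropped
            if is_markdown and pattern.match(cell['source'][0]):
                control_cut = False
            else:
                continue
        if is_markdown and is_control_cell(cell['source'][0]):
            control_cut = True   # drop this cell and start cutting
            continue
        kept.append(cell)
    return kept

toy_cells = [
    {'cell_type': 'markdown', 'source': ['1) Task']},
    {'cell_type': 'markdown', 'source': ['Solution:']},
    {'cell_type': 'code', 'source': ['df.duplicated().sum()']},
    {'cell_type': 'markdown', 'source': ['2) Next task']},
    {'cell_type': 'code', 'source': ['x = 1']},
]

kept = strip_solutions(toy_cells)
print([cell['source'][0] for cell in kept])
```

The "Solution:" cell and the code cell after it are dropped, and copying resumes at the next markdown cell.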
In [ ]:
Solutions.ipynb
In [ ]:
In [71]:
import pandas as pd
In [72]:
df = pd.read_csv('Calls_for_Service_2015.csv')
In [64]:
df.head()
Out[64]:
| | NOPD_Item | Type_ | TypeText | Priority | InitialType | InitialTypeText | InitialPriority | MapX | MapY | TimeCreate | ... | TimeArrive | TimeClosed | Disposition | DispositionText | SelfInitiated | Beat | BLOCK_ADDRESS | Zip | PoliceDistrict | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0001015 | 94 | DISCHARGING FIREARM | 2B | 94 | DISCHARGING FIREARM | 2B | 3668626 | 532154 | 01/01/2015 12:03:43 AM | ... | 01/01/2015 12:09:18 AM | 01/01/2015 12:15:09 AM | UNF | UNFOUNDED | N | 2U03 | College Ct & Earhart Blvd | 70125.0 | 2 | (29.95762393, -90.10869788) |
1 | A0001115 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3668441 | 538345 | 01/01/2015 12:03:59 AM | ... | NaN | 01/01/2015 03:30:05 AM | UNF | UNFOUNDED | N | 3D02 | Baudin St & S Olympia St | 70119.0 | 3 | (29.97465231, -90.10907539) |
2 | A0001215 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3684578 | 550848 | 01/01/2015 12:04:15 AM | ... | NaN | 01/01/2015 03:30:06 AM | UNF | UNFOUNDED | N | 3T01 | 049XX Mandeville St | 70122.0 | 3 | (30.00854815, -90.05766993) |
3 | A0001415 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3677064 | 545305 | 01/01/2015 12:05:08 AM | ... | NaN | 01/01/2015 03:30:06 AM | UNF | UNFOUNDED | N | 3M01 | Sere St & Cadillac St | 70122.0 | 3 | (29.99353541, -90.08160198) |
4 | A0001515 | 94F | FIREWORKS | 1A | 94F | FIREWORKS | 1A | 3687469 | 540996 | 01/01/2015 12:05:37 AM | ... | NaN | 01/01/2015 12:34:59 AM | NAT | Necessary Action Taken | N | 5L02 | 023XX Franklin Ave | 70117.0 | 5 | (29.98137127, -90.04888436) |
5 rows × 21 columns
In [56]:
df_1 = df.iloc[1:10, :]
df_1
Out[56]:
| | NOPD_Item | Type_ | TypeText | Priority | InitialType | InitialTypeText | InitialPriority | MapX | MapY | TimeCreate | ... | TimeArrive | TimeClosed | Disposition | DispositionText | SelfInitiated | Beat | BLOCK_ADDRESS | Zip | PoliceDistrict | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | A0001015 | 94 | DISCHARGING FIREARM | 2B | 94 | DISCHARGING FIREARM | 2B | 3668626 | 532154 | 01/01/2015 12:03:43 AM | ... | 01/01/2015 12:09:18 AM | 01/01/2015 12:15:09 AM | UNF | UNFOUNDED | N | 2U03 | College Ct & Earhart Blvd | 70125.0 | 2 | (29.95762393, -90.10869788) |
2 | A0001115 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3668441 | 538345 | 01/01/2015 12:03:59 AM | ... | NaN | 01/01/2015 03:30:05 AM | UNF | UNFOUNDED | N | 3D02 | Baudin St & S Olympia St | 70119.0 | 3 | (29.97465231, -90.10907539) |
3 | A0001215 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3684578 | 550848 | 01/01/2015 12:04:15 AM | ... | NaN | 01/01/2015 03:30:06 AM | UNF | UNFOUNDED | N | 3T01 | 049XX Mandeville St | 70122.0 | 3 | (30.00854815, -90.05766993) |
4 | A0001415 | 94 | DISCHARGING FIREARM | 1A | 94 | DISCHARGING FIREARM | 2B | 3677064 | 545305 | 01/01/2015 12:05:08 AM | ... | NaN | 01/01/2015 03:30:06 AM | UNF | UNFOUNDED | N | 3M01 | Sere St & Cadillac St | 70122.0 | 3 | (29.99353541, -90.08160198) |
5 | A0001515 | 94F | FIREWORKS | 1A | 94F | FIREWORKS | 1A | 3687469 | 540996 | 01/01/2015 12:05:37 AM | ... | NaN | 01/01/2015 12:34:59 AM | NAT | Necessary Action Taken | N | 5L02 | 023XX Franklin Ave | 70117.0 | 5 | (29.98137127, -90.04888436) |
6 | A0001615 | 62A | BURGLAR ALARM, SILENT | 2C | 62A | BURGLAR ALARM, SILENT | 2C | 3691088 | 534199 | 01/01/2015 12:05:51 AM | ... | 01/01/2015 12:23:51 AM | 01/01/2015 12:41:38 AM | NAT | Necessary Action Taken | N | 5D01 | 038XX Dauphine St | 70117.0 | 5 | (29.96256866, -90.03769629) |
7 | A0001715 | 21 | COMPLAINT OTHER | 1H | 21 | COMPLAINT OTHER | 1H | 3679606 | 522244 | 01/01/2015 12:06:04 AM | ... | NaN | 01/01/2015 12:33:12 AM | NAT | Necessary Action Taken | N | 6B03 | 020XX Constance St | 70130.0 | 6 | (29.9300485, -90.0743713) |
8 | A0001815 | 94 | DISCHARGING FIREARM | 2B | 94 | DISCHARGING FIREARM | 2B | 3683357 | 551103 | 01/01/2015 12:06:10 AM | ... | 01/01/2015 12:22:21 AM | 01/01/2015 12:30:46 AM | UNF | UNFOUNDED | N | 3Q02 | 050XX Frenchmen St | 70122.0 | 3 | (30.00928759, -90.06151892) |
9 | A0001915 | 94 | DISCHARGING FIREARM | 2B | 94 | DISCHARGING FIREARM | 2B | 3695450 | 557410 | 01/01/2015 12:06:43 AM | ... | 01/01/2015 12:26:00 AM | 01/01/2015 12:58:18 AM | UNF | UNFOUNDED | N | 7D01 | 060XX Morrison Rd | 70126.0 | 7 | (30.02625512, -90.02308585) |
9 rows × 21 columns
In [57]:
df_3 = df.iloc[1:10000, :]
In [58]:
df_2 = pd.concat([df_1, df_3], ignore_index=True)
In [ ]:
df_2.head()
In [60]:
df_2.to_csv('Calls_for_Service_2015.csv',index=False)
In [49]:
df_2.shape
Out[49]:
(10008, 21)
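The cells above build the exercise file by prepending a 9-row slice to a larger slice that already contains those rows, which is where the 9 fully duplicated records come from. A scaled-down sketch of the same construction follows; the frame `base` and the slice sizes are made up for illustration.

```python
import pandas as pd

base = pd.DataFrame({'id': range(100), 'value': range(100)})

head_copy = base.iloc[1:10, :]   # 9 rows (positions 1 through 9)
bulk = base.iloc[1:50, :]        # 49 rows, which include those same 9
combined = pd.concat([head_copy, bulk], ignore_index=True)

print(combined.shape)
print(int(combined.duplicated().sum()))
```

The row count is the sum of the two slices (9 + 49 here, 9 + 9999 in the notebook), and the number of duplicated rows equals the size of the repeated slice.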
In [65]:
df.duplicated().sum()
Out[65]:
9
In [66]:
df.duplicated('TypeText').sum()
Out[66]:
9902
In [73]:
df.duplicated('NOPD_Item').sum()
Out[73]:
9
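The large count for `TypeText` does not by itself indicate bad records: a per-column duplicate count is the number of rows minus the number of distinct values, so a categorical column with few call types will always score high. The sketch below checks that relationship on a toy frame; the data is made up, and treating `NOPD_Item` as a per-call identifier is an assumption based on how the column looks.

```python
import pandas as pd

calls = pd.DataFrame({
    'NOPD_Item': ['A1', 'A2', 'A3', 'A1', 'A4'],
    'TypeText': ['FIREWORKS', 'FIREWORKS', 'COMPLAINT OTHER', 'FIREWORKS', None],
})

counts = {}
for column in calls.columns:
    n_dup = int(calls[column].duplicated().sum())
    # dropna=False so that missing values count as one distinct value,
    # matching how duplicated() treats them
    n_unique = calls[column].nunique(dropna=False)
    counts[column] = (n_dup, len(calls) - n_unique)
    print(column, n_dup, len(calls) - n_unique)
```

A duplicate count on an identifier-like column is usually the more meaningful signal that records were entered more than once.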
In [68]:
df.drop_duplicates(keep='first', inplace=True)
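With `inplace=True` the call modifies the frame and returns `None`, so its result should not be assigned. A small sketch contrasting the two forms on a toy frame (`df_demo`, `deduped`, and `result` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# Non-mutating form: returns a new frame and leaves df_demo untouched;
# reset_index(drop=True) renumbers the surviving rows from 0
deduped = df_demo.drop_duplicates(keep='first').reset_index(drop=True)

# In-place form: modifies df_demo and returns None
result = df_demo.drop_duplicates(keep='first', inplace=True)

print(len(deduped), len(df_demo), result)
print(df_demo.index.tolist())   # original labels are kept, leaving a gap
```

Either form removes the later copies; the reassignment style makes it easier to keep the original frame around for comparison.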
In [ ]: