도은 김 has successfully completed this project.

Data Cleaning with Pandas

medium

4.32

Cleaning Duplicate Data in Military Equipment Records

Finished

November 12, 2024 1:37 AM

Elapsed time (min)

Completed activities

Resolution

Activities

Cleaning Duplicate Data in Military Equipment Records¶

In [1]:

import numpy as np
import pandas as pd

In [2]:

df = pd.read_csv('data/military_equipment_inventory.csv')

In [3]:

df.head()

Out[3]:

	Equipment_ID	Type	Model	Manufacturer	Year_Acquired	Status	Current_Base	Last_Maintenance	Operational_Cost
0	EQ-0001	Armored Vehicle	Model-530	Northrop Grumman	2017	Decommissioned	Outpost Delta	2020-01-01 00:00:00.000000000	147080
1	EQ-0002	Artillery	Model-408	General Dynamics	1997	In Transit	Station Echo	2020-01-02 10:16:32.375366568	804419
2	EQ-0003	Helicopter	Model-236	General Dynamics	1991	Active	Fort Alpha	2020-01-03 20:33:04.750733137	990036
3	EQ-0004	Artillery	Model-301	Northrop Grumman	2011	In Maintenance	Camp Bravo	2020-01-05 06:49:37.126099706	106502
4	EQ-0005	Artillery	Model-535	General Dynamics	2005	Active	Base Charlie	2020-01-06 17:06:09.501466275	930836

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1624 entries, 0 to 1623
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Equipment_ID      1624 non-null   object
 1   Type              1624 non-null   object
 2   Model             1624 non-null   object
 3   Manufacturer      1624 non-null   object
 4   Year_Acquired     1624 non-null   int64 
 5   Status            1624 non-null   object
 6   Current_Base      1624 non-null   object
 7   Last_Maintenance  1624 non-null   object
 8   Operational_Cost  1624 non-null   int64 
dtypes: int64(2), object(7)
memory usage: 114.3+ KB

1. Count exact duplicate rows¶

In [5]:

# Find and count exact duplicates
df.duplicated().sum()

Out[5]:

2. Find `Equipment_ID` with multiple entries¶

In [9]:

# Find duplicate Equipment_ID
duplicate_ids = df['Equipment_ID'].value_counts()[lambda x : x > 1]

3. Identify Duplicate Equipment Entries by `Equipment_ID`¶

In [15]:

# Count and remove duplicate Equipment_ID
duplicates = df['Equipment_ID'].duplicated().sum()

df_cleaned  = df.drop_duplicates(subset = 'Equipment_ID' , keep = 'first')

4. How many duplicates are there?¶

In [ ]:

5. Normalize `Status` and deduplicate rows¶

In [20]:

# Standardize Status values and remove duplicates
df['Status'] = df['Status'].str.lower().str.strip()
df_cleaned = df.drop_duplicates()

6. Calculate `Manufacturer` percentages in `Type`-`Model`-`Year_Acquired` duplicates¶

In [22]:

# Find duplicate rows
duplicates = df[df.duplicated(subset = ['Type','Model','Year_Acquired'] , keep = False)]
# Count manufacturers in duplicates
mfg_counts = duplicates['Manufacturer'].value_counts()
# Calculate percentage of each manufacturer

mfg_pct = (mfg_counts / mfg_counts.sum() * 100).round(2)

7. Which pandas method is used to identify duplicate rows?¶

In [ ]:

8. How can you check for duplicates in a specific column?¶

In [ ]:

9. Identify similar entries excluding `Operational_Cost` and `Last_Maintenance`¶

In [24]:

# Select columns for comparison
columns_to_compare = df.columns.drop(['Operational_Cost','Last_Maintenance'])
# Find near-duplicates

near_duplicates = df[df.duplicated(subset = columns_to_compare , keep = False)]
near_duplicates

Out[24]:

	Equipment_ID	Type	Model	Manufacturer	Year_Acquired	Status	Current_Base	Last_Maintenance	Operational_Cost
3	EQ-0004	Artillery	Model-301	Northrop Grumman	2011	in maintenance	Camp Bravo	2020-01-05 06:49:37.126099706	106502
4	EQ-0005	Artillery	Model-535	General Dynamics	2005	active	Base Charlie	2020-01-06 17:06:09.501466275	930836
10	EQ-0011	Armored Vehicle	Model-343	Raytheon	2010	in maintenance	Camp Bravo	2020-01-15 06:45:23.753665689	701804
18	EQ-0019	Tank	Model-517	General Dynamics	2006	decommissioned	Base Charlie	2020-01-26 16:57:42.756598240	506232
27	EQ-0028	Aircraft	Model-124	Boeing	2004	active	Outpost Delta	2020-02-08 13:26:34.134897361	856090
...	...	...	...	...	...	...	...	...	...
1519	EQ-0201	Aircraft	Model-128	Raytheon	2006	active	Camp Bravo	2020-10-15 15:07:55.073313784	737376
1520	EQ-0849	Tank	Model-307	Northrop Grumman	2017	decommissioned	Outpost Delta	2023-04-28 01:45:34.310850448	894332
1521	EQ-0509	Artillery	Model-259	Lockheed Martin	1993	in maintenance	Station Echo	2021-12-21 12:02:06.686217016	655528
1522	EQ-0322	Aircraft	Model-417	Boeing	2008	in maintenance	Base Charlie	2021-04-04 10:29:12.492668624	823346
1523	EQ-0275	Helicopter	Model-850	Raytheon	2002	decommissioned	Base Charlie	2021-01-23 07:31:50.850439884	617777

644 rows × 9 columns

10. Select most recent `Last_Maintenance` record for each equipment¶

In [27]:

# Define columns for comparison
comparison_cols = ['Equipment_ID', 'Type', 'Model', 'Manufacturer', 'Year_Acquired', 'Status', 'Current_Base']
# Sort the DataFrame by comparison columns and 'Last_Maintenance', then keep the last occurrence
resolved_maintenance  = near_duplicates.sort_values(comparison_cols + ['Last_Maintenance'])
resolved_maintenance = resolved_maintenance.drop_duplicates(subset = comparison_cols, keep = 'last').reset_index(drop = True)

11. When handling near-duplicates by keeping the most recent entry, which operation is typically performed first?¶

In [ ]:

comparison_cols = ['Equipment_ID', 'Type', 'Model', 'Manufacturer', 'Year_Acquired', 'Status', 'Current_Base']

12. What's a potential downside of always keeping the lower `Operational_Cost` entry when resolving near-duplicates?¶

In [ ]:

13. Identify equipment with identical `Type` and `Model`¶

In [29]:

# Filter duplicate rows based on 'Type' and 'Mode
hidden_duplicates = df[df.duplicated(subset = ['Type','Model'] , keep = False)]

14. Count and rank `Type`, `Model` combinations in duplicates¶

In [34]:

# Count and sort occurrences of 'Type' and 'Model'type_model_counts 
type_model_counts = hidden_duplicates.value_counts(subset = ['Type','Model']).sort_values(ascending = False)

15. Which pandas method is most useful for creating a hash of multiple columns?¶

In [ ]:

16. What's a potential risk of always keeping the entry with the lowest Equipment_ID when resolving hidden duplicates?¶

In [ ]:

Save your dataset. You can use the following code snippets depending on the file format you want.

In [70]:

# Save to CSV
# df.to_csv('df_cleaned.csv', index=False)

Statement of Completion#e2ee3a12

Data Cleaning with Pandas

Cleaning Duplicate Data in Military Equipment Records

Cleaning Duplicate Data in Military Equipment Records¶

1. Count exact duplicate rows¶

2. Find Equipment_ID with multiple entries¶

3. Identify Duplicate Equipment Entries by Equipment_ID¶

4. How many duplicates are there?¶

5. Normalize Status and deduplicate rows¶

6. Calculate Manufacturer percentages in Type-Model-Year_Acquired duplicates¶

7. Which pandas method is used to identify duplicate rows?¶

8. How can you check for duplicates in a specific column?¶

9. Identify similar entries excluding Operational_Cost and Last_Maintenance¶

10. Select most recent Last_Maintenance record for each equipment¶

11. When handling near-duplicates by keeping the most recent entry, which operation is typically performed first?¶

12. What's a potential downside of always keeping the lower Operational_Cost entry when resolving near-duplicates?¶

13. Identify equipment with identical Type and Model¶

14. Count and rank Type, Model combinations in duplicates¶

15. Which pandas method is most useful for creating a hash of multiple columns?¶

16. What's a potential risk of always keeping the entry with the lowest Equipment_ID when resolving hidden duplicates?¶

2. Find `Equipment_ID` with multiple entries¶

3. Identify Duplicate Equipment Entries by `Equipment_ID`¶

5. Normalize `Status` and deduplicate rows¶

6. Calculate `Manufacturer` percentages in `Type`-`Model`-`Year_Acquired` duplicates¶

9. Identify similar entries excluding `Operational_Cost` and `Last_Maintenance`¶

10. Select most recent `Last_Maintenance` record for each equipment¶

12. What's a potential downside of always keeping the lower `Operational_Cost` entry when resolving near-duplicates?¶

13. Identify equipment with identical `Type` and `Model`¶

14. Count and rank `Type`, `Model` combinations in duplicates¶