Statement of Completion#e2ee3a12
Data Cleaning with Pandas
medium
Cleaning Duplicate Data in Military Equipment Records
Resolution
Activities
Cleaning Duplicate Data in Military Equipment Records¶
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('data/military_equipment_inventory.csv')
In [3]:
df.head()
Out[3]:
Equipment_ID | Type | Model | Manufacturer | Year_Acquired | Status | Current_Base | Last_Maintenance | Operational_Cost | |
---|---|---|---|---|---|---|---|---|---|
0 | EQ-0001 | Armored Vehicle | Model-530 | Northrop Grumman | 2017 | Decommissioned | Outpost Delta | 2020-01-01 00:00:00.000000000 | 147080 |
1 | EQ-0002 | Artillery | Model-408 | General Dynamics | 1997 | In Transit | Station Echo | 2020-01-02 10:16:32.375366568 | 804419 |
2 | EQ-0003 | Helicopter | Model-236 | General Dynamics | 1991 | Active | Fort Alpha | 2020-01-03 20:33:04.750733137 | 990036 |
3 | EQ-0004 | Artillery | Model-301 | Northrop Grumman | 2011 | In Maintenance | Camp Bravo | 2020-01-05 06:49:37.126099706 | 106502 |
4 | EQ-0005 | Artillery | Model-535 | General Dynamics | 2005 | Active | Base Charlie | 2020-01-06 17:06:09.501466275 | 930836 |
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1624 entries, 0 to 1623 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Equipment_ID 1624 non-null object 1 Type 1624 non-null object 2 Model 1624 non-null object 3 Manufacturer 1624 non-null object 4 Year_Acquired 1624 non-null int64 5 Status 1624 non-null object 6 Current_Base 1624 non-null object 7 Last_Maintenance 1624 non-null object 8 Operational_Cost 1624 non-null int64 dtypes: int64(2), object(7) memory usage: 114.3+ KB
1. Count exact duplicate rows¶
In [5]:
# Find and count exact duplicates
df.duplicated().sum()
Out[5]:
177
2. Find Equipment_ID
with multiple entries¶
In [9]:
# Find duplicate Equipment_ID
duplicate_ids = df['Equipment_ID'].value_counts()[lambda x : x > 1]
3. Identify Duplicate Equipment Entries by Equipment_ID
¶
In [15]:
# Count and remove duplicate Equipment_ID
duplicates = df['Equipment_ID'].duplicated().sum()
df_cleaned = df.drop_duplicates(subset = 'Equipment_ID' , keep = 'first')
4. How many duplicates are there?¶
In [ ]:
5. Normalize Status
and deduplicate rows¶
In [20]:
# Standardize Status values and remove duplicates
df['Status'] = df['Status'].str.lower().str.strip()
df_cleaned = df.drop_duplicates()
6. Calculate Manufacturer
percentages in Type
-Model
-Year_Acquired
duplicates¶
In [22]:
# Find duplicate rows
duplicates = df[df.duplicated(subset = ['Type','Model','Year_Acquired'] , keep = False)]
# Count manufacturers in duplicates
mfg_counts = duplicates['Manufacturer'].value_counts()
# Calculate percentage of each manufacturer
mfg_pct = (mfg_counts / mfg_counts.sum() * 100).round(2)
7. Which pandas method is used to identify duplicate rows?¶
In [ ]:
8. How can you check for duplicates in a specific column?¶
In [ ]:
9. Identify similar entries excluding Operational_Cost
and Last_Maintenance
¶
In [24]:
# Select columns for comparison
columns_to_compare = df.columns.drop(['Operational_Cost','Last_Maintenance'])
# Find near-duplicates
near_duplicates = df[df.duplicated(subset = columns_to_compare , keep = False)]
near_duplicates
Out[24]:
Equipment_ID | Type | Model | Manufacturer | Year_Acquired | Status | Current_Base | Last_Maintenance | Operational_Cost | |
---|---|---|---|---|---|---|---|---|---|
3 | EQ-0004 | Artillery | Model-301 | Northrop Grumman | 2011 | in maintenance | Camp Bravo | 2020-01-05 06:49:37.126099706 | 106502 |
4 | EQ-0005 | Artillery | Model-535 | General Dynamics | 2005 | active | Base Charlie | 2020-01-06 17:06:09.501466275 | 930836 |
10 | EQ-0011 | Armored Vehicle | Model-343 | Raytheon | 2010 | in maintenance | Camp Bravo | 2020-01-15 06:45:23.753665689 | 701804 |
18 | EQ-0019 | Tank | Model-517 | General Dynamics | 2006 | decommissioned | Base Charlie | 2020-01-26 16:57:42.756598240 | 506232 |
27 | EQ-0028 | Aircraft | Model-124 | Boeing | 2004 | active | Outpost Delta | 2020-02-08 13:26:34.134897361 | 856090 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1519 | EQ-0201 | Aircraft | Model-128 | Raytheon | 2006 | active | Camp Bravo | 2020-10-15 15:07:55.073313784 | 737376 |
1520 | EQ-0849 | Tank | Model-307 | Northrop Grumman | 2017 | decommissioned | Outpost Delta | 2023-04-28 01:45:34.310850448 | 894332 |
1521 | EQ-0509 | Artillery | Model-259 | Lockheed Martin | 1993 | in maintenance | Station Echo | 2021-12-21 12:02:06.686217016 | 655528 |
1522 | EQ-0322 | Aircraft | Model-417 | Boeing | 2008 | in maintenance | Base Charlie | 2021-04-04 10:29:12.492668624 | 823346 |
1523 | EQ-0275 | Helicopter | Model-850 | Raytheon | 2002 | decommissioned | Base Charlie | 2021-01-23 07:31:50.850439884 | 617777 |
644 rows × 9 columns
10. Select most recent Last_Maintenance
record for each equipment¶
In [27]:
# Define columns for comparison
comparison_cols = ['Equipment_ID', 'Type', 'Model', 'Manufacturer', 'Year_Acquired', 'Status', 'Current_Base']
# Sort the DataFrame by comparison columns and 'Last_Maintenance', then keep the last occurrence
resolved_maintenance = near_duplicates.sort_values(comparison_cols + ['Last_Maintenance'])
resolved_maintenance = resolved_maintenance.drop_duplicates(subset = comparison_cols, keep = 'last').reset_index(drop = True)
11. When handling near-duplicates by keeping the most recent entry, which operation is typically performed first?¶
In [ ]:
comparison_cols = ['Equipment_ID', 'Type', 'Model', 'Manufacturer', 'Year_Acquired', 'Status', 'Current_Base']
12. What's a potential downside of always keeping the lower Operational_Cost
entry when resolving near-duplicates?¶
In [ ]:
13. Identify equipment with identical Type
and Model
¶
In [29]:
# Filter duplicate rows based on 'Type' and 'Mode
hidden_duplicates = df[df.duplicated(subset = ['Type','Model'] , keep = False)]
14. Count and rank Type
, Model
combinations in duplicates¶
In [34]:
# Count and sort occurrences of 'Type' and 'Model'type_model_counts
type_model_counts = hidden_duplicates.value_counts(subset = ['Type','Model']).sort_values(ascending = False)
15. Which pandas method is most useful for creating a hash of multiple columns?¶
In [ ]:
16. What's a potential risk of always keeping the entry with the lowest Equipment_ID when resolving hidden duplicates?¶
In [ ]:
Save your dataset. You can use the following code snippets depending on the file format you want.
In [70]:
# Save to CSV
# df.to_csv('df_cleaned.csv', index=False)