Statement of Completion#bd4ea05a
Data Cleaning with Pandas
medium
Cleaning and Standardizing Pharmaceutical Product Data
Resolution
Activities
Cleaning and Standardizing Pharmaceutical Product Data¶
Data Loading and Exploration¶
In [68]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [153]:
df = pd.read_csv('data/unclean.csv', low_memory=False)
In [70]:
df.head()
Out[70]:
PRODUCT_ID | PRODUCT_NDC | PRODUCT_TYPE_NAME | NON_PROPRIETARY_NAME | DOSAGEFORM_NAME | START_MARKETING_DATE | END_MARKETING_DATE | MARKETING_CATEGORY_NAME | LABELER_NAME | SUBSTANCE_NAME | ACTIVE_NUMERATOR_STRENGTH | ACTIVE_INGRED_UNIT | DEA_SCHEDULE | LISTING_RECORD_CERTIFIED_THROUGH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0002-0013_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0013 | DRUG FOR FURTHER PROCESSING | Insulin human | INJECTION, SOLUTION | 19830627 | NaN | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.0 | [iU]/mL | NaN | 20241231.0 |
1 | 0002-0095_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0095 | DRUG FOR FURTHER PROCESSING | Insulin human | INJECTION, SOLUTION | 20151229 | NaN | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 500.0 | [iU]/mL | NaN | 20241231.0 |
2 | 0002-0096_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0096 | DRUG FOR FURTHER PROCESSING | Insulin human | INJECTION, SOLUTION | 19940331 | NaN | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 500.0 | [iU]/mL | NaN | 20241231.0 |
3 | 0002-0113_55aa4078-e1e8-4338-b2b5-a9298b3c16e2 | 0002-0113 | DRUG FOR FURTHER PROCESSING | Insulin human | INJECTION, SUSPENSION | 19830627 | NaN | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.0 | [iU]/mL | NaN | 20241231.0 |
4 | 0002-0119_55aa4078-e1e8-4338-b2b5-a9298b3c16e2 | 0002-0119 | DRUG FOR FURTHER PROCESSING | Insulin human | INJECTION, SUSPENSION | 20131107 | NaN | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.0 | [iU]/mL | NaN | 20241231.0 |
In [71]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 19591 entries, 0 to 19590 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PRODUCT_ID 19591 non-null object 1 PRODUCT_NDC 19591 non-null object 2 PRODUCT_TYPE_NAME 19591 non-null object 3 NON_PROPRIETARY_NAME 19591 non-null object 4 DOSAGEFORM_NAME 19591 non-null object 5 START_MARKETING_DATE 19591 non-null int64 6 END_MARKETING_DATE 921 non-null float64 7 MARKETING_CATEGORY_NAME 19591 non-null object 8 LABELER_NAME 19591 non-null object 9 SUBSTANCE_NAME 19544 non-null object 10 ACTIVE_NUMERATOR_STRENGTH 18342 non-null float64 11 ACTIVE_INGRED_UNIT 19544 non-null object 12 DEA_SCHEDULE 1077 non-null object 13 LISTING_RECORD_CERTIFIED_THROUGH 18670 non-null float64 dtypes: float64(3), int64(1), object(10) memory usage: 2.1+ MB
In [6]:
df.describe()
Out[6]:
START_MARKETING_DATE | END_MARKETING_DATE | ACTIVE_NUMERATOR_STRENGTH | LISTING_RECORD_CERTIFIED_THROUGH | |
---|---|---|---|---|
count | 1.959100e+04 | 9.210000e+02 | 1.834200e+04 | 1.867000e+04 |
mean | 2.013771e+07 | 2.025381e+07 | 2.503277e+04 | 2.024264e+07 |
std | 9.263274e+04 | 1.054916e+04 | 1.373119e+06 | 3.556366e+03 |
min | 1.940010e+07 | 2.024053e+07 | 2.000000e-03 | 2.017123e+07 |
25% | 2.010096e+07 | 2.025012e+07 | 1.000000e+00 | 2.024123e+07 |
50% | 2.016042e+07 | 2.025073e+07 | 1.000000e+00 | 2.024123e+07 |
75% | 2.020093e+07 | 2.026082e+07 | 1.500000e+01 | 2.024123e+07 |
max | 2.024053e+07 | 2.030123e+07 | 1.000000e+08 | 2.025123e+07 |
1. Identify Missing and Null Values¶
In [72]:
df.isnull().sum()
Out[72]:
PRODUCT_ID 0 PRODUCT_NDC 0 PRODUCT_TYPE_NAME 0 NON_PROPRIETARY_NAME 0 DOSAGEFORM_NAME 0 START_MARKETING_DATE 0 END_MARKETING_DATE 18670 MARKETING_CATEGORY_NAME 0 LABELER_NAME 0 SUBSTANCE_NAME 47 ACTIVE_NUMERATOR_STRENGTH 1249 ACTIVE_INGRED_UNIT 47 DEA_SCHEDULE 18514 LISTING_RECORD_CERTIFIED_THROUGH 921 dtype: int64
In [73]:
# write your code
missing_values = df.isnull().sum()
2. Missing Values in END_MARKETING_DATE
Column¶
In [74]:
missing_values['END_MARKETING_DATE']
Out[74]:
18670
Visualizing Missing Data¶
Let's visualize the missing or null values in a heatmap using Seaborn and Matplotlib. Don't worry if you don't know how to plot; we have provided the code for you. You will practice plotting using Matplotlib and Seaborn in the next set of skill tracks. Run the cell below.
In [75]:
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize missing values using a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Heatmap of Missing Values')
plt.show()
You can see the yellow parts in the END_MARKETING_DATE
column and other columns. These indicate missing or null values.
3. Identify Numerical Columns¶
In [154]:
df.select_dtypes(include = ['float64', 'int64'])
Out[154]:
START_MARKETING_DATE | END_MARKETING_DATE | ACTIVE_NUMERATOR_STRENGTH | LISTING_RECORD_CERTIFIED_THROUGH | |
---|---|---|---|---|
0 | 19830627 | NaN | 100.0 | 20241231.0 |
1 | 20151229 | NaN | 500.0 | 20241231.0 |
2 | 19940331 | NaN | 500.0 | 20241231.0 |
3 | 19830627 | NaN | 100.0 | 20241231.0 |
4 | 20131107 | NaN | 100.0 | 20241231.0 |
... | ... | ... | ... | ... |
19586 | 20230816 | NaN | 1.0 | 20241231.0 |
19587 | 20230915 | NaN | 1.0 | 20241231.0 |
19588 | 20230915 | NaN | 1.0 | 20241231.0 |
19589 | 20230925 | NaN | 1.0 | 20241231.0 |
19590 | 20210730 | NaN | 300.0 | 20241231.0 |
19591 rows × 4 columns
In [155]:
# Write your code
numerical_cols = df.select_dtypes(include = ['float64', 'int64']).columns
In [78]:
numerical_cols
Out[78]:
Index(['START_MARKETING_DATE', 'END_MARKETING_DATE', 'ACTIVE_NUMERATOR_STRENGTH', 'LISTING_RECORD_CERTIFIED_THROUGH'], dtype='object')
4. Impute Missing Values in Numerical Columns Using Mean¶
In [ ]:
# Write your code
In [156]:
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].mean())
5. Identify Categorical Columns¶
In [ ]:
In [157]:
categorical_cols = df.select_dtypes(include = 'object').columns
In [81]:
categorical_cols
Out[81]:
Index(['PRODUCT_ID', 'PRODUCT_NDC', 'PRODUCT_TYPE_NAME', 'NON_PROPRIETARY_NAME', 'DOSAGEFORM_NAME', 'MARKETING_CATEGORY_NAME', 'LABELER_NAME', 'SUBSTANCE_NAME', 'ACTIVE_INGRED_UNIT', 'DEA_SCHEDULE'], dtype='object')
In [102]:
df[categorical_cols].isnull().sum()
Out[102]:
PRODUCT_ID 0 PRODUCT_NDC 0 PRODUCT_TYPE_NAME 0 NON_PROPRIETARY_NAME 0 DOSAGEFORM_NAME 0 MARKETING_CATEGORY_NAME 0 LABELER_NAME 0 SUBSTANCE_NAME 47 ACTIVE_INGRED_UNIT 47 DEA_SCHEDULE 18513 dtype: int64
6. Impute Missing Values in Categorical Columns Using Mode¶
In [106]:
categorical_cols
Out[106]:
Index(['PRODUCT_ID', 'PRODUCT_NDC', 'PRODUCT_TYPE_NAME', 'NON_PROPRIETARY_NAME', 'DOSAGEFORM_NAME', 'MARKETING_CATEGORY_NAME', 'LABELER_NAME', 'SUBSTANCE_NAME', 'ACTIVE_INGRED_UNIT', 'DEA_SCHEDULE'], dtype='object')
In [158]:
for col in categorical_cols:
df[col] = df[col].fillna(df[col].mode()[0])
In [ ]:
In [99]:
df['SUBSTANCE_NAME'].fillna(df['SUBSTANCE_NAME'].mode()[0]).isnull().sum()
Out[99]:
0
In [107]:
Out[107]:
7078 BENZOIC ACID; CETOSTEARYL ALCOHOL; POLAWAX POL... 7079 NaN 7080 LEVODOPA 7081 SILDENAFIL CITRATE 7082 ORPHENADRINE CITRATE 7083 PHENTERMINE HYDROCHLORIDE 7084 BUTYLATED HYDROXYTOLUENE 7085 BUDESONIDE 7086 ALPHA-TOCOPHEROL ACETATE; BENZOIC ACID; DIMETH... 7087 KETOROLAC TROMETHAMINE 7088 MINOCYCLINE HYDROCHLORIDE 7089 BENZOIC ACID Name: SUBSTANCE_NAME, dtype: object
7. Identify Duplicate Rows¶
In [116]:
# write your code
df.duplicated().sum()
Out[116]:
0 False 1 False 2 False 3 False 4 False ... 19586 False 19587 False 19588 False 19589 False 19590 False Length: 19591, dtype: bool
8. Identify Duplicates Based on Specific Columns¶
In [159]:
# Write your code
duplicate_specific = df[df.duplicated(subset = ['PRODUCT_NDC', 'LABELER_NAME'])]
duplicate_specific
Out[159]:
PRODUCT_ID | PRODUCT_NDC | PRODUCT_TYPE_NAME | NON_PROPRIETARY_NAME | DOSAGEFORM_NAME | START_MARKETING_DATE | END_MARKETING_DATE | MARKETING_CATEGORY_NAME | LABELER_NAME | SUBSTANCE_NAME | ACTIVE_NUMERATOR_STRENGTH | ACTIVE_INGRED_UNIT | DEA_SCHEDULE | LISTING_RECORD_CERTIFIED_THROUGH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
80 | 0009-5023_2c4ae21b-64f6-4a7e-b1a8-d9c38ba71aa8 | 0009-5023 | BULK INGREDIENT | Progesterone | POWDER | 20140703 | 2.025381e+07 | BULK INGREDIENT | Pharmacia and Upjohn Company LLC | PROGESTERONE | 1.0 | kg/kg | CII | 20241231.0 |
81 | 0009-5023_7415e030-cd86-4806-8d85-5b33539bafcb | 0009-5023 | BULK INGREDIENT | Progesterone | POWDER | 20140703 | 2.025381e+07 | BULK INGREDIENT | Pharmacia and Upjohn Company LLC | PROGESTERONE | 1.0 | kg/kg | CII | 20241231.0 |
88 | 0009-5038_8d2d213f-9a67-4a9a-b319-0a10b471dce3 | 0009-5038 | BULK INGREDIENT | Spectinomycin | POWDER | 20140703 | 2.025381e+07 | BULK INGREDIENT | Pharmacia and Upjohn Company LLC | SPECTINOMYCIN | 1.0 | kg/kg | CII | 20241231.0 |
89 | 0009-5038_f1d0432e-642e-4c8b-af8d-b37ac100e205 | 0009-5038 | BULK INGREDIENT | Spectinomycin | POWDER | 20140703 | 2.025381e+07 | BULK INGREDIENT | Pharmacia and Upjohn Company LLC | SPECTINOMYCIN | 1.0 | kg/kg | CII | 20241231.0 |
104 | 0009-5238_56520ceb-1886-4e95-9b6a-53abfa725652 | 0009-5238 | BULK INGREDIENT | Neomycin Sulfate | POWDER | 20140703 | 2.025381e+07 | BULK INGREDIENT | Pharmacia and Upjohn Company LLC | NEOMYCIN SULFATE | 1.0 | kg/kg | CII | 20241231.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19564 | 90027-008_ec625bfe-7aab-c607-e053-2995a90add54 | 90027-008 | BULK INGREDIENT | Sodium Hyaluronate | POWDER | 20210113 | 2.025381e+07 | BULK INGREDIENT FOR HUMAN PRESCRIPTION COMPOUN... | QINGDAO TIDA BIOTEK CO.,LTD | HYALURONATE SODIUM | 1.0 | g/g | CII | 20241231.0 |
19566 | 90027-010_ec62616d-2317-0a00-e053-2a95a90a7d3b | 90027-010 | BULK INGREDIENT | Adapalene | POWDER | 20210123 | 2.025381e+07 | BULK INGREDIENT FOR HUMAN PRESCRIPTION COMPOUN... | QINGDAO TIDA BIOTEK CO.,LTD | ADAPALENE | 1.0 | g/g | CII | 20241231.0 |
19568 | 90027-011_ec625bfe-7aac-c607-e053-2995a90add54 | 90027-011 | BULK INGREDIENT | Ibutamoren Mesylate | POWDER | 20210421 | 2.025381e+07 | BULK INGREDIENT FOR HUMAN PRESCRIPTION COMPOUN... | QINGDAO TIDA BIOTEK CO.,LTD | IBUTAMOREN MESYLATE | 1.0 | kg/kg | CII | 20241231.0 |
19569 | 90027-011_ec649ffa-3b09-61d4-e053-2a95a90a9ccc | 90027-011 | BULK INGREDIENT | Ibutamoren Mesylate | POWDER | 20210421 | 2.025381e+07 | BULK INGREDIENT FOR HUMAN PRESCRIPTION COMPOUN... | QINGDAO TIDA BIOTEK CO.,LTD | IBUTAMOREN MESYLATE | 1.0 | kg/kg | CII | 20241231.0 |
19572 | 90027-014_e9a270a9-5554-7f7c-e053-2a95a90ad8d4 | 90027-014 | BULK INGREDIENT | Dexpanthenol | LIQUID | 20220927 | 2.025381e+07 | BULK INGREDIENT FOR HUMAN PRESCRIPTION COMPOUN... | QINGDAO TIDA BIOTEK CO.,LTD | DEXPANTHENOL | 1.0 | kg/kg | CII | 20241231.0 |
333 rows × 14 columns
9. How many records are in duplicate_specific
?¶
In [126]:
duplicate_specific.shape[0]
Out[126]:
333
10. Remove duplicate rows based on specific columns, keeping the last occurrence¶
In [160]:
# Write your code
df.drop_duplicates(subset = ['PRODUCT_NDC', 'LABELER_NAME'], keep = 'last', inplace = True)
11. Standardize Product Type Names¶
In [132]:
df["PRODUCT_TYPE_NAME"].isnull().sum()
Out[132]:
0
In [140]:
df["PRODUCT_TYPE_NAME"][340:360]
Out[140]:
347 DRUG 348 DRUG 349 DRUG 350 DRUG 351 DRUG 352 DRUG 353 DRUG 354 DRUG 355 DRUG 356 DRUG 357 DRUG 358 DRUG 359 DRUG 360 DRUG 361 DRUG 362 DRUG 363 DRUG 364 DRUG 365 DRUG 366 DRUG Name: PRODUCT_TYPE_NAME, dtype: object
In [162]:
import numpy as np
In [169]:
product_type_mapping = {'DRUG FOR FURTHER PROCESSING': 'DRUG',
'HUMAN OTC DRUG': np.nan,
'HUMAN PRESCRIPTION DRUG': np.nan,
'VACCINE': np.nan,
'LICENSED VACCINE BULK INTERMEDIATE': np.nan,
'BULK INGREDIENT': 'INGREDIENT'}
df["PRODUCT_TYPE_NAME"] = df["PRODUCT_TYPE_NAME"].replace(product_type_mapping)
12. Extract Relevant Information from Product ID¶
In [178]:
df['PRODUCT_CODE'] = df['PRODUCT_ID'].str.split('_', expand = True)[0]
In [ ]:
df['PRODUCT_CODE'] = ...
13. Standardize Active Ingredient Units¶
In [181]:
df['ACTIVE_INGRED_UNIT'].head()
Out[181]:
0 [iU]/mL 1 [iU]/mL 2 [iU]/mL 3 [iU]/mL 4 [iU]/mL ... 19586 kg/kg 19587 g/g 19588 g/g 19589 g/g 19590 mg/2mL Name: ACTIVE_INGRED_UNIT, Length: 19258, dtype: object
In [182]:
# Write your code
def standardize_unit(unit):
unit = str(unit).lower().strip()
if unit in ["iu/ml", "[iu]/ml"]:
return "IU/mL"
elif unit in ["mg/1", "mg/ml"]:
return "mg/mL"
elif unit == "mg/.5ml":
return "mg/0.5mL"
elif unit in ["mg/1.4ml", "mg/2ml", "mg/2.7ml", "mg/4ml", "mg/20ml"]:
return unit.replace("ml", "mL")
elif unit in ["g/g", "kg/kg"]:
return "g/g"
elif unit in ["u/5ml", "ml/ml"]:
return unit.replace("ml", "mL")
else:
return unit.upper()
df["ACTIVE_INGRED_UNIT"] = df["ACTIVE_INGRED_UNIT"].apply(standardize_unit)
14. Create a Concatenated Product Description¶
In [191]:
# Write your code
df['PRODUCT_DESCRIPTION'] = (df['NON_PROPRIETARY_NAME'] + " " +
df['DOSAGEFORM_NAME'] + " " +
df['ACTIVE_NUMERATOR_STRENGTH'].astype(str) + " " +
df['ACTIVE_INGRED_UNIT'])
In [192]:
df['PRODUCT_DESCRIPTION'][0]
Out[192]:
'Insulin human INJECTION, SOLUTION 100.0 IU/mL'
15. Identify Outliers or Extreme Values¶
In [208]:
Q1 = df["ACTIVE_NUMERATOR_STRENGTH"].quantile(0.25)
Q3 = df["ACTIVE_NUMERATOR_STRENGTH"].quantile(0.75)
IQR = Q3 - Q1
In [196]:
Q1, Q3, IQR
Out[196]:
(1.0, 25.0, 24.0)
In [202]:
df["ACTIVE_NUMERATOR_STRENGTH"] > (Q3 + 1.5 * IQR)
Out[202]:
0 True 1 True 2 True 3 True 4 True ... 19586 False 19587 False 19588 False 19589 False 19590 True Name: ACTIVE_NUMERATOR_STRENGTH, Length: 19258, dtype: bool
In [209]:
# Write your code
outliers = df.loc[(df["ACTIVE_NUMERATOR_STRENGTH"] < (Q1 - 1.5 * IQR)) | (df["ACTIVE_NUMERATOR_STRENGTH"] > (Q3 + 1.5 * IQR)), :]
outliers
Out[209]:
PRODUCT_ID | PRODUCT_NDC | PRODUCT_TYPE_NAME | NON_PROPRIETARY_NAME | DOSAGEFORM_NAME | START_MARKETING_DATE | END_MARKETING_DATE | MARKETING_CATEGORY_NAME | LABELER_NAME | SUBSTANCE_NAME | ACTIVE_NUMERATOR_STRENGTH | ACTIVE_INGRED_UNIT | DEA_SCHEDULE | LISTING_RECORD_CERTIFIED_THROUGH | PRODUCT_CODE | PRODUCT_DESCRIPTION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0002-0013_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0013 | DRUG | Insulin human | INJECTION, SOLUTION | 19830627 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.000000 | IU/mL | CII | 20241231.0 | 0002-0013 | Insulin human INJECTION, SOLUTION 100.0 IU/mL |
1 | 0002-0095_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0095 | DRUG | Insulin human | INJECTION, SOLUTION | 20151229 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 500.000000 | IU/mL | CII | 20241231.0 | 0002-0095 | Insulin human INJECTION, SOLUTION 500.0 IU/mL |
2 | 0002-0096_15810e26-b479-41b8-9d13-9dc503a71a2a | 0002-0096 | DRUG | Insulin human | INJECTION, SOLUTION | 19940331 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 500.000000 | IU/mL | CII | 20241231.0 | 0002-0096 | Insulin human INJECTION, SOLUTION 500.0 IU/mL |
3 | 0002-0113_55aa4078-e1e8-4338-b2b5-a9298b3c16e2 | 0002-0113 | DRUG | Insulin human | INJECTION, SUSPENSION | 19830627 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.000000 | IU/mL | CII | 20241231.0 | 0002-0113 | Insulin human INJECTION, SUSPENSION 100.0 IU/mL |
4 | 0002-0119_55aa4078-e1e8-4338-b2b5-a9298b3c16e2 | 0002-0119 | DRUG | Insulin human | INJECTION, SUSPENSION | 20131107 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | Eli Lilly and Company | INSULIN HUMAN | 100.000000 | IU/mL | CII | 20241231.0 | 0002-0119 | Insulin human INJECTION, SUSPENSION 100.0 IU/mL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19444 | 83938-677_a2578f1f-f68a-41a7-bfcf-1b25f757a9d0 | 83938-677 | DRUG | Lidocaine Hydrochloride, Tetracaine Hydrochlor... | LIQUID | 20240215 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | LYNEX Cosmotech Lab Mfg Inc. | LIDOCAINE HYDROCHLORIDE; RACEPINEPHRINE HYDROC... | 25032.770628 | G/KG; MG/KG; G/KG | CII | 20251231.0 | 83938-677 | Lidocaine Hydrochloride, Tetracaine Hydrochlor... |
19445 | 83983-111_0f1f9e6d-d164-52a1-e063-6394a90a285b | 83983-111 | INGREDIENT | Ethanol | LIQUID | 20240117 | 2.025381e+07 | BULK INGREDIENT | Taicang Xintai Alcohol Co Ltd | ALCOHOL | 100.000000 | L/100L | CII | 20251231.0 | 83983-111 | Ethanol LIQUID 100.0 L/100L |
19456 | 84144-024_14bcbce6-6bc7-2032-e063-6394a90a25c2 | 84144-024 | DRUG | No Carrier Added (NCA) Lutetium (177Lu) Chloride | SOLUTION | 20240318 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | SHINE Technologies, LLC | LUTETIUM CHLORIDE LU-177 | 88.800000 | GBQ/ML | CII | 20251231.0 | 84144-024 | No Carrier Added (NCA) Lutetium (177Lu) Chlori... |
19476 | 84331-101_2fdcc730-d372-474e-865f-b6ebfd1cfc34 | 84331-101 | DRUG | ALUMINUM CHLOROHYDRATE | SOLUTION | 20240515 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | DIA-CHEMICAL SDN. BHD. | ALUMINUM CHLOROHYDRATE | 500.000000 | G/L | CII | 20251231.0 | 84331-101 | ALUMINUM CHLOROHYDRATE SOLUTION 500.0 G/L |
19590 | 90112-3000_3467222f-bda4-4d8c-8d10-90b0a6b378c9 | 90112-3000 | DRUG | anifrolumab | INJECTION, SOLUTION | 20210730 | 2.025381e+07 | DRUG FOR FURTHER PROCESSING | AstraZeneca Nijmegen B.V. | ANIFROLUMAB | 300.000000 | mg/2mL | CII | 20241231.0 | 90112-3000 | anifrolumab INJECTION, SOLUTION 300.0 mg/2mL |
3453 rows × 16 columns
In this plot to we can see the distribution of values in the ACTIVE_NUMERATOR_STRENGTH
column of the df, highlighting the outliers identified based on the interquartile range (IQR) method. Run the below cell.
In [211]:
# Create a box plot
plt.figure(figsize=(8, 6))
sns.boxplot(x=df["ACTIVE_NUMERATOR_STRENGTH"], data=df, whis=1.5, showfliers=False)
sns.scatterplot(x=outliers["ACTIVE_NUMERATOR_STRENGTH"], y=np.random.normal(size=len(outliers)), data=outliers, color="red", label="Outliers")
plt.title("Plot of ACTIVE_NUMERATOR_STRENGTH")
plt.xlabel("ACTIVE_NUMERATOR_STRENGTH")
plt.ylabel("")
plt.legend()
plt.show()
16. Format START_MARKETING_DATE
Column¶
In [214]:
# Write your code here
df['START_MARKETING_DATE'] = pd.to_datetime(df['START_MARKETING_DATE'], format = '%Y%m%d').dt.strftime('%d-%m-%Y')
17. Drop END_MARKETING_DATE
Column¶
In [216]:
# Write your code here
df.drop(columns = 'END_MARKETING_DATE', inplace = True)
18. Drop Outliers Based on ACTIVE_NUMERATOR_STRENGTH
Column¶
In [219]:
outliers = df.loc[(df["ACTIVE_NUMERATOR_STRENGTH"] < (Q1 - 1.5 * IQR)) | (df["ACTIVE_NUMERATOR_STRENGTH"] > (Q3 + 1.5 * IQR)), :]
outliers.index
Out[219]:
Index([ 0, 1, 2, 3, 4, 5, 6, 7, 12, 14, ... 19416, 19417, 19437, 19442, 19443, 19444, 19445, 19456, 19476, 19590], dtype='int64', length=3453)
In [221]:
df_copy = df
In [222]:
# write your code
df.drop(index = outliers.index, inplace = True)
In [ ]: