Classification in Depth with Scikit-Learn
Project: Introduction to classification algorithms¶
Libraries¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
Generate Your Sample Dataset¶
In [2]:
X, y = make_blobs(n_samples=1000, centers=2,
random_state=0, cluster_std=1.3)
Quick exploration¶
In [3]:
print('X shape:', X.shape)
X shape: (1000, 2)
In [4]:
print('y shape:', y.shape)
y shape: (1000,)
1- Plot the relation between X1 and X2.
These cells are intentionally left blank for you to practice
In [5]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
Simple classification algorithm¶
2- Decision tree with Scikit-Learn:
Now, we are going to train a decision tree to classify our instances.
In [7]:
from sklearn.tree import DecisionTreeClassifier
# decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
Once our model is created, we need to train it on our data. We achieve this with the fit(...) method that all Scikit-Learn model classes provide.
In [8]:
tree.fit(X, y)
Out[8]:
DecisionTreeClassifier(max_depth=3, random_state=42)
3- Check the model predictions
These cells are intentionally left blank for you to practice
In [13]:
a = tree.predict(X)
In [15]:
print(a[874])
print(a[249])
0
1
4- Now, compare the prediction with the real label.
These cells are intentionally left blank for you to practice
In [16]:
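A minimal sketch for this cell, reusing the predictions a and labels y from above (the indices 874 and 249 are just the examples used earlier):

print('predicted:', a[874], '| actual:', y[874])
print('predicted:', a[249], '| actual:', y[249])
# fraction of instances where the prediction matches the real label
print('agreement:', (a == y).mean())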
Decision boundaries¶
5- Use the function provided in the task to explore what the decision regions of our tree look like once it is trained.
These cells are intentionally left blank for you to practice
In [17]:
def visualize_classifier(model, X, y, ax=None, cmap='bwr'):
    ax = ax or plt.gca()
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3, alpha=0.5)
    ax.axis('tight')
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    # ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Evaluate the model over a dense grid spanning the plot
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)
    ax.set(xlim=xlim, ylim=ylim)
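The function above is only defined, never called; a usage sketch on the tree fitted earlier (not part of the provided code) would draw the decision regions:

visualize_classifier(tree, X, y)
plt.show()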
Accuracy score¶
6- Use the code provided to compute the accuracy_score of the model.
These cells are intentionally left blank for you to practice
In [18]:
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X)
accuracy_score(y, y_pred)  # signature is accuracy_score(y_true, y_pred); accuracy is the same either way
Out[18]:
0.905
7- Compute the confusion matrix.
These cells are intentionally left blank for you to practice
In [19]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap=plt.get_cmap('Blues'))
Out[19]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x77fdec709550>
Check your knowledge: Visualizing a Decision Tree¶
8- First, read the dataset county_election.csv as a pandas dataframe and run the df.info() and df.describe() methods to better understand the dataset.
These cells are intentionally left blank for you to practice
In [20]:
df1 = pd.read_csv('county_election.csv')
In [21]:
df1.head()
Out[21]:
|   | state | fipscode | county | population | hispanic | minority | female | unemployed | income | nodegree | bachelor | inactivity | obesity | density | cancer | trump | clinton | votergap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1001 | Autauga County | 50756 | 2.842 | 22.733 | 51.475 | 5.2 | 54366 | 13.8 | 21.9 | 28.6 | 34.1 | 91.8 | 186.5 | 73.436 | 23.957 | 49.479 |
| 1 | Alabama | 1003 | Baldwin County | 179878 | 4.550 | 12.934 | 51.261 | 5.5 | 49626 | 11.0 | 28.6 | 22.3 | 27.4 | 114.6 | 229.4 | 77.351 | 19.565 | 57.786 |
| 2 | Alabama | 1007 | Bibb County | 21587 | 2.409 | 23.930 | 46.110 | 6.6 | 39546 | 22.1 | 10.2 | 33.9 | 40.3 | 36.8 | 230.3 | 76.966 | 21.422 | 55.544 |
| 3 | Alabama | 1009 | Blount County | 58345 | 8.954 | 4.229 | 50.592 | 5.4 | 45567 | 21.9 | 12.3 | 28.0 | 34.6 | 88.9 | 205.3 | 89.852 | 8.470 | 81.382 |
| 4 | Alabama | 1011 | Bullock County | 10985 | 7.526 | 72.831 | 45.241 | 7.8 | 26580 | 34.5 | 14.1 | 31.7 | 43.0 | 17.5 | 211.2 | 24.229 | 75.090 | -50.862 |
In [22]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2620 entries, 0 to 2619
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   state       2620 non-null   object
 1   fipscode    2620 non-null   int64
 2   county      2620 non-null   object
 3   population  2620 non-null   int64
 4   hispanic    2620 non-null   float64
 5   minority    2620 non-null   float64
 6   female      2620 non-null   float64
 7   unemployed  2620 non-null   float64
 8   income      2620 non-null   int64
 9   nodegree    2620 non-null   float64
 10  bachelor    2620 non-null   float64
 11  inactivity  2620 non-null   float64
 12  obesity     2620 non-null   float64
 13  density     2620 non-null   float64
 14  cancer      2580 non-null   float64
 15  trump       2620 non-null   float64
 16  clinton     2620 non-null   float64
 17  votergap    2620 non-null   float64
dtypes: float64(13), int64(3), object(2)
memory usage: 368.6+ KB
In [23]:
df1.describe()
Out[23]:
|   | fipscode | population | hispanic | minority | female | unemployed | income | nodegree | bachelor | inactivity | obesity | density | cancer | trump | clinton | votergap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2620.000000 | 2.620000e+03 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2580.000000 | 2620.000000 | 2620.000000 | 2620.000000 |
| mean | 30697.507634 | 9.822633e+04 | 9.247841 | 14.610524 | 49.940028 | 5.486756 | 47137.843511 | 14.997672 | 20.105496 | 25.954962 | 30.991832 | 265.478931 | 228.681589 | 63.568398 | 31.706439 | 31.861962 |
| std | 14945.089720 | 3.284333e+05 | 13.806757 | 15.823972 | 2.211987 | 1.946932 | 12007.432050 | 6.785728 | 8.981552 | 5.201513 | 4.496507 | 1767.597996 | 55.337794 | 15.638465 | 15.384481 | 30.889891 |
| min | 1001.000000 | 4.500000e+01 | 0.205000 | 0.855000 | 28.479000 | 1.800000 | 21658.000000 | 1.300000 | 2.600000 | 8.700000 | 11.800000 | 0.100000 | 47.100000 | 4.122000 | 3.145000 | -88.725000 |
| 25% | 19044.500000 | 1.121975e+04 | 2.100000 | 4.116750 | 49.480500 | 4.100000 | 39152.250000 | 9.900000 | 13.900000 | 22.600000 | 28.400000 | 17.700000 | 193.750000 | 55.002500 | 20.458000 | 15.022000 |
| 50% | 29204.000000 | 2.570450e+04 | 4.009500 | 7.950000 | 50.364000 | 5.300000 | 45227.500000 | 13.550000 | 17.800000 | 25.800000 | 31.200000 | 46.250000 | 230.200000 | 66.699000 | 28.414000 | 38.264500 |
| 75% | 46005.500000 | 6.532075e+04 | 9.503000 | 19.238750 | 51.039250 | 6.500000 | 52602.250000 | 19.200000 | 23.725000 | 29.400000 | 33.800000 | 113.325000 | 265.100000 | 75.098500 | 40.011500 | 54.699000 |
| max | 56045.000000 | 9.848011e+06 | 95.824000 | 93.411000 | 56.739000 | 24.000000 | 125635.000000 | 53.300000 | 75.100000 | 41.400000 | 47.600000 | 69468.400000 | 445.400000 | 95.273000 | 92.847000 | 91.636000 |
9- Separate the target and the features into two variables and create the response variable based on the columns trump and clinton.
Remember that we will consider only two predictors: minority and bachelor.
These cells are intentionally left blank for you to practice
In [28]:
X = df1[['minority', 'bachelor']]
y = np.where(df1.trump > df1.clinton, 1, 0)  # 1 if trump beat clinton in the county, else 0
print(X)
print(y)
      minority  bachelor
0       22.733      21.9
1       12.934      28.6
2       23.930      10.2
3        4.229      12.3
4       72.831      14.1
...        ...       ...
2615     5.846      18.1
2616     4.778      51.9
2617     4.601      18.7
2618     5.259      21.2
2619     4.769      16.8

[2620 rows x 2 columns]
[1 1 1 ... 1 1 1]
10- Initialize a Decision Tree classifier (name this variable clf) and fit it on the data with random_state=42 and max_depth=3 (the maximum depth of our decision tree, set via the max_depth parameter).
These cells are intentionally left blank for you to practice
In [29]:
from sklearn.tree import DecisionTreeClassifier
tree1 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf = tree1.fit(X, y)  # fit() returns the fitted estimator, so clf and tree1 are the same object
In [32]:
y_pred = tree1.predict(X)
from sklearn.metrics import accuracy_score
accuracy_score(y,y_pred)
Out[32]:
0.9080152671755726
In [33]:
from sklearn import tree
# Plot the Decision Tree trained above with parameters filled as True
plt.figure(figsize=(10, 8))
tree.plot_tree(clf, filled=True)
plt.show()
- Calculate the accuracy score on the training dataset.
These cells are intentionally left blank for you to practice
In [ ]:
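A minimal sketch for this cell, reusing clf, X, and y from above:

from sklearn.metrics import accuracy_score
train_acc = accuracy_score(y, clf.predict(X))
print('train accuracy:', train_acc)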
- Select the correct code to visualize the decision tree (tree_plot).
These cells are intentionally left blank for you to practice
In [ ]:
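One possible answer, mirroring the plot produced earlier; the assignment to tree_plot and the feature_names values are assumptions based on the activity wording:

from sklearn import tree
plt.figure(figsize=(10, 8))
tree_plot = tree.plot_tree(clf, filled=True,
                           feature_names=['minority', 'bachelor'])  # feature_names assumed from task 9
plt.show()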