Classification in Depth with Scikit-Learn
Project: Introduction to classification algorithms¶
Libraries¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
Generate Your Sample Dataset¶
In [2]:
X, y = make_blobs(n_samples=1000, centers=2,
random_state=0, cluster_std=1.3)
Quick exploration¶
In [3]:
print('X shape:', X.shape)
X shape: (1000, 2)
In [4]:
print('y shape:', y.shape)
y shape: (1000,)
1- Plot the relation between X1 and X2.
These cells are intentionally left blank for you to practice
In [5]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
Simple classification algorithm¶
2- Decision tree with Scikit-Learn:
Now, we are going to train a decision tree to classify our instances.
In [7]:
from sklearn.tree import DecisionTreeClassifier
# decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
Once our model is created, we need to train it on our data. We achieve this with the fit(...) method that all Scikit-Learn model classes provide.
In [8]:
tree.fit(X, y)
Out[8]:
DecisionTreeClassifier(max_depth=3, random_state=42)
3- Check the model predictions
These cells are intentionally left blank for you to practice
In [13]:
a = tree.predict(X)
In [15]:
print(a[874])
print(a[249])
0
1
4- Now, compare the prediction with the real label.
These cells are intentionally left blank for you to practice
In [16]:
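A minimal sketch for this cell, reusing the predictions a and labels y from above (the indices 874 and 249 are just the examples used earlier):

print('predicted:', a[874], '| actual:', y[874])
print('predicted:', a[249], '| actual:', y[249])
# fraction of instances where the prediction matches the real label
print('agreement:', (a == y).mean())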
Decision boundaries¶
5- Use the function provided in the task to explore what the decision regions of our tree look like once it is trained.
These cells are intentionally left blank for you to practice
In [17]:
def visualize_classifier(model, X, y, ax=None, cmap='bwr'):
    ax = ax or plt.gca()
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3, alpha=0.5)
    ax.axis('tight')
    ax.set_xlabel('x1')
    ax.set_ylabel('x2')
    # ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Evaluate the model over a dense grid spanning the plot
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)
    ax.set(xlim=xlim, ylim=ylim)
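The function above is only defined, never called; a usage sketch on the tree fitted earlier (not part of the provided code) would draw the decision regions:

visualize_classifier(tree, X, y)
plt.show()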
Accuracy score¶
6- Use the code provided to compute the accuracy_score of the model.
These cells are intentionally left blank for you to practice
In [18]:
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X)
accuracy_score(y, y_pred)  # signature is accuracy_score(y_true, y_pred); accuracy is the same either way
Out[18]:
0.905
7- Compute the confusion matrix.
These cells are intentionally left blank for you to practice
In [19]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap=plt.get_cmap('Blues'))
Out[19]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x77fdec709550>
Check your knowledge: Visualizing a Decision Tree¶
8- First, read the dataset county_election.csv as a pandas dataframe and run the df.info() and df.describe() methods to better understand the dataset.
These cells are intentionally left blank for you to practice
In [20]:
df1 = pd.read_csv('county_election.csv')
In [21]:
df1.head()
Out[21]:
|   | state | fipscode | county | population | hispanic | minority | female | unemployed | income | nodegree | bachelor | inactivity | obesity | density | cancer | trump | clinton | votergap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1001 | Autauga County | 50756 | 2.842 | 22.733 | 51.475 | 5.2 | 54366 | 13.8 | 21.9 | 28.6 | 34.1 | 91.8 | 186.5 | 73.436 | 23.957 | 49.479 |
| 1 | Alabama | 1003 | Baldwin County | 179878 | 4.550 | 12.934 | 51.261 | 5.5 | 49626 | 11.0 | 28.6 | 22.3 | 27.4 | 114.6 | 229.4 | 77.351 | 19.565 | 57.786 |
| 2 | Alabama | 1007 | Bibb County | 21587 | 2.409 | 23.930 | 46.110 | 6.6 | 39546 | 22.1 | 10.2 | 33.9 | 40.3 | 36.8 | 230.3 | 76.966 | 21.422 | 55.544 |
| 3 | Alabama | 1009 | Blount County | 58345 | 8.954 | 4.229 | 50.592 | 5.4 | 45567 | 21.9 | 12.3 | 28.0 | 34.6 | 88.9 | 205.3 | 89.852 | 8.470 | 81.382 |
| 4 | Alabama | 1011 | Bullock County | 10985 | 7.526 | 72.831 | 45.241 | 7.8 | 26580 | 34.5 | 14.1 | 31.7 | 43.0 | 17.5 | 211.2 | 24.229 | 75.090 | -50.862 |
In [22]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2620 entries, 0 to 2619
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   state       2620 non-null   object
 1   fipscode    2620 non-null   int64
 2   county      2620 non-null   object
 3   population  2620 non-null   int64
 4   hispanic    2620 non-null   float64
 5   minority    2620 non-null   float64
 6   female      2620 non-null   float64
 7   unemployed  2620 non-null   float64
 8   income      2620 non-null   int64
 9   nodegree    2620 non-null   float64
 10  bachelor    2620 non-null   float64
 11  inactivity  2620 non-null   float64
 12  obesity     2620 non-null   float64
 13  density     2620 non-null   float64
 14  cancer      2580 non-null   float64
 15  trump       2620 non-null   float64
 16  clinton     2620 non-null   float64
 17  votergap    2620 non-null   float64
dtypes: float64(13), int64(3), object(2)
memory usage: 368.6+ KB
In [23]:
df1.describe()
Out[23]:
|   | fipscode | population | hispanic | minority | female | unemployed | income | nodegree | bachelor | inactivity | obesity | density | cancer | trump | clinton | votergap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2620.000000 | 2.620000e+03 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2620.000000 | 2580.000000 | 2620.000000 | 2620.000000 | 2620.000000 |
| mean | 30697.507634 | 9.822633e+04 | 9.247841 | 14.610524 | 49.940028 | 5.486756 | 47137.843511 | 14.997672 | 20.105496 | 25.954962 | 30.991832 | 265.478931 | 228.681589 | 63.568398 | 31.706439 | 31.861962 |
| std | 14945.089720 | 3.284333e+05 | 13.806757 | 15.823972 | 2.211987 | 1.946932 | 12007.432050 | 6.785728 | 8.981552 | 5.201513 | 4.496507 | 1767.597996 | 55.337794 | 15.638465 | 15.384481 | 30.889891 |
| min | 1001.000000 | 4.500000e+01 | 0.205000 | 0.855000 | 28.479000 | 1.800000 | 21658.000000 | 1.300000 | 2.600000 | 8.700000 | 11.800000 | 0.100000 | 47.100000 | 4.122000 | 3.145000 | -88.725000 |
| 25% | 19044.500000 | 1.121975e+04 | 2.100000 | 4.116750 | 49.480500 | 4.100000 | 39152.250000 | 9.900000 | 13.900000 | 22.600000 | 28.400000 | 17.700000 | 193.750000 | 55.002500 | 20.458000 | 15.022000 |
| 50% | 29204.000000 | 2.570450e+04 | 4.009500 | 7.950000 | 50.364000 | 5.300000 | 45227.500000 | 13.550000 | 17.800000 | 25.800000 | 31.200000 | 46.250000 | 230.200000 | 66.699000 | 28.414000 | 38.264500 |
| 75% | 46005.500000 | 6.532075e+04 | 9.503000 | 19.238750 | 51.039250 | 6.500000 | 52602.250000 | 19.200000 | 23.725000 | 29.400000 | 33.800000 | 113.325000 | 265.100000 | 75.098500 | 40.011500 | 54.699000 |
| max | 56045.000000 | 9.848011e+06 | 95.824000 | 93.411000 | 56.739000 | 24.000000 | 125635.000000 | 53.300000 | 75.100000 | 41.400000 | 47.600000 | 69468.400000 | 445.400000 | 95.273000 | 92.847000 | 91.636000 |
9- Separate the target and the features into two variables and create the response variable based on the columns trump and clinton.
Remember that we will consider only two predictors: minority and bachelor.
These cells are intentionally left blank for you to practice
In [28]:
X = df1[['minority', 'bachelor']]
y = np.where(df1.trump > df1.clinton, 1, 0)  # 1 if trump beat clinton in the county, else 0
print(X)
print(y)
      minority  bachelor
0       22.733      21.9
1       12.934      28.6
2       23.930      10.2
3        4.229      12.3
4       72.831      14.1
...        ...       ...
2615     5.846      18.1
2616     4.778      51.9
2617     4.601      18.7
2618     5.259      21.2
2619     4.769      16.8

[2620 rows x 2 columns]
[1 1 1 ... 1 1 1]
10- Initialize a Decision Tree classifier (name this variable clf) and fit it on the data with random_state=42 and max_depth=3 (the maximum depth of our decision tree, set via the max_depth parameter).
These cells are intentionally left blank for you to practice
In [29]:
from sklearn.tree import DecisionTreeClassifier
tree1 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf = tree1.fit(X, y)  # fit() returns the fitted estimator, so clf and tree1 are the same object
In [32]:
y_pred = tree1.predict(X)
from sklearn.metrics import accuracy_score
accuracy_score(y,y_pred)
Out[32]:
0.9080152671755726
In [33]:
from sklearn import tree
# Plot the Decision Tree trained above with parameters filled as True
plt.figure(figsize=(10, 8))
tree.plot_tree(clf, filled=True)
plt.show()
- Calculate the accuracy score on the training dataset.
These cells are intentionally left blank for you to practice
In [ ]:
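A minimal sketch for this cell, reusing clf, X, and y from above:

from sklearn.metrics import accuracy_score
train_acc = accuracy_score(y, clf.predict(X))
print('train accuracy:', train_acc)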
- Select the correct code to visualize the decision tree (tree_plot).
These cells are intentionally left blank for you to practice
In [ ]:
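One possible answer, mirroring the plot produced earlier; the assignment to tree_plot and the feature_names values are assumptions based on the activity wording:

from sklearn import tree
plt.figure(figsize=(10, 8))
tree_plot = tree.plot_tree(clf, filled=True,
                           feature_names=['minority', 'bachelor'])  # feature_names assumed from task 9
plt.show()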