Decision Tree: Income Prediction¶
In this lab, we will build a decision tree to predict the income of a given population, which is labelled as <=$50K and >$50K. The attributes (predictors) are age, working class type, marital status, gender, race etc.
In the following sections, we'll:
- clean and prepare the data,
- build a decision tree with default hyperparameters,
- understand all the hyperparameters that we can tune, and finally
- choose the optimal hyperparameters using grid search cross-validation.
Understanding and Cleaning the Data¶
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Reading the csv file and putting it into 'df' object.
df = pd.read_csv("adult_dataset.csv")
# Let's understand the type of values in each column of our dataframe 'df'.
df.info()
# Let's look at the first few rows to see what the data looks like.
df.head()
You can observe that the columns workclass and occupation contain missing values, which are represented as '?' in the dataframe.
On looking a bit more closely, you will also find that whenever workclass has a missing value, occupation is also missing in that row. Let's check how many rows are missing.
# rows with missing values represented as '?'.
df_1 = df[df.workclass == '?']
df_1
Now we can check the number of rows in df_1.
df_1.info()
There are 1836 rows with missing values, which is about 5% of the total data. We choose to simply drop these rows.
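The earlier observation that a missing workclass always comes with a missing occupation can be checked directly rather than by eye. The following is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are assumptions for illustration, not part of the census file); the same two boolean masks could be built on `df`.

```python
import pandas as pd

# Toy stand-in for the census data; the values are made up for illustration.
toy = pd.DataFrame({
    "workclass":  ["Private", "?", "State-gov", "?", "Private"],
    "occupation": ["Sales", "?", "?", "?", "Tech-support"],
})

missing_workclass = toy["workclass"] == "?"
missing_occupation = toy["occupation"] == "?"

# Cross-tabulate the two missingness flags to see how they overlap.
co_missing = pd.crosstab(missing_workclass, missing_occupation)
print(co_missing)

# True when every row with a missing workclass also has a missing occupation.
workclass_implies_occupation = bool(missing_occupation[missing_workclass].all())

# The reverse need not hold: occupation can be missing on its own.
occupation_implies_workclass = bool(missing_workclass[missing_occupation].all())

print(workclass_implies_occupation, occupation_implies_workclass)
```

If the reverse check is False on the real data, some rows still carry a '?' in occupation after the workclass rows are dropped.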
# dropping the rows having missing values in workclass
df = df[df['workclass'] != '?']
df.head()
Let's see whether any other columns contain a "?". Since "?" is a string, we can apply this check only on the categorical columns.
# select all categorical variables
df_categorical = df.select_dtypes(include=['object'])
# checking whether any other columns contain a "?"
df_categorical.apply(lambda x: x=="?", axis=0).sum()
df.select_dtypes(include=['int64'])
Thus, the columns occupation and native.country contain some "?"s. Let's get rid of them.
# dropping the "?"s
df = df[df['occupation'] != '?']
df = df[df['native.country'] != '?']
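The column-by-column filters above can also be written as a single row mask. This is only a sketch on a small made-up frame (`toy`, `has_placeholder` and `toy_clean` are hypothetical names); it assumes '?' is the only placeholder for missing values.

```python
import pandas as pd

# Toy frame mimicking the layout of the census data (values are made up).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "?", "Craft-repair", "?"],
    "native.country": ["United-States", "Cuba", "?", "India"],
})

# True for every row that has a "?" in at least one column.
# Numeric columns never equal the string "?", so they contribute only False.
has_placeholder = (toy == "?").any(axis=1)

# Keep only the rows without any placeholder.
toy_clean = toy[~has_placeholder]
print(toy_clean)
```

Only the first row survives here, since each of the other rows has a '?' in some column.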
Now we have a clean dataframe which is ready for model building.
# clean dataframe
df.info()
Data Preparation¶
There are a number of preprocessing steps we need to do before building the model.
Firstly, note that we have both categorical and numeric features as predictors. In previous models such as linear and logistic regression, we had created dummy variables for categorical variables, since those models (being mathematical equations) can process only numeric variables.
All that is not required in decision trees, since they can process categorical variables easily. However, we still need to encode the categorical variables into a standard format so that sklearn can understand them and build the tree. We'll do that using the LabelEncoder() class, which comes with sklearn.preprocessing. Note that label encoding maps each category to an arbitrary integer, and the tree then splits on those integers as if they were ordered numeric values.
You can read more about LabelEncoder in the scikit-learn documentation.
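To make the encoding concrete, here is a small sketch of what LabelEncoder does to a column. It uses a made-up two-column frame and keeps one encoder per column in a dictionary (the `encoders` dictionary is an assumption of this sketch, not something the lab does), so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (values are made up for illustration).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "race": ["White", "Black", "Asian-Pac-Islander", "White"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and numbered 0, 1, 2, ...
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(encoded)
print(list(encoders["race"].classes_))

# The stored encoder can translate the integers back to labels.
decoded = encoders["race"].inverse_transform(encoded["race"])
print(list(decoded))
```

Reusing a single encoder across all columns, as `apply(le.fit_transform)` does, also works for encoding, but the encoder then only remembers the classes of the last column it was fitted on.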
from sklearn import preprocessing
# encode categorical variables using Label Encoder
# select all categorical variables
df_categorical = df.select_dtypes(include=['object'])
df_categorical.head()
# apply Label encoder to df_categorical
le = preprocessing.LabelEncoder()
df_categorical = df_categorical.apply(le.fit_transform)
df_categorical.head()
# concat df_categorical with original df
df = df.drop(df_categorical.columns, axis=1)
df = pd.concat([df, df_categorical], axis=1)
df.head()
# look at column types
df.info()
# convert target variable income to categorical
df['income'] = df['income'].astype('category')
Now all the categorical variables are suitably encoded. Let's build the model.
Model Building and Evaluation¶
Let's first build a decision tree with default hyperparameters. Then we'll use cross-validation to tune them.
# Importing train-test-split
from sklearn.model_selection import train_test_split
# Putting feature variable to X
X = df.drop('income',axis=1)
# Putting response variable to y
y = df['income']
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30,
random_state = 99)
X_train.head()
X_train.shape
# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier
# Fitting the decision tree with default hyperparameters, apart from
# max_depth which is 5 so that we can plot and read the tree.
dt_default = DecisionTreeClassifier(max_depth=5)
dt_default.fit(X_train, y_train)
# Let's check the evaluation metrics of our default model
# Importing classification report and confusion matrix from sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Making predictions
y_pred_default = dt_default.predict(X_test)
# Printing classification report
print(classification_report(y_test, y_pred_default))
# Printing confusion matrix and accuracy
print(confusion_matrix(y_test,y_pred_default))
print(accuracy_score(y_test,y_pred_default))
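The numbers in the confusion matrix and the accuracy score are directly related. As a small worked sketch (the label vectors below are made up, with 1 standing in for the positive class), accuracy, sensitivity and specificity can be derived from the four cells of a binary confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up true and predicted labels for a binary problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, sklearn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
sensitivity = tp / (tp + fn)                 # recall of the positive class
specificity = tn / (tn + fp)                 # recall of the negative class

print(accuracy, sensitivity, specificity)
print(accuracy_score(y_true, y_pred))        # matches the hand-computed accuracy
```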
Plotting the Decision Tree¶
To visualise decision trees in Python, you need to install certain external libraries. You can read about the process in detail here: http://scikit-learn.org/stable/modules/tree.html
We need the graphviz library to plot a tree.
# Importing required packages for visualization
from IPython.display import Image
# StringIO comes from the standard library; sklearn.externals.six was removed from scikit-learn
from io import StringIO
from sklearn.tree import export_graphviz
import pydotplus, graphviz
# Feature names, in the same column order as X (the data the tree is fitted on)
features = list(X.columns)
features
Note:
Python requires the pydot library and the external graphviz software to visualize the decision tree. If you are on Windows, you'll need to specify the path so that the pydot library can access the dot file from graphviz.
Please read the downloadable instructions to install graphviz. For Mac users, one way is to:
- Install the python graphviz module:
pip install graphviz
- Then install the Graphviz software itself; on a Mac you can do this using Homebrew:
- Install homebrew: https://docs.brew.sh/Installation
brew install graphviz
# If you're on Windows:
# Specifying the path to the dot executable.
# import os
# os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/graphviz-2.38/release/bin/'
# plotting the default tree (fitted with max_depth=5)
dot_data = StringIO()
export_graphviz(dt_default, out_file=dot_data,
feature_names=features, filled=True,rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
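If installing graphviz is not an option, scikit-learn also provides `sklearn.tree.plot_tree`, which draws a fitted tree with matplotlib only. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`) purely for illustration; in this lab you would pass `dt_default` and the real feature names instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (not the census data).
X_toy, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Draw the tree on a matplotlib axis; no graphviz or pydotplus required.
fig, ax = plt.subplots(figsize=(8, 4))
annotations = plot_tree(toy_tree, feature_names=feature_names,
                        filled=True, rounded=True, ax=ax)
fig.savefig("toy_tree.png")
```

The output is less polished than the graphviz rendering, but it avoids the external dependency and the PATH configuration.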
Hyperparameter Tuning¶
The default tree is quite complex, and we need to simplify it by tuning the hyperparameters.
First, let's understand the parameters in a decision tree. You can read this in the documentation using help(DecisionTreeClassifier).
- criterion (Gini/IG or entropy): It defines the function used to measure the quality of a split. Sklearn supports “gini” for the Gini index and “entropy” for information gain. By default, it takes the value “gini”.
- splitter: It defines the strategy used to choose the split at each node. It supports “best” to choose the best split and “random” to choose the best random split. By default, it takes the value “best”.
- max_features: It defines the number of features to consider when looking for the best split. It accepts an integer, a float, a string or None.
- If an integer is given, that many features are considered at each split.
- If a float is given, it is interpreted as the fraction of features considered at each split.
- If “sqrt” is given, then max_features=sqrt(n_features). Older scikit-learn versions also accepted “auto” as an alias for “sqrt”; it has been removed in recent releases.
- If “log2” is given, then max_features=log2(n_features).
- If None, then max_features=n_features. By default, it takes the value None.
- max_depth: The max_depth parameter denotes the maximum depth of the tree. It can take any integer value or None. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. By default, it takes the value None.
- min_samples_split: The minimum number of samples required to split an internal node. If an integer is given, it is used directly as the minimum number. If a float is given, it is interpreted as a fraction of the total number of samples. By default, it takes the value 2.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. If an integer is given, it is used directly as the minimum number. If a float is given, it is interpreted as a fraction of the total number of samples. By default, it takes the value 1.
- max_leaf_nodes: It defines the maximum number of possible leaf nodes. If None, the number of leaf nodes is unlimited. By default, it takes the value None.
- min_impurity_split: It defines the threshold for early stopping of tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf. This parameter was deprecated and has been removed from scikit-learn (1.0 onwards) in favour of min_impurity_decrease.
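To give the two criterion options some intuition, the impurity measures behind “gini” and “entropy” can be computed by hand for a node's class counts. This is a minimal sketch with hypothetical helper names (`gini`, `entropy`); it mirrors the textbook formulas rather than scikit-learn's internal implementation.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    """Entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A perfectly mixed node is maximally impure; a pure node has zero impurity.
for counts in ([50, 50], [75, 25], [100, 0]):
    print(counts, gini(counts), entropy(counts))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two criteria often lead to similar trees.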
Tuning max_depth¶
Let's first try to find the optimum values for max_depth and understand how the value of max_depth affects the decision tree.
Here, we evaluate max_depth values from 1 to 39 (range(1, 40)) and collect the cross-validated accuracy score for each value in a dataframe.
To reiterate, a grid search scheme consists of:
- an estimator (classifier such as SVC() or decision tree)
- a parameter space
- a method for searching or sampling candidates (optional)
- a cross-validation scheme, and
- a score function (accuracy, roc_auc etc.)
# GridSearchCV to find optimal max_depth
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# specify number of folds for k-fold CV
n_folds = 5
# parameters to build the model on
parameters = {'max_depth': range(1, 40)}
# instantiate the model
dtree = DecisionTreeClassifier(criterion = "gini",
random_state = 100)
# fit tree on training data
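# return_train_score=True is needed so that cv_results_ contains mean_train_score
tree = GridSearchCV(dtree, parameters,
                    cv=n_folds,
                    scoring="accuracy",
                    return_train_score=True)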
tree.fit(X_train, y_train)
# scores of GridSearch CV
scores = tree.cv_results_
pd.DataFrame(scores).head()
Now let's visualize how the training and test scores change with max_depth.
# plotting accuracies with max_depth
plt.figure()
plt.plot(scores["param_max_depth"],
scores["mean_train_score"],
label="training accuracy")
plt.plot(scores["param_max_depth"],
scores["mean_test_score"],
label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
You can see that as we increase the value of max_depth, both the training and test scores increase until about max_depth = 10, after which the test score gradually declines. Note that the scores are average accuracies across the 5 folds.
Thus, it is clear that the model is overfitting the training data if the max_depth is too high. Next, let's see how the model behaves with other hyperparameters.
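The depth at which the cross-validated score peaks can also be read off cv_results_ instead of the plot. The sketch below runs the same kind of search on synthetic data (the data, the depth range and the `gap` column are assumptions for illustration) and tabulates the difference between training and test accuracy, which tends to widen as the tree overfits.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the census data).
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   n_informative=4, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": range(1, 11)},
                      cv=5, scoring="accuracy",
                      return_train_score=True)
search.fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)
# Gap between mean training and mean test accuracy for each depth.
results["gap"] = results["mean_train_score"] - results["mean_test_score"]
print(results[["param_max_depth", "mean_train_score",
               "mean_test_score", "gap"]])

# Row with the highest mean cross-validated accuracy.
best_row = results.loc[results["mean_test_score"].idxmax()]
print(best_row["param_max_depth"], search.best_params_)
```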
Tuning min_samples_leaf¶
The hyperparameter min_samples_leaf indicates the minimum number of samples required to be at a leaf.
So if the value of min_samples_leaf is small, say 5, then the tree will keep splitting even when the resulting leaves have only 5, 6, etc. observations (and is likely to overfit).
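The effect of this constraint can be verified on a fitted tree by inspecting the sample counts of its leaves. The following sketch uses synthetic data and a hypothetical helper, `smallest_leaf`; it relies on the fitted tree's `tree_` attribute, where leaves are the nodes whose `children_left` entry is -1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the census data).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   random_state=0)

def smallest_leaf(min_samples_leaf):
    """Fit a tree and return (size of its smallest leaf, number of leaves)."""
    tree = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf,
                                  random_state=0).fit(X_toy, y_toy)
    is_leaf = tree.tree_.children_left == -1
    leaf_sizes = tree.tree_.n_node_samples[is_leaf]
    return int(leaf_sizes.min()), int(tree.get_n_leaves())

for msl in (1, 5, 50):
    size, n_leaves = smallest_leaf(msl)
    print(f"min_samples_leaf={msl}: smallest leaf={size}, leaves={n_leaves}")
```

With 1,000 training rows and min_samples_leaf=50, the tree can have at most 20 leaves, which is one way this parameter caps model complexity.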
Let's see what will be the optimum value for min_samples_leaf.
# GridSearchCV to find optimal min_samples_leaf
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# specify number of folds for k-fold CV
n_folds = 5
# parameters to build the model on
parameters = {'min_samples_leaf': range(5, 200, 20)}
# instantiate the model
dtree = DecisionTreeClassifier(criterion = "gini",
random_state = 100)
# fit tree on training data
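# return_train_score=True is needed so that cv_results_ contains mean_train_score
tree = GridSearchCV(dtree, parameters,
                    cv=n_folds,
                    scoring="accuracy",
                    return_train_score=True)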
tree.fit(X_train, y_train)
# scores of GridSearch CV
scores = tree.cv_results_
pd.DataFrame(scores).head()
# plotting accuracies with min_samples_leaf
plt.figure()
plt.plot(scores["param_min_samples_leaf"],
scores["mean_train_score"],
label="training accuracy")
plt.plot(scores["param_min_samples_leaf"],
scores["mean_test_score"],
label="test accuracy")
plt.xlabel("min_samples_leaf")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
You can see that at low values of min_samples_leaf, the tree gets a bit overfitted. At values > 100, however, the model becomes more stable and the training and test accuracy start to converge.
Tuning min_samples_split¶
The hyperparameter min_samples_split is the minimum number of samples required to split an internal node. Its default value is 2, which means that even a node with just 2 samples can be further divided into leaf nodes.
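It is easy to confuse min_samples_split with min_samples_leaf: the former constrains the nodes that get split, not the leaves that result. As a sketch on synthetic data (all names here are assumptions for illustration), the check below confirms that every internal node holds at least min_samples_split samples, while leaves may still be much smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the census data).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   random_state=0)

min_split = 50
tree = DecisionTreeClassifier(min_samples_split=min_split,
                              random_state=0).fit(X_toy, y_toy)

node_sizes = tree.tree_.n_node_samples
is_leaf = tree.tree_.children_left == -1   # leaves have no children

smallest_internal = int(node_sizes[~is_leaf].min())
smallest_leaf_size = int(node_sizes[is_leaf].min())

# Internal nodes respect the threshold; leaves are not bound by it.
print(smallest_internal, smallest_leaf_size)
```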
# GridSearchCV to find optimal min_samples_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
# specify number of folds for k-fold CV
n_folds = 5
# parameters to build the model on
parameters = {'min_samples_split': range(5, 200, 20)}
# instantiate the model
dtree = DecisionTreeClassifier(criterion = "gini",
random_state = 100)
# fit tree on training data
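# return_train_score=True is needed so that cv_results_ contains mean_train_score
tree = GridSearchCV(dtree, parameters,
                    cv=n_folds,
                    scoring="accuracy",
                    return_train_score=True)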
tree.fit(X_train, y_train)
# scores of GridSearch CV
scores = tree.cv_results_
pd.DataFrame(scores).head()
# plotting accuracies with min_samples_split
plt.figure()
plt.plot(scores["param_min_samples_split"],
scores["mean_train_score"],
label="training accuracy")
plt.plot(scores["param_min_samples_split"],
scores["mean_test_score"],
label="test accuracy")
plt.xlabel("min_samples_split")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
This shows that as you increase min_samples_split, the tree overfits less, since the model is less complex.
Grid Search to Find Optimal Hyperparameters¶
We can now use GridSearchCV to find multiple optimal hyperparameters together. Note that this time, we'll also specify the criterion (gini/entropy or IG).
# Create the parameter grid
param_grid = {
'max_depth': range(5, 15, 5),
'min_samples_leaf': range(50, 150, 50),
'min_samples_split': range(50, 150, 50),
'criterion': ["entropy", "gini"]
}
n_folds = 5
# Instantiate the grid search model
dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid = param_grid,
cv = n_folds, verbose = 1)
# Fit the grid search to the data
grid_search.fit(X_train,y_train)
# cv results
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results
# printing the optimal accuracy score and hyperparameters
print("best accuracy", grid_search.best_score_)
print(grid_search.best_estimator_)
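It helps to know how much work a grid implies before running it. The sketch below expands the same parameter grid with `sklearn.model_selection.ParameterGrid` and counts the candidate combinations and the resulting number of model fits for 5-fold cross-validation (the final refit on the full training set is not counted).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": range(5, 15, 5),            # 5, 10
    "min_samples_leaf": range(50, 150, 50),  # 50, 100
    "min_samples_split": range(50, 150, 50), # 50, 100
    "criterion": ["entropy", "gini"],
}
n_folds = 5

# Every combination of the listed values is one candidate model.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
print(candidates[0])
```

Because the count is the product of the number of values per parameter, adding one more value to each of the four parameters would grow the grid from 16 to 81 candidates.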
Running the model with best parameters obtained from grid search.
# model with optimal hyperparameters
clf_gini = DecisionTreeClassifier(criterion = "gini",
random_state = 100,
max_depth=10,
min_samples_leaf=50,
min_samples_split=50)
clf_gini.fit(X_train, y_train)
# accuracy score
clf_gini.score(X_test,y_test)
# plotting the tree
dot_data = StringIO()
export_graphviz(clf_gini, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
You can see that this tree is too complex to understand. Let's try reducing the max_depth and see how the tree looks.
# tree with max_depth = 3
clf_gini = DecisionTreeClassifier(criterion = "gini",
random_state = 100,
max_depth=3,
min_samples_leaf=50,
min_samples_split=50)
clf_gini.fit(X_train, y_train)
# score
print(clf_gini.score(X_test,y_test))
# plotting tree with max_depth=3
dot_data = StringIO()
export_graphviz(clf_gini, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
# classification metrics
from sklearn.metrics import classification_report,confusion_matrix
y_pred = clf_gini.predict(X_test)
print(classification_report(y_test, y_pred))
# confusion matrix
print(confusion_matrix(y_test,y_pred))