Linear SVM - Email Spam Classifier
In this section, we'll build a linear SVM classifier to classify emails into spam and ham. The dataset, taken from the UCI ML repository, contains about 4600 emails labelled as spam or ham.
The dataset can be downloaded here: https://archive.ics.uci.edu/ml/datasets/spambase
Data Understanding
Let's first load the data and understand the meaning of the attributes, the shape of the dataset, etc.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import validation_curve
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
# load the data
email_rec = pd.read_csv(r"C:\Users\sumank\Desktop\Suman_Backup_Dec18\DataScience\models\svm - Copy\Spam.txt", sep = ',', header= None )
print(email_rec.head())
As of now, the columns are named as integers. Let's manually name the columns appropriately (column names are available at the UCI website here: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names)
# renaming the columns
email_rec.columns = ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
"word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
"word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
"word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free",
"word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
"word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp",
"word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs",
"word_freq_telnet", "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85",
"word_freq_technology", "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct",
"word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re",
"word_freq_edu", "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(",
"char_freq_[", "char_freq_!", "char_freq_$", "char_freq_hash", "capital_run_length_average",
"capital_run_length_longest", "capital_run_length_total", "spam"]
print(email_rec.head())
# look at dimensions of the df
print(email_rec.shape)
# ensure that data type are correct
email_rec.info()
# there are no missing values in the dataset
email_rec.isnull().sum()
Let's also look at the fraction of spam and ham emails in the dataset.
# look at fraction of spam emails
# the mean of the 0/1 'spam' column is the fraction of spams (about 39.4%)
print(email_rec['spam'].describe())
Data Preparation
Let's now conduct some preliminary data preparation steps, i.e. rescaling the variables, splitting into train and test etc. To understand why rescaling is required, let's print the summary stats of all columns. You'll notice that the columns at the end (capital_run_length_longest, capital_run_length_total etc.) have much higher values (means = 52, 283 etc.) than most other columns, which represent the fraction of word occurrences (no. of times word appears in email/total no. of words in email).
email_rec.describe()
# splitting into X and y
X = email_rec.drop("spam", axis = 1)
y = email_rec.spam.values.astype(int)
# scaling the features
# note that the scale function standardises each column, i.e.
# x = (x - mean(x)) / std(x)
from sklearn.preprocessing import scale
X = scale(X)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4)
# confirm that splitting also has similar distribution of spam and ham
# emails
print(y_train.mean())
print(y_test.mean())
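To make the standardisation step concrete, here is a minimal sketch in plain Python of what happens to a single column. The two columns and the helper name `standardise` are hypothetical and only for illustration; the sketch assumes the population standard deviation, which is what `scale` uses.

```python
from statistics import fmean, pstdev

def standardise(column):
    """Return (x - mean) / std for each x, using the population std."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical columns on very different scales
word_freq = [0.0, 0.1, 0.2, 0.3]
run_length = [10.0, 50.0, 300.0, 1000.0]

for name, col in [("word_freq", word_freq), ("run_length", run_length)]:
    z = standardise(col)
    # after standardising, every column has mean 0 and std 1
    print(name, [round(v, 3) for v in z])
```

After this transformation both columns have mean 0 and standard deviation 1, so no single feature dominates the distance computations inside the SVM merely because of its units.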
Model Building
Let's build a linear SVM model now. The SVC() class does that in sklearn. We highly recommend reading the documentation at least once.
help(SVC)
# Model building
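# instantiate an object of class SVC()
# note that we are using cost C=1
# also note that SVC() defaults to kernel='rbf'; pass kernel='linear'
# if you want a strictly linear decision boundary
model = SVC(C = 1)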
# fit
model.fit(X_train, y_train)
# predict
y_pred = model.predict(X_test)
# Evaluate the model using confusion matrix
from sklearn import metrics
metrics.confusion_matrix(y_true=y_test, y_pred=y_pred)
# print other metrics
# accuracy
print("accuracy", metrics.accuracy_score(y_test, y_pred))
# precision
print("precision", metrics.precision_score(y_test, y_pred))
# recall/sensitivity
print("recall", metrics.recall_score(y_test, y_pred))
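# specificity (% of hams correctly classified)
# rows of the confusion matrix are true labels, columns are predictions
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print("specificity", tn/(tn+fp))
The SVM we have built so far gives reasonably good results: an accuracy of about 92% and a sensitivity/recall (TPR) of about 88%.
Interpretation of Results
In the confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal elements at (0, 0) and (1, 1) count the correctly classified hams (true negatives) and spams (true positives) respectively. Here, 811 of the 849 hams in the test set were classified correctly. Thus, it implies that: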
- 92% of all emails are classified correctly
- 88.5% of spams are identified correctly (sensitivity/recall)
- Specificity, or % of hams classified correctly, is 95%
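The relationship between the four cells of the confusion matrix and the metrics listed above can be sketched in plain Python. The labels below are hypothetical toy data, not the results of this notebook, and `confusion_counts` is an illustrative helper rather than an sklearn function.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = ham, 1 = spam)."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Hypothetical labels: 5 hams followed by 5 spams
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tn + fp + fn + tp)   # all correct / all emails
precision = tp / (tp + fp)                   # predicted spams that are spam
recall = tp / (tp + fn)                      # spams that were caught (TPR)
specificity = tn / (tn + fp)                 # hams that were kept (TNR)
print(accuracy, precision, recall, specificity)
```

Note that recall and specificity are the same calculation applied to different classes: recall is the fraction of spams recovered, specificity is the fraction of hams recovered.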
Hyperparameter Tuning
help(metrics.confusion_matrix)
K-Fold Cross Validation
Let's first run a simple k-fold cross validation to get a sense of the average metrics as computed over multiple folds. The easiest way to do cross-validation is to use the cross_val_score() function.
# creating a KFold object with 5 splits
folds = KFold(n_splits = 5, shuffle = True, random_state = 4)
# instantiating a model with cost=1
model = SVC(C = 1)
# computing the cross-validation scores
# note that the argument cv takes the 'folds' object, and
# we have specified 'accuracy' as the metric
cv_results = cross_val_score(model, X_train, y_train, cv = folds, scoring = 'accuracy')
# print 5 accuracies obtained from the 5 folds
print(cv_results)
print("mean accuracy = {}".format(cv_results.mean()))
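To see what a k-fold scheme does under the hood, here is a rough sketch that only generates the train/validation index splits. It is an illustration in plain Python, not sklearn's implementation: `k_fold_indices` is a hypothetical helper, and the exact assignment of samples to folds differs from what `KFold` would produce.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into n_splits roughly equal folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# each sample lands in the validation set exactly once
for train_idx, val_idx in k_fold_indices(n_samples=10, n_splits=5):
    print("validate on", sorted(val_idx), "train on", len(train_idx), "samples")
```

A model is fitted on each training part and scored on the matching validation part, which is why cross_val_score() returns one score per fold.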
Grid Search to Find Optimal Hyperparameter C
K-fold CV helps us compute average metrics over multiple folds, and that is the best indication of the 'test accuracy/other metric scores' we can have.
But we also want to use CV to find the optimal values of hyperparameters (in this case, the cost C is a hyperparameter). This is done using the GridSearchCV() class, which computes cross-validated metrics (such as accuracy, recall etc.) for each candidate value of the hyperparameter and reports the best one.
In this case, we have only one hyperparameter, though you can have multiple, such as C and gamma in non-linear SVMs. In that case, you need to search through a grid of multiple values of C and gamma to find the optimal combination, and hence the name GridSearchCV.
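The "grid" is simply the Cartesian product of the candidate values. The sketch below enumerates it in plain Python, borrowing the C and gamma values from the RBF grid used later in this section; it only counts the candidates and does not train anything.

```python
from itertools import product

# candidate values for each hyperparameter
param_grid = {"C": [1, 10, 100, 1000], "gamma": [1e-2, 1e-3, 1e-4]}
n_folds = 5

# every combination of C and gamma is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# each candidate is fitted once per fold
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")  # 12 candidates, 60 fits
```

With its default settings GridSearchCV additionally refits the best candidate once on the whole training set, so the cost grows quickly as hyperparameters or candidate values are added.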
# specify range of parameters (C) as a list
params = {"C": [0.1, 1, 10, 100, 1000]}
model = SVC()
# set up grid search scheme
# note that we are still using the 5 fold CV scheme we set up earlier
model_cv = GridSearchCV(estimator = model, param_grid = params,
scoring= 'accuracy',
cv = folds,
verbose = 1,
return_train_score=True)
# fit the model - it will fit 5 folds across all values of C
model_cv.fit(X_train, y_train)
# results of grid search CV
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results
To get a better sense of how training and test accuracy vary with C, let's plot the training and test accuracies against C.
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# plot of C versus train and test scores
plt.figure(figsize=(8, 6))
plt.plot(cv_results['param_C'], cv_results['mean_test_score'])
plt.plot(cv_results['param_C'], cv_results['mean_train_score'])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
Though the training accuracy monotonically increases with C, the test accuracy gradually reduces. Thus, we can conclude that higher values of C tend to overfit the model. This is because a high C value aims to classify all training examples correctly (since C is the cost of misclassification - if you impose a high cost on the model, it will avoid misclassifying any points by overfitting the data).
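The role of C as a "cost of misclassification" can be illustrated with the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below evaluates that objective for a fixed, hypothetical linear boundary and three hypothetical points; it is not how SVC solves the problem internally, and the labels are assumed to be -1/+1 rather than 0/1.

```python
def svm_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses; labels y must be -1 or +1."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # zero loss if the point is on the right side with margin >= 1
        hinge += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge

# Hypothetical data: the third point sits on the wrong side of the boundary
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
w, b = [1.0, 0.0], 0.0

for C in (0.1, 1, 10, 100):
    print("C =", C, "objective =", svm_objective(w, b, X, y, C))
```

The single misclassified point contributes a fixed hinge loss, but its weight in the objective grows linearly with C. At high C the optimiser is therefore pushed to bend the boundary around individual training points, which is the overfitting behaviour seen in the plot.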
Let's finally look at the optimal C values found by GridSearchCV.
best_score = model_cv.best_score_
best_C = model_cv.best_params_['C']
print(" The highest test accuracy is {0} at C = {1}".format(best_score, best_C))
Let's now look at the metrics corresponding to C=10.
# model with the best value of C
model = SVC(C=best_C)
# fit
model.fit(X_train, y_train)
# predict
y_pred = model.predict(X_test)
# metrics
# print other metrics
# accuracy
print("accuracy", metrics.accuracy_score(y_test, y_pred))
# precision
print("precision", metrics.precision_score(y_test, y_pred))
# recall/sensitivity
print("recall", metrics.recall_score(y_test, y_pred))
Optimising for Other Evaluation Metrics
In this case, we had optimised (tuned) the model based on overall accuracy, though that may not always be the best metric to optimise. For example, if you are concerned more about catching all spams (positives), you may want to maximise TPR or sensitivity/recall. If, on the other hand, you want to avoid classifying hams as spams (so that any important mails don't get into the spam box), you would maximise the TNR or specificity.
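Specificity is not one of sklearn's built-in scoring strings, but since it is just the recall of the negative class, one possible way to tune for it is to build a scorer with make_scorer. The name `specificity_scorer` and the toy labels below are assumptions for illustration only.

```python
from sklearn.metrics import make_scorer, recall_score

# specificity = recall computed on the negative class (ham = 0)
specificity_scorer = make_scorer(recall_score, pos_label=0)

# Hypothetical labels: 4 hams, 2 spams
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# 3 of the 4 hams are kept out of the spam box
specificity = recall_score(y_true, y_pred, pos_label=0)
print("specificity", specificity)
```

Such a scorer object can be passed as the scoring argument of GridSearchCV in place of a string like 'accuracy' or 'recall'.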
# specify params
params = {"C": [0.1, 1, 10, 100, 1000]}
# specify scores/metrics in an iterable
scores = ['accuracy', 'precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for {}".format(score))
    # set up GridSearch for score metric
    clf = GridSearchCV(SVC(),
                       params,
                       cv=folds,
                       scoring=score,
                       return_train_score=True)
    # fit
    clf.fit(X_train, y_train)
    print(" The highest {0} score is {1} at C = {2}".format(score, clf.best_score_, clf.best_params_))
    print("\n")
# using rbf kernel, C=1, default value of gamma
model = SVC(C = 1, kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# confusion matrix
confusion_matrix(y_true=y_test, y_pred=y_pred)
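For readers unfamiliar with gamma, the RBF kernel measures similarity between two points as exp(-gamma * squared distance). The following plain-Python sketch, with two hypothetical points, shows how gamma controls how quickly that similarity decays; `rbf_kernel` here is an illustrative helper, not the sklearn function of the same name.

```python
import math

def rbf_kernel(x, z, gamma):
    """Similarity exp(-gamma * ||x - z||^2) between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical points at squared distance 25
x = [0.0, 0.0]
z = [3.0, 4.0]

for gamma in (1e-4, 1e-3, 1e-2, 1.0):
    print("gamma =", gamma, "similarity =", round(rbf_kernel(x, z, gamma), 4))
```

A small gamma makes distant points still look similar, giving a smoother boundary, while a large gamma makes each training point influence only its immediate neighbourhood, which tends to produce a more complex boundary.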
# creating a KFold object with 5 splits
folds = KFold(n_splits = 5, shuffle = True, random_state = 4)
# specify range of hyperparameters
# Set the parameters by cross-validation
hyper_params = [ {'gamma': [1e-2, 1e-3, 1e-4],
'C': [1, 10, 100, 1000]}]
# specify model
model = SVC(kernel="rbf")
# set up GridSearchCV()
model_cv = GridSearchCV(estimator = model,
param_grid = hyper_params,
scoring= 'accuracy',
cv = folds,
verbose = 1,
return_train_score=True)
# fit the model
model_cv.fit(X_train, y_train)
# cv results
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results
# converting C to numeric type for plotting on x-axis
cv_results['param_C'] = cv_results['param_C'].astype('int')
# # plotting
plt.figure(figsize=(16,6))
# subplot 1/3
plt.subplot(131)
gamma_01 = cv_results[cv_results['param_gamma']==0.01]
plt.plot(gamma_01["param_C"], gamma_01["mean_test_score"])
plt.plot(gamma_01["param_C"], gamma_01["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.01")
plt.ylim([0.80, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# subplot 2/3
plt.subplot(132)
gamma_001 = cv_results[cv_results['param_gamma']==0.001]
plt.plot(gamma_001["param_C"], gamma_001["mean_test_score"])
plt.plot(gamma_001["param_C"], gamma_001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.001")
plt.ylim([0.80, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# subplot 3/3
plt.subplot(133)
gamma_0001 = cv_results[cv_results['param_gamma']==0.0001]
plt.plot(gamma_0001["param_C"], gamma_0001["mean_test_score"])
plt.plot(gamma_0001["param_C"], gamma_0001["mean_train_score"])
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.title("Gamma=0.0001")
plt.ylim([0.80, 1])
plt.legend(['test accuracy', 'train accuracy'], loc='upper left')
plt.xscale('log')
# printing the optimal accuracy score and hyperparameters
best_score = model_cv.best_score_
best_hyperparams = model_cv.best_params_
print("The best test score is {0} corresponding to hyperparameters {1}".format(best_score, best_hyperparams))
# use the optimal hyperparameters found by the grid search
# model
model = SVC(C=best_hyperparams["C"], gamma=best_hyperparams["gamma"], kernel="rbf")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# metrics
print(metrics.confusion_matrix(y_test, y_pred), "\n")
print("accuracy", metrics.accuracy_score(y_test, y_pred))
print("precision", metrics.precision_score(y_test, y_pred))
print("sensitivity/recall", metrics.recall_score(y_test, y_pred))
Thus, you can see that the optimal value of the hyperparameter varies significantly with the choice of evaluation metric.