Logistic Regression with scikit-learn

Introduction

Logistic Regression is an important concept in supervised machine learning because it lets you apply the powerful techniques of regression-based learning to classification problems. Regression algorithms typically deal with data where both the input and the output are continuous. Logistic Regression extends them to binary classification by introducing an activation function that maps the continuous linear output to a value between 0 and 1. There are several different activation functions - Sigmoid, Tanh and ReLU are the most popular - and Logistic Regression uses the Sigmoid. Essentially, the Sigmoid is a continuous function that produces "near binary" output: it is nearly flat and close to 0 for inputs much less than 0, nearly flat and close to 1 for inputs much greater than 0, and has a strong gradient around 0. This approximates a binary classifier very well even though, mathematically, the function is continuous - which is exactly why we can still use the powerful techniques of regression.
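As a quick illustration, here is a minimal sketch (assuming only NumPy; the values and function name are illustrative, not part of scikit-learn's API) of how the Sigmoid squashes a few linear outputs into the (0, 1) range:
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Linear outputs far below 0, around 0, and far above 0
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))
# Inputs well below 0 map very close to 0, inputs well above 0 map very
# close to 1, and the transition happens sharply around 0.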

Python Implementation

The first step is to import the required packages:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Next, load the breast cancer data set built into scikit-learn:
cancer = load_breast_cancer()
Now, split the data set into the training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
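The stratify=cancer.target argument keeps the proportion of benign and malignant samples roughly the same in both splits. If you want to confirm that, a quick sanity check like the following works (this is an optional extra, not part of the original walkthrough):
print('Overall class balance: ', np.bincount(cancer.target) / len(cancer.target))
print('Training class balance:', np.bincount(y_train) / len(y_train))
print('Test class balance:    ', np.bincount(y_test) / len(y_test))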
Finally, create a LogisticRegression object and fit the model to the training data - using the default parameters to start with.
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Now, verify the performance of the model on the training and test sets:
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))
That looks quite good!
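Accuracy alone can hide class-specific errors. For a closer look, scikit-learn's classification_report shows per-class precision and recall; this is an optional extra step, not part of the original walkthrough:
from sklearn.metrics import classification_report

# Inspect per-class precision, recall and F1 on the test set
y_pred = log_reg.predict(X_test)
print(classification_report(y_test, y_pred, target_names=cancer.target_names))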

Regularization

Overfitting has always been a serious problem in supervised learning. If the model underfits the data, you can usually spot it quickly; but if it overfits - performing well on the training data while generalizing poorly - the situation can be really tricky. Regularization is one of the common ways to avoid overfitting: it penalizes large weights so that no individual weight grows too high. There are two main types of regularization, based on how the weights are penalized:

L1 Regularization

Here, the weights are penalized based on their absolute values (L1 because it uses the values themselves, not their squares as in L2). L1 regularization tends to produce a sparse model by driving the weights of less important features to zero. The resulting model may not fit quite as well as an L2 model, but because it discards many features it is smaller, faster to evaluate and easier to interpret.

L2 Regularization

Here, the weights are penalized based on the sum of their squares rather than their absolute values (hence L2). L2 regularization tends to produce a dense model by shrinking all the weights towards zero without eliminating any of them. The model often fits a little better, but since every feature is retained it is larger and slower to evaluate.
scikit-learn uses L2 by default.
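To build some intuition for why L1 prefers sparse solutions, here is a tiny NumPy sketch (purely illustrative, with made-up weight vectors) comparing the two penalty terms:
dense = np.array([0.5, 0.5, 0.5, 0.5])   # many small weights
sparse = np.array([2.0, 0.0, 0.0, 0.0])  # one large weight, rest zero

for name, w in [('dense', dense), ('sparse', sparse)]:
    l1 = np.sum(np.abs(w))   # L1 penalty: sum of absolute values
    l2 = np.sum(w ** 2)      # L2 penalty: sum of squares
    print(name, 'L1 =', l1, 'L2 =', l2)

# Both vectors have the same L1 penalty (2.0), but the sparse vector's L2
# penalty (4.0) is four times the dense one's (1.0). Squaring punishes large
# individual weights heavily, so L2 prefers many small weights, while L1 is
# indifferent and ends up zeroing out the unimportant ones.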

Comparison of L1 and L2 Regularization

# liblinear supports both the L1 and L2 penalties
# (newer scikit-learn versions default to a solver that only supports L2)
lr1 = LogisticRegression(C=0.01, penalty='l1', solver='liblinear', tol=0.01)
lr2 = LogisticRegression(C=0.01, penalty='l2', solver='liblinear', tol=0.01)

lr1.fit(X_train, y_train)
lr2.fit(X_train, y_train)

print('L1: Accuracy on the training subset: {:.3f}'.format(lr1.score(X_train, y_train)))
print('L1: Accuracy on the test subset: {:.3f}'.format(lr1.score(X_test, y_test)))
print('L2: Accuracy on the training subset: {:.3f}'.format(lr2.score(X_train, y_train)))
print('L2: Accuracy on the test subset: {:.3f}'.format(lr2.score(X_test, y_test)))
L1: Accuracy on the training subset: 0.918
L1: Accuracy on the test subset: 0.930
L2: Accuracy on the training subset: 0.925
L2: Accuracy on the test subset: 0.930
The accuracy is quite similar. But if you check the models' coefficients, you will see a major difference:
print(np.mean(lr1.coef_.ravel() == 0))
print(np.mean(lr2.coef_.ravel() == 0))
0.8666666666666667
0.0
You can see that almost 87% of the weights are zeroed out by L1 regularization when C is set to 0.01. Now let us try a much higher value of C: 100. In scikit-learn, C is the inverse of the regularization strength, so a high C means very little penalty.
lr1 = LogisticRegression(C=100, penalty='l1', solver='liblinear', tol=0.01)
lr2 = LogisticRegression(C=100, penalty='l2', solver='liblinear', tol=0.01)

lr1.fit(X_train, y_train)
lr2.fit(X_train, y_train)

print('L1: Accuracy on the training subset: {:.3f}'.format(lr1.score(X_train, y_train)))
print('L1: Accuracy on the test subset: {:.3f}'.format(lr1.score(X_test, y_test)))
print('L2: Accuracy on the training subset: {:.3f}'.format(lr2.score(X_train, y_train)))
print('L2: Accuracy on the test subset: {:.3f}'.format(lr2.score(X_test, y_test)))
L1: Accuracy on the training subset: 0.984
L1: Accuracy on the test subset: 0.979
L2: Accuracy on the training subset: 0.927
L2: Accuracy on the test subset: 0.930
In this case, the accuracy on both the training and the test set increased when the penalty was reduced - perhaps because the data suits the model well and there is very little overfitting to begin with. And when you look at the coefficients themselves:
print(np.mean(lr1.coef_.ravel() == 0))
print(np.mean(lr2.coef_.ravel() == 0))
0.10000000000000001
0.0
Only 10% of the weights in the L1 model are zeroed out.
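If you want to see how sparsity and accuracy evolve between these two extremes, a small sweep over C values ties the comparison together. This is an optional extension of the example above, not part of the original walkthrough:
for C in [0.01, 0.1, 1, 10, 100]:
    lr = LogisticRegression(C=C, penalty='l1', solver='liblinear', tol=0.01)
    lr.fit(X_train, y_train)
    zero_frac = np.mean(lr.coef_.ravel() == 0)
    print('C={:<6} zero weights: {:.0%}  test accuracy: {:.3f}'.format(
        C, zero_frac, lr.score(X_test, y_test)))
# As C grows (weaker penalty), fewer weights are driven to zero and the
# model uses more of the features.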
