Linear Regression


What is Linear Regression?


Linear Regression is the most basic form of Regression. It is too simple to be of much use in modern machine learning applications, but it is the best way to understand the core concepts. This is how it works:
Consider a very simple world, where a person's salary depends only on years of experience and performance. If someone wants to understand this relation, this is what one would do:
  1. Collect information across the industry - about various samples of salary and experience and the performance measured in terms of successful deliveries.
  2. Try to identify the relation in this data.
  3. Test this relation with some more data and make the required changes.
  4. Repeat step 3 till you are satisfied with the results.
  5. Conclude that the relation is established and use it going forward, for making predictions about expected salary.
Now let's look at this process in more detail. Assume you have successfully gathered the required data - that is the first step. Linear Regression helps you with steps 2, 3 and 4. You start by assuming some 'Linear' relation:
Salary = a * experience + b * performance + c
Here, Salary is the output, and Experience and Performance are the inputs. a, b and c are the parameters. Now, the question boils down to identifying the values of a, b and c. We all know there is some relation between the inputs and the output. But identifying that relation through logical deduction alone is almost impossible - it requires far too much analysis. Regression helps you identify this relation - not through logic, but by learning.
Learning begins with a hypothesis. You propose a particular solution to the problem - based on what you already know, or just a random guess. Naturally, a hypothesis based on previous knowledge speeds things up. But when you know nothing about the topic, a random guess is not bad either. So, we start with some hypothesis:
Salary = a0 * experience + b0 * performance + c0
Here, a0, b0 and c0 are arbitrary random numbers that we start with. We choose random numbers instead of zeros because that gives us a slightly more relevant starting point. Of course, this is not the correct solution. It is just a hypothesis - one that we will verify and refine as we move on. In fact, a real-life problem will almost never have a perfect solution in a numerical equation. We call a solution good if the error is minimal or, perhaps, just below a given threshold - good enough for our purpose. Now, we test this hypothesis. There are many ways to test a hypothesis. Brute force is the easiest to understand, so we will start with that. We evaluate every record in the 'Training Set' to identify the error, and then compute the 'Mean Square Error' - a measure of how good or bad our hypothesis is.
Error = Sum((a0 * e(i) + b0 * p(i) + c0 - s(i))**2) / N
Here, e(i), p(i) and s(i) are the experience, performance and salary in the i-th training record, and N is the number of samples.
This is the most basic formula for calculating the error. As you go along, you will come across more meaningful and more complicated ways of calculating it.
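To make this concrete, here is a minimal sketch of that brute-force check, assuming the training data is held in NumPy arrays e, p and s (the sample values and the random seed are purely illustrative):

import numpy as np

# Hypothetical training data: experience (years), performance (deliveries), salary (thousands)
e = np.array([2.0, 5.0, 7.0, 10.0])
p = np.array([3.0, 6.0, 8.0, 12.0])
s = np.array([29.0, 53.0, 69.0, 96.0])

# A random initial hypothesis: a0, b0, c0
rng = np.random.default_rng(0)
a0, b0, c0 = rng.random(3)

# Predict the salary for every training record with the current hypothesis
predicted = a0 * e + b0 * p + c0

# Mean Square Error of the hypothesis over the training set
error = np.mean((predicted - s) ** 2)
print(error)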
This error tells us how bad our hypothesis is. If we are extremely lucky, it could be 0 right away. But Murphy says that is not the case. So we have to adjust our parameters - a, b, c - to a1, b1, c1. We go through this correction again to get a2, b2, c2, and so on... till we get values of a, b and c such that the error is within the threshold we need - and the problem is solved! That sounds too simple. But how do we get a1, b1, c1 or a2, b2, c2...? The error tells us that our hypothesis is not correct. Not just that, it also guides us with the direction and the amount of change required in each parameter. This is identified using partial derivatives. Remember high school calculus? Machine Learning is full of calculus and linear algebra. Check out this Calculus Tutorial if you would like a refresher.
The derivative shows us the direction in which we need to move and by how much. The higher the derivative, the further away we are from the ideal, and the larger the steps we need to take towards it. The ideal point is where the derivatives are 0 - that is where the error is at its minimum possible level. The mean square error is very high when the values of a, b, c are much too low and also when they are much too high, so the optimum lies somewhere in between. Since our hypothesis is a linear equation, the mean square error is a quadratic function of the parameters - with a single minimum. That simplifies our task of finding it. We calculate the partial derivatives of the error expression with respect to a, b and c - call them da, db and dc. Based on these, we pick the next values:
a1 = a0 - alpha * da

b1 = b0 - alpha * db

c1 = c0 - alpha * dc
Here alpha is the learning rate. It is some positive number that we choose based on our experience with machine learning. We will come across many such hyperparameters in the machine learning domain. Choosing the right hyperparameters is very important - they often make the difference between a successful and an unsuccessful machine learning project. The partial derivative of f(x, y, z) with respect to x can be approximated as
(f(x + d, y, z) - f(x - d, y, z)) / (2 * d)    # for a very small value of d
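As a small sketch, this is what that central-difference approximation looks like in code, applied to our error function (the array arguments e, p, s and the step size d are illustrative assumptions):

import numpy as np

def error(a, b, c, e, p, s):
    # Mean square error of the hypothesis (a, b, c) over the training set
    return np.mean((a * e + b * p + c - s) ** 2)

def d_error_da(a, b, c, e, p, s, d=1e-6):
    # Central difference: (f(a + d) - f(a - d)) / (2 * d)
    return (error(a + d, b, c, e, p, s) - error(a - d, b, c, e, p, s)) / (2 * d)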
Applying this to our error expression, we get

da = 2 * Sum(e(i) * (a * e(i) + b * p(i) + c - s(i))) / N

db = 2 * Sum(p(i) * (a * e(i) + b * p(i) + c - s(i))) / N

dc = 2 * Sum(a * e(i) + b * p(i) + c - s(i)) / N

Thus, the correction for a at step j is

a(j) = a(j-1) - alpha * 2 * Sum(e(i) * (a(j-1) * e(i) + b(j-1) * p(i) + c(j-1) - s(i))) / N

and similarly for b and c.
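Putting the whole thing together, here is a minimal sketch of the complete loop - forward pass, error, derivatives, correction - under the same illustrative assumptions as before (made-up training data, an arbitrary learning rate and a fixed number of iterations):

import numpy as np

# Hypothetical training data: experience (years), performance (deliveries), salary (thousands)
e = np.array([2.0, 5.0, 7.0, 10.0])
p = np.array([3.0, 6.0, 8.0, 12.0])
s = np.array([29.0, 53.0, 69.0, 96.0])
N = len(s)

rng = np.random.default_rng(0)
a, b, c = rng.random(3)          # random initial hypothesis a0, b0, c0
alpha = 0.001                    # learning rate - a hyperparameter we choose

for epoch in range(5000):
    # Forward propagation: predict the salaries with the current parameters
    predicted = a * e + b * p + c
    residual = predicted - s

    # Partial derivatives of the mean square error with respect to a, b and c
    da = 2 * np.sum(e * residual) / N
    db = 2 * np.sum(p * residual) / N
    dc = 2 * np.sum(residual) / N

    # Backward propagation: correct the parameters
    a = a - alpha * da
    b = b - alpha * db
    c = c - alpha * dc

print(a, b, c, np.mean((a * e + b * p + c - s) ** 2))

With each pass through the loop, the error shrinks; in practice you would stop as soon as it falls below your chosen threshold instead of running a fixed number of iterations.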
The process of predicting the salaries based on the current set of parameters is called forward propagation. The process of correcting the parameters based on the error is called backward propagation. One complete cycle of forward and backward propagation is called an epoch. If the hyperparameters are chosen well, each epoch takes us closer to the optimum. This process is called Regression; since we used a linear hypothesis, we call it Linear Regression. As we train the parameters through several epochs, the error can drop below the threshold. But this is only the beginning. We need to do a lot more to get things working - we can run into problems like overfitting, underfitting or local optima.
Obviously, very few real-life applications are simple enough to fit a straight line. Most of them need a lot more than this. For those we have Polynomial Regression, Logistic Regression, Neural Networks and many other types of input-output relations that can be used as a model for the hypothesis. But the core concept of regression remains the same. Of course, the calculations get more and more complex along with the hypothesis. But don't worry about that! I showed the elaborate calculations only to give you a feel of what goes on in the learning process. Generous developers have built tons of open source libraries (Scikit-learn, TensorFlow, Keras...) that can do all this and a lot more - in just a few lines of code.

Python Implementation

Scikit-learn provides an easy implementation of Linear Regression. It also provides built-in data sets for trying out the algorithms. For this algorithm, let us check out the diabetes dataset. It contains ten physiological measurements for each patient, and the target to predict is a quantitative measure of how the disease progressed one year later.
We start by importing the relevant modules
import numpy as np 
from sklearn import datasets, linear_model 
Next we load the data set
diabetes = datasets.load_diabetes()
We can get a glimpse of the contents of the dataset
In [1]: diabetes.data[:3]
Out[1]: 
array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034]])
In [2]: diabetes.target[:3]
Out[2]: array([151.,  75., 141.])
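If you are curious what the ten input columns stand for, the dataset object also exposes the feature names and the array shape (a quick inspection, not required for the regression itself):

print(diabetes.data.shape)       # (442, 10) - 442 samples, 10 features each
print(diabetes.feature_names)    # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']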
Next we instantiate the Linear Regression model
regr = linear_model.LinearRegression()
Now we need to train this model with the available data, and then test it to verify its correctness. To do this, we split the data into train and test sets, holding back the last 20 records for testing:
X_train = diabetes.data[:-20]
y_train = diabetes.target[:-20]
X_test = diabetes.data[-20:]
y_test = diabetes.target[-20:]
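An equivalent approach is to let scikit-learn pick the test records at random with train_test_split; the test size and random seed below are arbitrary choices:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=20, random_state=0)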
This gives us two sets of data. Now we can fit the model to the training data.
regr.fit(X_train, y_train)
To check what we have done, we can measure how well the model performs. Scikit-learn gives us an easy method for that - score() returns the coefficient of determination (R²) of the predictions.
In [4]: regr.score(X_test, y_test)
Out[4]: 0.5850753022690571

In [5]: regr.score(X_train, y_train)
Out[5]: 0.5122262147793499
This is rather poor performance, and one cannot expect much better from plain linear regression. But it is good enough to show how the whole process works.
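We can also look at the actual predictions and at the parameters the model learned; mean_squared_error below is scikit-learn's counterpart of the error formula we worked out by hand:

from sklearn.metrics import mean_squared_error

y_pred = regr.predict(X_test)               # predictions for the held-out records
print(mean_squared_error(y_test, y_pred))   # mean square error on the test set
print(regr.coef_)                           # learned weights - the a, b, ... for each feature
print(regr.intercept_)                      # learned bias - our c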