Supervised Learning & Regression


Supervised learning is probably the most fundamental concept in Machine Learning, and Linear Regression is its most basic implementation. It is too simple to be of much use in modern machine learning applications, but it is the best way to understand the basics.
This is how it works. Consider a very simple world, where the salary of a person is determined by years of experience and performance. If someone wants to understand this relation, this is what one would do:
  1. Collect information across the industry - various samples of salary and experience, with performance measured in terms of successful deliveries.
  2. Try to identify the relation between them.
  3. Test this relation with some more data and make the required changes.
  4. Repeat step 3 till you are satisfied with the results.
  5. Conclude that the relation is established, and use it going forward for making predictions about the expected salary.
Now let's check this process in more detail. Assume you have successfully gathered the required data - that is the first step. Linear Regression then helps you with steps 2, 3 and 4. You start by assuming some 'Linear' relation.
Salary = a * experience + b * performance + c
Here, Salary is the output; Experience and Performance are the inputs; a, b and c are the parameters. The question now boils down to identifying the values of a, b and c. We know there is some relation between the inputs and the outputs, but identifying it through logical deduction alone is almost impossible - it would require exhaustive analysis. Regression helps you identify this relation - not through logic, but through learning.
Learning begins with a hypothesis: a proposed solution to the problem, based on what you already know, or just a random guess. Naturally, a hypothesis based on previous knowledge speeds things up. But when you know nothing about the topic, a random guess is not bad either. So, we start with some hypothesis
Salary = a0 * experience + b0 * performance + c0
Here, a0, b0 and c0 are arbitrary random numbers that we start with. We choose random numbers rather than zeros because a random initialization generally gives the learning process a better starting point. Of course, this is not the correct solution; it is just a hypothesis - one we will verify and refine as we move on. In fact, a real-life problem will almost never have a perfect solution in a numerical equation. We call a solution good if its error is minimal - or at least below a given threshold that is good enough for our purpose. Now, we test this hypothesis. There are many ways to test a hypothesis. Brute force is the easiest to understand, so we will start with that. We evaluate each record in the 'Training Set' to identify the error, and then compute the 'Mean Square Error' - a measure of how good or bad our hypothesis is.
Error = Sum((a0 * e(i) + b0 * p(i) + c0 - s(i))**2) / (number of samples)
This is the most basic formula for calculating the error. As you go along, you will come across more meaningful and complicated ways of calculating the error.
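As a sketch, here is that brute-force error calculation in Python. The sample values below are made up purely for illustration:

```python
# Mean square error of a hypothesis (a, b, c) over a small sample set.
# The data values here are made up purely for illustration.
experience = [2.0, 5.0, 10.0]    # e(i) - years of experience
performance = [3.0, 4.0, 8.0]    # p(i) - successful deliveries
salary = [30.0, 50.0, 95.0]      # s(i) - observed salary

def mean_square_error(a, b, c):
    errors = [(a * e + b * p + c - s) ** 2
              for e, p, s in zip(experience, performance, salary)]
    return sum(errors) / len(errors)

# A perfect hypothesis gives an error of 0; a random guess will not.
print(mean_square_error(5.0, 2.0, 10.0))  # → 142.0
```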
This error tells us how bad our hypothesis is. If we are extremely lucky, it could be 0 right away - but Murphy says that is not the case. So we have to adjust our parameters a, b, c to a1, b1, c1. We go through this correction again to get a2, b2, c2, and so on, till we get values of a, b and c such that the error is within the threshold we need - and the problem is solved! That sounds too simple. But how do we get a1, b1, c1 or a2, b2, c2? The error tells us that our hypothesis is not correct. Not just that - it also guides us with the direction and amount of change required for each parameter. This is identified using partial derivatives. Remember high school calculus? Machine Learning is full of calculus and linear algebra. Check out this Calculus Tutorial if you would like a refresher.
The derivative shows us the direction in which to move, and by how much. The larger the derivative, the further away the optimum is, and the larger the step we should take towards it. The ideal point is where the derivatives are 0 - that is where the error is at its minimum possible level. The mean square error is very high when the values of a, b, c are very low, and also when they are very high; the optimum lies somewhere in between. Since our hypothesis is a linear equation, the mean square error is a quadratic function - with a single minimum. That simplifies the task of finding it. We calculate the partial derivatives of the error expression with respect to a, b and c - call them da, db and dc. Based on these, we pick the next values:
a1 = a0 - alpha * da
b1 = b0 - alpha * db
c1 = c0 - alpha * dc
Here, alpha is the learning rate. It is a positive number that we choose based on our experience with machine learning. There are many such hyperparameters in the machine learning domain, and choosing them well is very important - hyperparameters often make the difference between successful and unsuccessful machine learning projects. The partial derivative of f(x, y, z) with respect to x can be approximated as
(f(x + d, y, z) - f(x - d, y, z)) / (2 * d)    # for a very small value of d
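The central-difference formula above can be wrapped into a small helper. The function f below is just an illustrative example:

```python
def partial_derivative(f, args, index, d=1e-6):
    """Central-difference estimate of the partial derivative of f with
    respect to the argument at position `index`."""
    plus = list(args)
    plus[index] += d
    minus = list(args)
    minus[index] -= d
    return (f(*plus) - f(*minus)) / (2 * d)

# Example with f(x, y, z) = x**2 + 3*y + z:
f = lambda x, y, z: x ** 2 + 3 * y + z
print(partial_derivative(f, [2.0, 1.0, 1.0], 0))  # df/dx = 2x, so ≈ 4.0
```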
Applying this to our error expression (or differentiating it analytically),
da = 2 * sum(e(i) * (a * e(i) + b * p(i) + c - s(i))) / N
and so the update rule for a becomes
a(j) = a(j-1) - alpha * 2 * sum(e(i) * (a(j-1) * e(i) + b(j-1) * p(i) + c(j-1) - s(i))) / N
The process of predicting the salaries based on the current set of parameters is called forward propagation. The process of correcting the parameters based on the error is called backward propagation. One complete cycle of forward and backward propagation is called an epoch. Each epoch should take us closer to the optimum - if the hyperparameters are correctly chosen. This process is called regression; since we used a linear hypothesis, we call it Linear Regression. As we train the parameters over several epochs, the error can drop below the threshold. But this is only the beginning - we need to do a lot more to get things working. We can run into problems like overfitting, underfitting or a local optimum.
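The full cycle can be sketched in plain Python. The data set, starting values, learning rate and epoch count below are all made up for illustration:

```python
# One epoch = forward propagation (predict and measure the error)
# followed by backward propagation (update a, b, c from the gradients).
# The sample data, initial values and learning rate are illustrative only.
experience = [2.0, 5.0, 10.0]
performance = [3.0, 4.0, 8.0]
salary = [30.0, 50.0, 95.0]

a, b, c = 0.5, 0.5, 0.5   # an arbitrary starting hypothesis
alpha = 0.01              # learning rate (a hyperparameter)
N = len(salary)

for epoch in range(60000):
    # Forward propagation: residual of each prediction.
    residuals = [a * e + b * p + c - s
                 for e, p, s in zip(experience, performance, salary)]
    # Backward propagation: partial derivatives of the mean square error.
    da = 2 * sum(r * e for r, e in zip(residuals, experience)) / N
    db = 2 * sum(r * p for r, p in zip(residuals, performance)) / N
    dc = 2 * sum(residuals) / N
    a, b, c = a - alpha * da, b - alpha * db, c - alpha * dc

error = sum(r * r for r in residuals) / N
print(round(error, 6))  # close to 0 after enough epochs
```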
Obviously, very few real-life applications are simple enough to fit a straight line; most need a lot more than this. For those, we have Polynomial Regression, Logistic Regression, Neural Networks and many other types of input-output relations that can be used as a model for the hypothesis. But the core concept of regression remains the same. Of course, the calculations get more and more complex along with the hypothesis - but don't worry about that! I showed the elaborate calculations only to give you a feel of what goes on in the learning process. Generous developers have built tons of open source libraries (Scikit-learn, TensorFlow, Keras...) that can do all this and a lot more - in just a few lines of code.
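For example, Scikit-learn's LinearRegression fits the same kind of model in a few lines. The numbers below are made up for illustration:

```python
# The same linear regression in a few lines with scikit-learn.
# The data values here are made up purely for illustration.
from sklearn.linear_model import LinearRegression

X = [[2, 3], [5, 4], [10, 8]]   # [experience, performance] per sample
y = [30, 50, 95]                # salary

model = LinearRegression().fit(X, y)
print(model.predict([[7, 6]]))  # predicted salary for a new sample
```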

Important Terms

As discussed above, supervised learning iteratively identifies the relation between the inputs and outputs. This process is formally described by the following terms.


Hypothesis

The Hypothesis is the assumed relation between the input and output. This hypothesis is verified and improved on each iteration.


Weights

The hypothesis relation is defined by parameters that can be altered to make it fit the given data set. For example, in the expression above, a, b and c were the weights. Typically these are denoted by θ0, θ1 ...

Cost Function

The cost of a hypothesis is a measure of the overall gap between the actual expected output and the output calculated by the hypothesis. This cost is naturally related to the weights that define the hypothesis. The cost function is the formal definition of this cost as a function of those weights. Given the cost function, a good hypothesis is one with minimal cost.


Forward and Backward Propagation

Typically, the regression process of supervised learning begins with proposing a relation between the input and output. This relation could be as simple as a linear equation, or it could be a polynomial or even a massive neural network. Each such relation has many parameters; for example, a linear equation y = ax + b has parameters a and b - which define the equation. The kind of relation proposed depends upon the analysis of the situation, the availability of data and, of course, the experience of the developer. Once a good model is defined, the next step is to identify the values of these parameters.
Thus, the process of regression consists of multiple iterations of forward propagation and backward propagation: first finding the error based on the current weights (forward propagation), then updating the weights based on the last calculated error (backward propagation). This is the essence of regression - the most fundamental concept in machine learning.

Gradient Descent

Gradient descent is one of the most popular methods for identifying the parameters of a learning model. Essentially, gradient descent consists of measuring the error of the model on the available data, and then gradually updating the parameters in the way that gives the best error reduction at every step. This can be visualized as a ball rolling down a curved surface: at every point, it moves in the direction that gives the steepest reduction in its height. In order to perform gradient descent, we need a good measure of the error function.

Stochastic Gradient Descent

In traditional gradient descent, we run forward and backward propagation over the entire data set to compute the error function, and then iteratively optimize the model. If this is computationally expensive, why not run through one data point at a time? Stochastic gradient descent trains on one data sample at a time. Its learning curve is not as clean as that of the classical gradient descent algorithm, but it converges very quickly if the hyperparameters are chosen well. On the other hand, stochastic gradient descent can be disastrous if we get stuck with the wrong hyperparameters.
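A sketch of the idea, reusing made-up salary data; the learning rate and step count are illustrative only:

```python
import random

# Stochastic gradient descent: update a, b and c after every single
# sample. The data and learning rate below are made up for illustration.
data = [(2.0, 3.0, 30.0), (5.0, 4.0, 50.0), (10.0, 8.0, 95.0)]  # (e, p, s)
a, b, c = 0.0, 0.0, 0.0
alpha = 0.003

random.seed(0)
for step in range(100000):
    e, p, s = random.choice(data)   # train on one sample at a time
    r = a * e + b * p + c - s       # residual for this sample only
    a -= alpha * 2 * r * e          # per-sample gradient updates
    b -= alpha * 2 * r * p
    c -= alpha * 2 * r

mse = sum((a * e + b * p + c - s) ** 2 for e, p, s in data) / len(data)
```

Note how the updates are noisy - each step sees only one sample - yet the overall error still falls.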
A major advantage appears when we do not have all the data available right away - for example, when data streams into the system over the internet.

Mini Batch Gradient Descent

Stochastic gradient descent is one extreme, just as batch gradient descent is the other - and we all know extremes are limited. Mini batch gradient descent tries to merge the benefits of both by processing the data in small batches.
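A sketch with a slightly larger made-up data set, a batch size of 2, and one parameter update per batch:

```python
import random

# Mini batch gradient descent: one parameter update per small batch.
# The data, batch size and learning rate are illustrative only.
data = [(2.0, 3.0, 30.0), (5.0, 4.0, 50.0), (10.0, 8.0, 95.0),
        (4.0, 2.0, 35.0), (8.0, 6.0, 75.0), (1.0, 1.0, 15.0)]   # (e, p, s)
a, b, c = 0.0, 0.0, 0.0
alpha, batch_size = 0.004, 2

random.seed(1)
for epoch in range(20000):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the per-sample gradients over the batch, then update once.
        da = sum(2 * (a * e + b * p + c - s) * e for e, p, s in batch) / len(batch)
        db = sum(2 * (a * e + b * p + c - s) * p for e, p, s in batch) / len(batch)
        dc = sum(2 * (a * e + b * p + c - s) for e, p, s in batch) / len(batch)
        a, b, c = a - alpha * da, b - alpha * db, c - alpha * dc

mse = sum((a * e + b * p + c - s) ** 2 for e, p, s in data) / len(data)
```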

Error Function

This is an important part of the story. The efficiency of the gradient descent method depends heavily on the function that we use to represent the error for the proposed parameters.
  • Mean Square Error - used for continuous regression
  • Cross Entropy - used for classification
The cost function can be seen as the cumulative of the error function over the data set.
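As a sketch, these two error functions can be written as follows; the prediction and label values are made up:

```python
import math

# Two common error functions, over a batch of predictions.
def mean_square_error(predictions, targets):
    # Used for continuous regression outputs.
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, labels):
    # Used for classification: labels are 0/1, probabilities in (0, 1).
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(predicted_probs, labels)) / len(labels)

print(mean_square_error([2.5, 0.0], [3.0, 1.0]))  # → 0.625
print(cross_entropy([0.9, 0.2], [1, 0]))
```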


Underfitting

Any amount of effort on minimizing the cost function will not help if the hypothesis is not rich enough. For example, if you try to fit complex data with a simple quadratic expression, that just can't work. This scenario is called underfitting.


Overfitting

It is also possible that the hypothesis is excessively rich. The hypothesis may fit the available dataset perfectly, yet curve wildly between those points - making it very bad for any data outside the input set. Such a hypothesis will seem perfect while training, but will make absurd predictions when tested on another dataset.

Avoid Overfitting

Overfitting is one major problem that can lead to disappointment after a lot of effort. Hence, it is important to eliminate the possibility of overfitting.

Train set, dev set, test set

The input data is split into 3 parts. The training set is used to train the hypothesis. Once this is done, the trained hypothesis is verified on the dev set. If there is no underfitting or overfitting, the cost on the dev set should be similar to what was observed on the training set. If the cost on the dev set is much higher, the hypothesis is overfitting the training set. If either cost is much higher than acceptable, the hypothesis underfits the available data - in that case, we need to try a richer hypothesis. Once the hypothesis is finalized by working on the train/dev sets, it should be tested on the test set - to make sure all is well. The golden rule for splitting the input data into these three parts used to be 60:20:20. But lately, we have seen a huge burst in the available input data. When we have millions of records, it makes no sense to push 40% of them into the dev/test sets; most of the data goes into the training set, as long as we keep a few thousand records each for the dev and test sets.
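A minimal sketch of the classic 60:20:20 split, using placeholder records:

```python
import random

# A 60/20/20 split of the input data into train, dev and test sets.
# The records here are just placeholder numbers.
records = list(range(100))
random.seed(0)
random.shuffle(records)   # shuffle before splitting

n = len(records)
train = records[: int(0.6 * n)]
dev = records[int(0.6 * n): int(0.8 * n)]
test = records[int(0.8 * n):]
print(len(train), len(dev), len(test))  # → 60 20 20
```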

Cross Validation

Cross validation takes this a step further. Divide the available data into N folds; then, in turn, hold out each fold as the test set and train on the remaining data. Thus we run the process N times - to get a much more reliable result. Of course, this also requires a lot more processing.
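The N-fold process can be sketched as follows; `evaluate` is a hypothetical stand-in for one complete train-and-score run:

```python
# N-fold cross validation: each fold serves once as the held-out set.
# `evaluate(training, held_out)` is a hypothetical stand-in for one
# full train/score cycle returning a score.
def cross_validate(data, n_folds, evaluate):
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(evaluate(training, held_out))
    return sum(scores) / n_folds   # average score over the N runs
```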

Weight Penalty

A common symptom of an overfitting curve is that some of the higher weights are too large. "Large" and "small" are vague terms; but for any two models with similar training error, one can reasonably guess that the one with larger values for the higher weights is overfitted. In other words, for the same training error, one can consider the "error" to be higher in the model with larger weights. To account for this, we can make a small change to the error function and add a component that penalizes large weights. The gradient descent will then naturally move towards the model with smaller weights - thus avoiding overfitting.
Depending upon how the weight penalty is calculated, we have two types - L1 and L2. The L1 penalty is calculated by summing the absolute values of the weights; the L2 penalty is calculated from the sum of their squares. Each has its own peculiarities and advantages. We cannot use simple derivatives with the L1 penalty - so it does not fit directly into plain gradient methods; it requires solving a convex optimization problem. But it has some major advantages: it drives some weights exactly to zero, and so it learns sparse models. This is particularly useful when we have too many features - regularization based on the L1 penalty helps us identify the important ones. The L2 penalty, being differentiable, fits gracefully into the gradient descent algorithm.
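As a sketch, the two penalties can be added to the error like this; the `lam` coefficient and the weight values are illustrative:

```python
# Adding a weight penalty to the error. `lam` controls how strongly
# large weights are punished; the values here are illustrative only.
def l1_penalty(weights, lam):
    # Sum of absolute values - drives some weights exactly to zero.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # Sum of squares - differentiable, so it fits gradient descent.
    return lam * sum(w * w for w in weights)

def penalized_error(base_error, weights, lam, kind="l2"):
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return base_error + penalty

print(penalized_error(10.0, [3, -4], 0.1, kind="l1"))  # → 10.7
```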

Dimensionality reduction

It is often found that the features are not all orthogonally independent. In such cases, it helps to spend some time reworking the features into a much smaller set of orthogonal features. With fewer features, the gradient descent problem is much simpler and the chances of overfitting are significantly reduced.
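A minimal sketch of one common technique for this - principal component analysis via the SVD in NumPy - on made-up data where the third feature is an exact mix of the first two:

```python
import numpy as np

# The data is made up: the third column is the sum of the first two,
# so only two orthogonal directions actually carry information.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.hstack([X, X[:, :1] + X[:, 1:2]])   # redundant third feature

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = int(np.sum(S > 1e-10))          # directions with non-negligible variance
X_reduced = X_centered @ Vt[:k].T   # project onto those directions
print(X.shape, "->", X_reduced.shape)  # (50, 3) -> (50, 2)
```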

Data augmentation

The root of overfitting is a limited training data set - the amount of data available is much less than what is required to train the model across all its features. Naturally, there are two ways to solve this problem: reduce the number of features, or increase the data. Dimensionality reduction takes the first approach; data augmentation takes the second. It is often possible to use the existing data set to generate a larger amount of data to better train the model. Of course, it is not possible to generate more information than is already present in the available data. But it is possible to amplify that information using your knowledge of the domain and the model being trained. For example, image data can easily be augmented by flipping or sliding the images - because we know how images work.
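A minimal sketch of the flipping idea, on a tiny stand-in "image":

```python
# Augmenting image data by horizontal flips - valid because we know a
# mirrored image depicts the same object. The 2x3 grid is a stand-in
# for a real image's pixel values.
image = [[1, 2, 3],
         [4, 5, 6]]

def flip_horizontal(img):
    # Reverse each row of pixels.
    return [row[::-1] for row in img]

augmented = [image, flip_horizontal(image)]  # twice the training samples
```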


Dropout

Another way to ensure that no single weight can drive the model crazy is to randomly force some weights to zero for every sample processed in an iteration. This makes sure that each weight independently learns the information well enough to contribute to the net output, rather than developing a tendency to drag it away.
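This technique is known as dropout. A minimal sketch of an "inverted dropout" mask - here applied to a list of activation values, with an illustrative drop rate:

```python
import random

# Dropout sketch: randomly zero some values so that no single one
# dominates. The rate and activation values are illustrative only.
def dropout(values, rate, rng):
    # Zero each value with probability `rate`; scale survivors by
    # 1 / (1 - rate) so the expected sum stays unchanged.
    return [0.0 if rng.random() < rate else v / (1 - rate) for v in values]

rng = random.Random(0)
activations = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(activations, 0.5, rng)   # a different mask every call
```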

Collect more data

If nothing else works, you have no choice but to take the trouble of collecting more data.


Essentially, regression is one of the most important algorithms used in supervised learning. Regression involves starting with a base hypothesis of the correlation between input and output, and then iteratively refining this relation by trying to minimize the error. This sounds great, but there are many pitfalls in the process. The efficiency of the regression depends upon the richness of the chosen hypothesis and the various hyperparameters used to perform the regression. A major component of the study of machine learning is dedicated to identifying a "good" hypothesis and hyperparameters.


This video has a good introduction to the topic.

If you are interested in a detailed study, you can check out this book.