Basic Concepts of Supervised Learning

We have seen some of the major supervised learning algorithms. Now let's look at some important concepts that are needed when we work on real-life problems.

Hypothesis

The hypothesis is the assumed relation between the input and the output. This hypothesis is verified and improved on each iteration. At the implementation level, it could be the individual coefficients of a polynomial expression, the nodes of a neural network, etc. Essentially, it is a relation that we propose and then correct and refine through the process of learning.
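As a minimal sketch (the function and parameter names below are purely illustrative), a linear hypothesis can be written as a plain function of the input and a pair of proposed parameters:

```python
# A minimal sketch of a linear hypothesis h(x) = theta0 + theta1 * x.
# theta0 and theta1 are the proposed parameters; learning refines them.
def hypothesis(x, theta0, theta1):
    return theta0 + theta1 * x

# An initial guess for the relation between input and output.
print(hypothesis(2.0, theta0=0.5, theta1=1.5))   # prints 3.5
```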

Weights

The hypothesis is defined by parameters that can be altered to make it fit the given data set. For example, in a polynomial expression, the coefficients are the weights. Typically, these are denoted by θ0, θ1, ...
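For instance, the weights of a quadratic hypothesis are just its three coefficients. The sketch below (names chosen for illustration) keeps them in a list so they can be updated during learning:

```python
# Weights of a quadratic hypothesis y = theta0 + theta1*x + theta2*x^2.
theta = [1.0, -2.0, 0.5]   # [theta0, theta1, theta2]

def predict(x, theta):
    # Evaluate the polynomial using its weights.
    return sum(t * x ** i for i, t in enumerate(theta))

print(predict(3.0, theta))   # 1.0 - 6.0 + 4.5 = -0.5
```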

Cost Function

The cost of a hypothesis is a measure of the overall gap between the actual expected output and the output calculated by the hypothesis. This cost is naturally related to the weights that define the hypothesis. The cost function is the formal definition of this cost, expressed as a function of these weights. Given the cost function, a good hypothesis is one with minimal cost.
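For a linear hypothesis, one common choice (assumed here purely for illustration) is the mean squared error, computed over the whole data set as a function of the weights:

```python
def cost(theta0, theta1, xs, ys):
    # Mean squared error between predicted and expected outputs.
    errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(cost(0.0, 2.0, xs, ys))   # 0.0 - a perfect fit has zero cost
print(cost(0.0, 1.0, xs, ys))   # a worse hypothesis has a higher cost
```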

Regression

Typically, the regression process of supervised learning begins with proposing a relation between the input and the output. This relation could be as simple as a linear equation, or it could be a polynomial or even a massive neural network. Each such relation has several parameters. For example, a linear equation y = ax + b has the parameters a and b, which define the equation. The kind of proposal depends upon the analysis of the situation, the availability of data and, of course, the experience of the developer. But once a good model is defined, the next step is to identify the values of these parameters.
Thus, the process of regression consists of multiple iterations of forward propagation and backward propagation: first computing the error based on the current weights (forward propagation) and then updating the weights based on that error (backward propagation). Repeated iterations of forward and backward propagation gradually refine the model. This is the essence of regression - the most fundamental concept in supervised learning.
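The loop below is a schematic sketch of these iterations for a one-weight linear model (the data, learning rate and number of steps are arbitrary assumptions):

```python
# Schematic regression loop: forward propagation computes the errors,
# backward propagation updates the weight based on those errors.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]          # roughly y = 2x

theta = 0.0                        # initial weight
learning_rate = 0.01

for step in range(200):
    # Forward propagation: errors with the current weight.
    errors = [theta * x - y for x, y in zip(xs, ys)]
    # Backward propagation: gradient of the mean squared error w.r.t. theta.
    gradient = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    theta -= learning_rate * gradient

print(theta)   # approaches roughly 2.0 after repeated iterations
```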

Gradient Descent

Gradient descent is one of the most popular methods for identifying the parameters of a learning model. Essentially, gradient descent consists of measuring the error of the model on the available data, and then gradually updating the parameters of the model so that each step gives the best possible reduction in the error. This can be visualized as a ball rolling down a curved surface: at every point it moves in the direction that gives the best reduction in its height. In order to perform gradient descent, we need a good measure of the error, i.e., a well-defined error function.
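The rolling-ball intuition can be captured in a tiny sketch that minimizes a one-dimensional error curve (the curve and step size are chosen only for illustration):

```python
# Gradient descent on a simple error curve E(theta) = (theta - 3)^2.
# At every step, theta moves in the direction of steepest descent.
def error(theta):
    return (theta - 3.0) ** 2

def gradient(theta):
    return 2.0 * (theta - 3.0)

theta = 10.0          # starting point on the curve
alpha = 0.1           # step size (learning rate)
for _ in range(50):
    theta -= alpha * gradient(theta)

print(theta, error(theta))   # theta rolls down towards 3.0, error towards 0
```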

Stochastic Gradient Descent

In traditional gradient descent, we run the forward and backward propagation over the entire data set to compute the error, and then iteratively optimize the model. If this is computationally expensive, why not run through one data point at a time? Stochastic gradient descent trains on one data sample at a time. Its learning curve is not as clean as that of classical gradient descent, but it converges very quickly if the hyperparameters are chosen well. On the other hand, stochastic gradient descent can be disastrous if we get stuck with the wrong hyperparameters.
A major advantage appears when we do not have all the data available right away - for example, when data streams into the system over the internet.
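A minimal sketch of this per-sample update, assuming the same one-weight linear model as before, could look like this:

```python
import random

# Stochastic gradient descent: update the weight after every single sample.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

theta = 0.0
learning_rate = 0.01

for epoch in range(100):
    samples = list(zip(xs, ys))
    random.shuffle(samples)           # visit the samples in a random order
    for x, y in samples:
        error = theta * x - y         # error on this one sample only
        theta -= learning_rate * error * x

print(theta)   # fluctuates around roughly 2.0 rather than converging smoothly
```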

Mini Batch Gradient Descent

Stochastic gradient descent is one extreme, just as batch gradient descent is the other - and extremes have their limitations. Mini batch gradient descent tries to combine the benefits of both by processing the data in small batches.
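A sketch of the same toy model trained in small batches (the batch size and learning rate are arbitrary assumptions):

```python
import random

# Mini batch gradient descent: update the weight once per small batch.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]   # roughly y = 2x

theta = 0.0
learning_rate = 0.01
batch_size = 2

for epoch in range(100):
    samples = list(zip(xs, ys))
    random.shuffle(samples)
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        # Gradient averaged over the mini batch only.
        gradient = sum((theta * x - y) * x for x, y in batch) / len(batch)
        theta -= learning_rate * gradient

print(theta)   # close to 2.0 - smoother than pure SGD, cheaper than full batch
```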

Error Function

This is an important part of the story. The efficiency of the gradient descent method depends heavily on the function that we use to represent the error for the proposed parameters.
  • Mean Square Error - used for regression over continuous outputs
  • Cross Entropy - used for classification
The cost function can be seen as an aggregate of the error function over the whole data set.
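Both error functions can be written down directly. The sketch below uses plain Python (no particular library is assumed) and shows the averaged forms:

```python
import math

def mean_square_error(predicted, expected):
    # Average squared difference - suited to continuous outputs.
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)

def cross_entropy(predicted, expected):
    # Average negative log likelihood - suited to binary classification,
    # where predictions are probabilities in (0, 1) and labels are 0 or 1.
    return -sum(e * math.log(p) + (1 - e) * math.log(1 - p)
                for p, e in zip(predicted, expected)) / len(expected)

print(mean_square_error([2.5, 4.0], [2.0, 4.0]))   # 0.125
print(cross_entropy([0.9, 0.2], [1, 0]))           # small error for good predictions
```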

Underfitting

No amount of effort spent on minimizing the cost function will help if the hypothesis is not rich enough. For example, if we try to fit complex data with a simple linear expression, it just cannot work. This scenario is called underfitting.
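A quick way to see this (numpy is used here purely for illustration) is to fit a straight line to data that is actually quadratic - even the best possible line leaves a large error:

```python
import numpy as np

x = np.linspace(-3, 3, 20)
y = x ** 2                           # the true relation is quadratic

# The best possible straight-line fit - the hypothesis is too simple.
line = np.polyfit(x, y, deg=1)
predictions = np.polyval(line, x)

print(np.mean((predictions - y) ** 2))   # large residual error: underfitting
```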

Overfitting

This is the other extreme of underfitting. It is also possible that the hypothesis is excessively rich. In such a case, the hypothesis fits the available data set perfectly, but it curves wildly between those points, making it very poor for any data that is not in the training set. Such a hypothesis seems perfect while training, but makes absurd predictions when tested on another data set. For example, a high-order polynomial can exactly fit a set of points that lie on a straight line. While training we may feel we have done a great job, with zero error, but for any point outside the training set the predictions go haywire.
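The straight-line example can be sketched directly (numpy is used here only for illustration): a high-degree polynomial passes through the training points almost exactly, yet misbehaves badly outside them:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.size)   # nearly a straight line

# A degree-7 polynomial through 8 points: near-zero training error, heavy overfitting.
coeffs = np.polyfit(x_train, y_train, deg=7)

print(np.max(np.abs(np.polyval(coeffs, x_train) - y_train)))   # ~0: "perfect" on the training set
print(np.polyval(coeffs, 2.0), 2 * 2.0)   # extrapolated prediction typically far from the true trend
```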