Avoid Overfitting


How to Avoid Overfitting?


Overfitting is a major problem that can lead to disappointment after a lot of effort. Underfitting is not as bad, because we find out about it while training. Overfitting is a surprise in the field! Hence, it is important to eliminate the possibility of overfitting. There are various techniques to limit this problem. A few of the important ones are:

Train set, dev set, test set

The input data is split into three parts. The training set is used to train the hypothesis. Once this is done, the trained hypothesis is verified on the dev set. If there is no underfitting or overfitting, the cost on the dev set should be similar to what was observed on the training set. If the cost on the dev set is much higher, the hypothesis is overfitting the training set. If both costs are much higher than acceptable, the hypothesis underfits the available data; in such a case, we need to try a richer hypothesis. Once the hypothesis is finalized by working on the train and dev sets, it should be evaluated on the test set to make sure all is well. The traditional rule of thumb for splitting the input data into these three sets was 60:20:20. Lately, however, we have seen a huge burst in the available input data. When we have millions of records, it does not make sense to push 40% of them into the dev and test sets. Hence most of the data goes into the training set, as long as a few thousand records are kept for each of the dev and test sets.
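
As a minimal sketch, a 60:20:20 split can be done with scikit-learn's train_test_split; the data below is a random toy set standing in for a real one:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy data standing in for a real dataset (hypothetical: 1000 samples, 10 features).
    X = np.random.rand(1000, 10)
    y = np.random.randint(0, 2, size=1000)

    # First carve out 60% for training.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)

    # Split the remaining 40% evenly into dev and test sets (20% each of the original data).
    X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)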

Cross Validation

Cross validation takes this a step further. Divide the available data into N folds. Then, sequentially, hold out each fold for validation while training on the remaining folds. Thus we run the process N times and combine the results, which gives a much more reliable estimate. Of course, this also requires a lot more processing.
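
A sketch of N-fold cross validation, here with N = 5 using scikit-learn's cross_val_score; the model and the toy data are just placeholders:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    # Toy data standing in for a real dataset.
    X = np.random.rand(200, 5)
    y = np.random.randint(0, 2, size=200)

    # 5-fold cross validation: each fold is held out once while the model
    # trains on the other four, giving five independent scores.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(scores.mean(), scores.std())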

Weight Penalty

A common symptom of an overfitting curve is that some of the higher-order weights are too large. "Large" and "small" are vague terms, but for any two models with similar training error, one can reasonably guess that the one with larger values for the higher-order weights is overfitted. In other words, for the same training error, the model with larger weights can be considered to have a higher effective "error". To account for this, we make a small change to the error function and add a component that penalizes large weights. Gradient descent will then naturally move toward the model with smaller weights, thus avoiding overfitting.
Depending upon how the weight penalty is calculated, we have two types of weight penalty: L1 and L2. The L1 penalty is the sum of the absolute values of the weights, and the L2 penalty is the sum of their squared values. Each has its own peculiarities and advantages. The L1 penalty is not differentiable at zero, so it cannot be used directly with simple gradient methods; it requires solving a convex optimization problem. But it has some major advantages: it drives some weights exactly to zero and thus learns sparse models. This is particularly useful when we have too many features, because regularization based on the L1 penalty helps us identify the important ones. The L2 penalty, based on the squared weights, is differentiable and fits gracefully into the gradient descent algorithm.
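
As an illustration, scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty) regressors show the difference in behaviour; the data and the alpha values below are arbitrary placeholders:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # Toy regression data standing in for a real dataset.
    X = np.random.rand(100, 20)
    y = X @ np.random.rand(20) + 0.1 * np.random.rand(100)

    # L2 penalty (ridge): shrinks all weights smoothly toward zero.
    ridge = Ridge(alpha=1.0).fit(X, y)

    # L1 penalty (lasso): drives some weights exactly to zero, giving a sparse model.
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("non-zero ridge weights:", np.sum(ridge.coef_ != 0))
    print("non-zero lasso weights:", np.sum(lasso.coef_ != 0))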

Dimensionality reduction

It is often found that the features are not all orthogonally independent. In such cases, it helps to spend some time reworking the features into orthogonal features that are far fewer in number. With fewer features, the optimization problem becomes much simpler and the chances of overfitting are significantly reduced.
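
A common way to do this is principal component analysis (PCA), which projects the data onto orthogonal directions. A minimal sketch with scikit-learn, on placeholder data:

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data with 50 features standing in for a real dataset.
    X = np.random.rand(500, 50)

    # Keep the orthogonal directions that explain 95% of the variance;
    # the resulting features are usually far fewer in number.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)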

Data augmentation

The root of overfitting is a limited training data set: the amount of data available is much less than what is required to train the model across all its features. Naturally, there are two ways to solve this problem: reduce the number of features or increase the data. Dimensionality reduction takes the first approach; data augmentation takes the second. It is often possible to use the existing data set to generate a larger amount of data to better train the model. Of course, it is not possible to generate more information than is already present in the available data. But it is possible to pump up the effective information using your knowledge about the domain and the model being trained. For example, image data can easily be augmented by flipping or sliding the images, because we know how images work.
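
As a rough sketch, here is how a batch of images might be augmented by horizontal flips using plain NumPy; the images and labels are random placeholders, and we assume the task is unaffected by flipping:

    import numpy as np

    # A hypothetical batch of grayscale images: 100 images of 28x28 pixels.
    images = np.random.rand(100, 28, 28)
    labels = np.random.randint(0, 10, size=100)

    # Horizontal flips double the training set without collecting new data;
    # the labels stay the same because flipping does not change the class here.
    flipped = images[:, :, ::-1]
    augmented_images = np.concatenate([images, flipped], axis=0)
    augmented_labels = np.concatenate([labels, labels], axis=0)
    print(augmented_images.shape)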

Dropout

Another way to ensure that no single weight can drive the model crazy is to randomly skip some units, forcing their outputs to zero, on every sample processed during an iteration. This makes sure each weight independently learns the information well enough to contribute to the net output, rather than developing a tendency to drag it away.
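
A minimal sketch of (inverted) dropout applied to a layer's activations, using plain NumPy; the dropout rate and the activations are placeholders:

    import numpy as np

    def dropout(activations, rate=0.5, training=True):
        # During training, randomly zero out a fraction of the activations and
        # scale the survivors so the expected output stays the same (inverted dropout).
        if not training:
            return activations
        mask = (np.random.rand(*activations.shape) >= rate).astype(activations.dtype)
        return activations * mask / (1.0 - rate)

    # Hypothetical hidden-layer activations for a batch of 4 samples.
    hidden = np.random.rand(4, 8)
    print(dropout(hidden, rate=0.5))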

Collect more data

If nothing else works, you have no choice but to go back and collect more data.