Before starting with this blog, you should be familiar with the concepts of underfitting, overfitting, bias, and variance in machine learning.
Regularization is one of the most basic and important concepts in machine learning. We know that an overfitted model tends to have low accuracy and high error on unseen data.
This happens because the model tries too hard to capture the noise and irrelevant detail in the training dataset.
Noise is basically data points that don't represent the true properties of the data but rather random variation. Fitting it leads to high variance (the model's predictions change a lot when the training data changes) and low bias.
With overfitting we get high error on the test data and low error on the training data, because the model does not generalize to test or unseen data.
To avoid or reduce overfitting we have several techniques, for example cross-validation, balancing bias and variance, ensemble algorithms, and regularization.
Here we will discuss regularization as a technique for dealing with the problem of overfitting.
Regularization means making things regular or acceptable, and that is exactly what we are doing here. Overfitting occurs mostly when we train a complex model. In simple terms, regularization discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
By definition, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.
How does it work?
To understand regularization, let's take a look at the equation of a linear regression model:
Y ≈ β0 + β1X1 + β2X2 + ... + βpXp
where Y represents the predicted value,
β1, β2...βp represent the weights (coefficient estimates) attached to the features,
β0 represents the bias of the model, and
X1, X2...Xp are the features.
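The loss function for this fitting procedure is the residual sum of squares (RSS):
RSS = Σ_{i=1..n} (y_i − β0 − Σ_{j=1..p} βj x_ij)²
where y_i is the actual value for the i-th training example and x_ij is the value of its j-th feature.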
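To make the RSS concrete, here is a minimal sketch in plain Python. It sums the squared differences between the actual values and the predictions of the linear model. The function names `predict` and `rss` and the toy data are made up for illustration and are not part of any library.

```python
def predict(beta0, betas, x):
    """Linear model for one sample: beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(betas, x))


def rss(beta0, betas, X, y):
    """Residual sum of squares over all samples."""
    return sum((yi - predict(beta0, betas, xi)) ** 2 for xi, yi in zip(X, y))


# Toy data (hypothetical): generated exactly by y = 1 + 2*x1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

perfect_fit = rss(1.0, [2.0], X, y)  # coefficients that match the data
poor_fit = rss(0.0, [1.0], X, y)     # coefficients that do not

print(perfect_fit, poor_fit)  # prints: 0.0 30.0
```

Fitting a linear regression means searching for the coefficients that make this number as small as possible on the training data.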
Here we adjust the coefficients (weights) based on the training data. If the training data contains noise, the coefficients fitted to it won't generalize well to test or unseen data.
This is the situation where we use regularization, which shrinks (regularizes) the learned estimates towards zero. The parameters (weights and bias) are then optimized to keep the RSS (error) low while also keeping the coefficients small, so that the predicted Y stays close to the actual value on unseen data.
Two common regularization techniques are:
- Ridge Regression
- Lasso Regression
Ridge regression is a type of linear regression in which the RSS is modified by adding a shrinkage quantity. Consider the formula given below (the ridge regression cost function):
Cost = RSS + λ Σ_{j=1..p} βj²
As you can see, the formula is the RSS (residual sum of squares) with an additional term: λ multiplied by the sum of the squared coefficients of the individual features.
This additional term is known as the penalty term, and the amount of bias it adds is known as the ridge regression penalty.
By adding the penalty term we regularize the coefficients of the model: ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
If λ tends to zero, the cost function becomes the cost function of ordinary linear regression, and as λ grows the coefficients are pushed closer to zero. Hence the value of λ plays an important role, and selecting a good value of λ is critical.
Ridge regression is also called L2 regularization, and it is used when the model has a lot of parameters and collinearity among the features.
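The effect of λ on a ridge coefficient can be sketched with a deliberately simplified case: one feature and no bias term. Under that assumption, minimizing Σ(y_i − βx_i)² + λβ² has the closed-form solution β = Σx_i y_i / (Σx_i² + λ). The function name `ridge_coefficient` and the data are hypothetical and only meant as an illustration.

```python
def ridge_coefficient(x, y, lam):
    """Ridge estimate for a single feature with no intercept.

    Minimizes sum((y_i - beta * x_i)**2) + lam * beta**2.
    Setting the derivative to zero gives sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Toy data (hypothetical): roughly y = 2 * x with a little noise.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 1.0, 10.0, 100.0]
ridge_betas = [ridge_coefficient(x, y, lam) for lam in lambdas]

for lam, beta in zip(lambdas, ridge_betas):
    print(f"lambda={lam:>6}: beta={beta:.4f}")
```

With λ = 0 the result is the ordinary least-squares estimate. Each larger λ gives a smaller coefficient, but the coefficient stays above zero for any finite λ.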
Lasso, which stands for Least Absolute Shrinkage and Selection Operator, is also a type of regularized linear regression. See the cost function for lasso regression:
Cost = RSS + λ Σ_{j=1..p} |βj|
As you can see, it is similar to ridge regression, except that the penalty term contains the absolute values of the weights instead of their squares.
It is also known as L1 regularization. Lasso regression is also used to reduce overfitting and to regularize the coefficients.
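The difference between ridge and lasso regression is that lasso can shrink coefficients to exactly zero, whereas ridge shrinks them towards zero but generally does not set any coefficient to exactly zero. Because a zero coefficient removes its feature from the model, lasso also acts as a form of feature selection.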
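This difference can be seen in a small sketch, again under the simplifying assumption of one feature and no bias term. In that case the lasso cost Σ(y_i − βx_i)² + λ|β| is minimized by a "soft-thresholding" formula, which returns exactly zero once λ is large enough, while the ridge formula only makes the coefficient smaller. The function names and data below are hypothetical.

```python
def ridge_coefficient(x, y, lam):
    """Single-feature ridge estimate: sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_coefficient(x, y, lam):
    """Single-feature lasso estimate via soft-thresholding.

    Minimizes sum((y_i - beta * x_i)**2) + lam * abs(beta).
    The solution is sign(sxy) * max(|sxy| - lam / 2, 0) / sxx.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy > 0 else -1.0
    return sign * shrunk / sxx


# Toy data (hypothetical): roughly y = 2 * x with a little noise.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 50.0, 200.0]
ridge_betas = [ridge_coefficient(x, y, lam) for lam in lambdas]
lasso_betas = [lasso_coefficient(x, y, lam) for lam in lambdas]

for lam, r, l in zip(lambdas, ridge_betas, lasso_betas):
    print(f"lambda={lam:>6}: ridge={r:.4f}  lasso={l:.4f}")
```

At λ = 0 both estimates match ordinary least squares. At the largest λ in this example the lasso coefficient is exactly zero, while the ridge coefficient is small but still positive.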