Regularization#
Regularization is a technique used in deep learning to reduce overfitting of the model. It achieves this by adding a regularization term to the objective function, which introduces a penalty on the complexity of the model to constrain the size of the parameters or the relationships between the parameters, thereby improving generalization and performance on unseen data.
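In its most common form, the regularized objective adds a penalty $\Omega(\theta)$ on the parameters, weighted by a coefficient $\lambda$ (standard notation, shown here for reference):

$$\tilde{J}(\theta) = J(\theta) + \lambda\,\Omega(\theta)$$

where $J(\theta)$ is the original loss, $\Omega(\theta)$ is the penalty (for example the L1 or L2 norm of the weights), and $\lambda \ge 0$ controls how strongly the penalty constrains the parameters.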
Overfitting and Underfitting#
Overfitting#
Overfitting refers to the situation where the model fits the training data too well but performs poorly on new data. This is because the model is too complex and pays too much attention to the details and noise in the training data, making it unable to generalize well to unseen data.
Underfitting#
Underfitting refers to the situation where the model fails to fit the training data well and fails to capture the key features and patterns in the data, resulting in poor performance on both the training data and new data.
To address underfitting, we can take the following steps:
- Increase model complexity (use a larger or more expressive model)
- Perform feature engineering to add more informative features
- Reduce the strength of regularization
L1 Regularization (Lasso)#
L1 regularization is achieved by adding an L1 norm penalty term to the model's loss function. The L1 norm refers to the sum of the absolute values of the model's weight parameters. The goal of L1 regularization is to minimize the sum of the loss function and the L1 norm penalty term. Its effect is to compress some weight parameters to zero, thereby achieving feature selection and sparsity.
Because the derivative of $|w|$ is constant ($\pm 1$), L1 adds a constant-magnitude term to the gradient: at every iteration a small fixed amount is subtracted from the weights of features that provide little information, so those weights are driven all the way to exactly zero. It encourages the model to have small coefficients and tends to produce sparse solutions.
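As a concrete illustration, here is a minimal NumPy sketch of a single gradient step with an L1 penalty added to the loss (the names `l1_gradient_step`, `w`, `grad_loss`, `lam`, and `lr` are illustrative assumptions, not from the text above):

import numpy as np

def l1_gradient_step(w, grad_loss, lam=1e-3, lr=0.1):
    # Subgradient of lam * ||w||_1 is lam * sign(w): a constant-magnitude
    # push toward zero that does not shrink as the weight gets smaller,
    # pushing small, uninformative weights toward exactly zero.
    return w - lr * (grad_loss + lam * np.sign(w))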
L2 Regularization (Ridge)#
L2 regularization is achieved by adding an L2 norm penalty term to the model's loss function. The L2 norm is the square root of the sum of the squares of the weight parameters; in practice the penalty is usually the squared L2 norm, $\frac{\lambda}{2}\sum_i w_i^2$. The goal of L2 regularization is to minimize the sum of the loss function and this penalty term. Its effect is to shrink the weight parameters, making them smoother and more stable.
Under plain gradient descent the effect is equivalent to weight decay: each update multiplies the weights by a factor slightly less than one. It encourages small coefficients but, unlike L1, does not drive them exactly to zero, so it does not produce sparsity.
| | L1 | L2 |
| --- | --- | --- |
| Effect | Can produce sparser models | Helps with ill-conditioned problems |
| Prior probability | Parameters follow a Laplace prior distribution | Parameters follow a Gaussian prior distribution |
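For comparison with the L1 sketch above, here is the corresponding L2 step, where the penalty's gradient is proportional to the weight itself (again, the names and default values are illustrative):

def l2_gradient_step(w, grad_loss, lam=1e-3, lr=0.1):
    # Gradient of (lam / 2) * ||w||_2^2 is lam * w: the pull toward zero is
    # proportional to the weight itself, i.e. classic weight decay
    # w <- (1 - lr * lam) * w - lr * grad_loss.
    return w - lr * (grad_loss + lam * w)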
Dropout#
Dropout regularization is achieved by randomly setting a portion of the neuron outputs to zero during training. Specifically, each neuron has a certain probability of being dropped out, making the model unable to rely on specific neurons and enhancing the model's generalization ability.
(1) Averaging effect: Consider the standard model without dropout. If we train 5 different neural networks on the same training data, we generally get 5 different results, and we can combine them by averaging the 5 outputs or by majority voting. For example, if 3 of the networks predict the digit 9, the true answer is very likely 9 and the other 2 networks are probably wrong. This averaging strategy is effective against overfitting because different networks overfit in different ways, and averaging tends to cancel out the "opposite" errors. Dropping different subsets of hidden neurons is like training different networks: randomly dropping, say, half of the hidden neurons changes the network structure each time, so the whole dropout procedure is equivalent to averaging over many different neural networks, which reduces overfitting overall.
(2) Reducing complex co-adaptation between neurons: Because dropout is random, two given neurons do not always appear together in the same thinned network. Weight updates therefore no longer depend on fixed combinations of hidden units, which prevents the situation where a feature is useful only in the presence of certain other specific features. The network is forced to learn more robust features that remain useful in random subsets of the other neurons. In other words, a prediction should not be overly sensitive to any particular fragment of evidence; even if a specific clue is lost, the model should still be able to recover shared features from many other clues. From this perspective, dropout is somewhat similar to L1 and L2 regularization: it shrinks the effective weights and improves the network's robustness to the loss of individual neuron connections.
(3) Dropout plays a role similar to that of gender in biological evolution: species adapt to their environment in order to survive, but when the environment changes they may be unable to respond in time. The variation introduced by sexual reproduction produces individuals that can adapt to the new environment, effectively preventing "overfitting" to the old one and helping the species avoid the extinction it might otherwise face when the environment changes.
- Vanilla Dropout: scale the outputs by the keep probability $1-p$ at test time (where $p$ is the dropout rate).
- Inverted Dropout (the mainstream method): scale the kept activations by $\frac{1}{1-p}$ during training, so no extra scaling is needed at test time.
import numpy as np

class Dropout:
    def __init__(self, dropout_rate):
        self.dropout_rate = dropout_rate
        self.mask = None

    def forward(self, x, is_train):
        if is_train:
            # Inverted dropout: drop each unit with probability dropout_rate and
            # rescale the survivors by 1 / (1 - dropout_rate) during training,
            # so no extra scaling is needed at test time.
            self.mask = np.random.binomial(1, 1 - self.dropout_rate, size=x.shape) / (1 - self.dropout_rate)
            out = x * self.mask
        else:
            # At test time the input passes through unchanged.
            out = x
        return out

    def backward(self, dout):
        # Gradients flow only through the units kept in the forward pass.
        dx = dout * self.mask
        return dx
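A quick usage sketch continuing from the class above (the input shape and dropout rate are arbitrary choices for illustration):

drop = Dropout(dropout_rate=0.5)
x = np.random.randn(4, 8)
out_train = drop.forward(x, is_train=True)    # about half the entries zeroed, survivors scaled by 2
out_test = drop.forward(x, is_train=False)    # identical to x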
Early Stopping#
Early stopping is a strategy based on the performance of the model on the validation set to stop the training of the model early. By monitoring the performance metrics (such as loss function or accuracy) of the model on the validation set, if the performance of the model does not improve within a certain number of training epochs, the training is stopped early to avoid overfitting.
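A minimal sketch of a patience-based early-stopping loop; `model`, `train_one_epoch`, `evaluate`, `val_data`, and the weight get/set calls are hypothetical placeholders, and the patience of 5 epochs is an arbitrary choice:

best_loss = float("inf")
best_weights = None
patience, wait = 5, 0                          # stop after 5 epochs with no improvement
for epoch in range(100):
    train_one_epoch(model)                     # hypothetical: one pass over the training set
    val_loss = evaluate(model, val_data)       # hypothetical: loss on the validation set
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
        best_weights = model.get_weights()     # hypothetical checkpointing call
    else:
        wait += 1
        if wait >= patience:
            model.set_weights(best_weights)    # roll back to the best checkpoint
            break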
Batch Normalization#
Batch Normalization normalizes each layer's inputs over a mini-batch so that the distribution of intermediate-layer inputs stays stable during training. By reducing internal covariate shift and rescaling the inputs, it speeds up training and helps prevent vanishing or exploding gradients.
It is suitable for cases where the batch size is large and the sequence length is fixed, such as CNNs.
$\gamma$ and $\beta$ are the learnable scaling factor and offset, respectively. Simply subtracting the mean and dividing by the standard deviation may not give the best distribution for the following layer, so these two learnable parameters are added to let the network adjust the normalized distribution for better results.
During training, normalization uses the statistics of the current batch while running estimates of the mean and variance are accumulated with a moving average; at test time, these running estimates are used directly.
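For a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the standard transformation (written here for reference) is:

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i,\quad \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^2,\quad \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}},\quad y_i = \gamma\,\hat{x}_i + \beta$$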
import numpy as np

class BatchNormalization:
    def __init__(self, epsilon=1e-5, momentum=0.9):
        self.epsilon = epsilon
        self.momentum = momentum
        self.running_mean = None
        self.running_var = None
        self.gamma = None
        self.beta = None

    def forward(self, X, training=True):
        N, D = X.shape
        # Lazily initialize the running statistics and the learnable scale/shift.
        if self.running_mean is None:
            self.running_mean = np.zeros(D)
        if self.running_var is None:
            self.running_var = np.zeros(D)
        if self.gamma is None:
            self.gamma = np.ones(D)
        if self.beta is None:
            self.beta = np.zeros(D)
        if training:
            # Normalize with the statistics of the current mini-batch.
            sample_mean = np.mean(X, axis=0)
            sample_var = np.var(X, axis=0)
            X_normalized = (X - sample_mean) / np.sqrt(sample_var + self.epsilon)
            # Accumulate running estimates with a moving average for use at test time.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * sample_mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * sample_var
        else:
            # At test time, normalize with the accumulated running statistics.
            X_normalized = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
        out = self.gamma * X_normalized + self.beta
        return out
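A quick usage sketch continuing from the class above (the batch size of 32 and 16 features are arbitrary):

bn = BatchNormalization()
X = np.random.randn(32, 16)                   # 32 samples, 16 features
out_train = bn.forward(X, training=True)      # normalized with batch statistics
out_test = bn.forward(X, training=False)      # normalized with the running estimates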
What problems does BN solve?
As a network deepens, the distribution of the activation inputs (the values fed into the nonlinear transformation) drifts during training (internal covariate shift). Training converges slowly largely because this distribution gradually moves toward the saturated ends of the nonlinear function, which causes vanishing gradients in the lower layers during backpropagation. BN forcibly pulls the distribution of each neuron's input values back to a standard normal distribution with mean 0 and variance 1 through normalization (before the learnable scale and shift), keeping the inputs in the region where the nonlinearity still has a useful gradient.
Layer Normalization#
Layer Normalization normalizes over the feature dimension of each individual sample rather than over the batch, so it is suitable for variable-length sequences, such as in RNNs and Transformers.
If BN were applied to NLP tasks, it would amount to assuming that tokens at the same position in different sentences correspond to the same feature, which generally does not hold.
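A minimal NumPy sketch of the difference in normalization axis, assuming an input of shape `(batch, features)` and omitting the learnable $\gamma$ and $\beta$ for brevity:

import numpy as np

def layer_norm(x, eps=1e-5):
    # Each sample is normalized over its own feature dimension (axis=1),
    # independently of the other samples in the batch.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Batch Normalization, by contrast, normalizes each feature over the batch dimension (axis=0).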
Others#
- Elastic Net regularization: A method that combines L1 and L2 regularization. It introduces both an L1-norm and an L2-norm penalty term in the loss function and balances their regularization effects by adjusting the weights of the two terms, achieving both feature selection and weight shrinkage (a minimal sketch of the combined penalty appears after this list).
- Data augmentation is a technique that increases the number of data samples by applying a series of random transformations or augmentations to the original training data. These transformations can include random rotation, translation, scaling, flipping, etc. Data augmentation can help the model generalize better and alleviate the problem of overfitting.
- Parameter sharing is a technique that shares some parameters between layers in a neural network. Sharing parameters between layers with similar structures or functions can reduce the number of model parameters, thereby reducing the complexity of the model and the risk of overfitting. Parameter sharing is commonly used in convolutional neural networks (CNNs).
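As referenced in the Elastic Net item above, here is a minimal sketch of the combined penalty and its (sub)gradient (the names and default values are illustrative assumptions):

import numpy as np

def elastic_net_penalty(w, lam=1e-3, l1_ratio=0.5):
    # Weighted combination of the L1 penalty and the (halved) squared L2 penalty.
    return lam * (l1_ratio * np.abs(w).sum() + (1 - l1_ratio) * 0.5 * (w ** 2).sum())

def elastic_net_grad(w, lam=1e-3, l1_ratio=0.5):
    # Subgradient: sign(w) from the L1 term plus w itself from the L2 term.
    return lam * (l1_ratio * np.sign(w) + (1 - l1_ratio) * w)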