Deep learning optimizer

Optimizer#

Backpropagation#

The main idea of the backpropagation algorithm is to compute the gradient of the loss function with respect to the model parameters, and then update the parameters using gradient descent to minimize the loss function, gradually approaching the optimal solution.

The implementation process includes two key steps: forward propagation and backward propagation. Forward propagation calculates the output of the network by passing the network input through each layer, while backward propagation uses the chain rule to propagate the gradient of the loss function to each parameter in the network to update the parameters. This iterative process allows the neural network to gradually learn the mapping relationship between the input and output, thereby improving the predictive performance of the network.

The basic idea of the backpropagation algorithm is to propagate the loss backwards through the network, computing and accumulating gradients layer by layer from the output layer to the input layer. Specifically, the algorithm starts from the last layer of the network, calculates the error gradient at the output, and then propagates this gradient to the previous layer, iterating until it reaches the input layer. At each layer, following the chain rule, the gradient at the current layer is multiplied by that layer's local derivative (its weights, in the case of a linear layer) and then passed to the previous layer.
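
To make the chain-rule bookkeeping concrete, here is a minimal NumPy sketch of one forward and backward pass through a two-layer network; the layer sizes, sigmoid activation, and mean-squared-error loss are illustrative assumptions, not something specified above.

```python
import numpy as np

# Toy two-layer network: 3 inputs -> 5 hidden (sigmoid) -> 1 output, MSE loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features each
y = rng.normal(size=(4, 1))          # targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# Forward propagation: compute each layer's output.
z1 = x @ W1 + b1
h1 = 1.0 / (1.0 + np.exp(-z1))       # sigmoid activation
y_hat = h1 @ W2 + b2
loss = np.mean((y_hat - y) ** 2)     # shown for reference; differentiated by hand below

# Backward propagation: apply the chain rule from the output layer back to the input.
dy_hat = 2.0 * (y_hat - y) / y.shape[0]   # dL/dy_hat
dW2 = h1.T @ dy_hat                       # dL/dW2
db2 = dy_hat.sum(axis=0)
dh1 = dy_hat @ W2.T                       # gradient passed back to the previous layer
dz1 = dh1 * h1 * (1.0 - h1)               # through the sigmoid's local derivative
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)

# Gradient-descent parameter update.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```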

The steps of the backpropagation algorithm are as follows (a code sketch of this loop appears after the list):

  1. Forward propagation: Pass the input data through the network to calculate the output.
  2. Calculate the loss: Compare the output with the true labels to calculate the value of the loss function.
  3. Backward propagation: Starting from the output layer, calculate the gradient of the output layer based on the loss function, and then propagate the gradient forward to calculate the gradient of each layer.
  4. Parameter update: Use gradient descent or other optimization algorithms to update the network parameters based on the calculated gradients.
  5. Repeat steps 1-4 until the stopping condition is met (such as reaching the maximum number of iterations or convergence of the loss function).
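
As a sketch, these five steps map directly onto a typical PyTorch training loop; the model, data, loss, and step count below are placeholder assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Placeholder setup: a small regression model and random data stand in for a real task.
model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(128, 3), torch.randn(128, 1)

for step in range(100):
    y_hat = model(x)              # 1. forward propagation
    loss = loss_fn(y_hat, y)      # 2. calculate the loss
    optimizer.zero_grad()         # clear gradients from the previous step
    loss.backward()               # 3. backward propagation (chain rule)
    optimizer.step()              # 4. parameter update
    # 5. repeat until a stopping condition is met (a fixed step count here)
```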

Batch gradient descent (GD)#

$$\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}-\alpha \nabla_{\boldsymbol{\theta}} \frac{1}{N} \sum_{i=1}^{N} J\left(\boldsymbol{\theta} ; x^{(i)}\right)$$

$N$: all training samples (the full dataset)

Stochastic gradient descent (SGD)#

$$\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}-\alpha \nabla_{\boldsymbol{\theta}} \frac{1}{\text{BS}} \sum_{i=1}^{\text{BS}} J\left(\boldsymbol{\theta} ; x^{(i)}\right)$$

$\text{BS}$: mini-batch size
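
A small NumPy sketch of the difference between the two rules on a toy least-squares problem: batch GD averages the gradient over all $N$ samples, while SGD averages over a random mini-batch of size $\text{BS}$. The data, objective, and step size are illustrative assumptions.

```python
import numpy as np

# Toy least-squares problem: J(theta; x_i) = 0.5 * (x_i @ theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(3)
alpha, BS = 0.1, 32

def grad(theta, X, y):
    # Average gradient of the squared error over the given samples.
    return X.T @ (X @ theta - y) / len(y)

# Batch GD: one step uses the gradient averaged over all N samples.
theta = theta - alpha * grad(theta, X, y)

# SGD: one step uses the gradient averaged over a random mini-batch of size BS.
idx = rng.choice(len(y), size=BS, replace=False)
theta = theta - alpha * grad(theta, X[idx], y[idx])
```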

Adagrad#

$$v_t=v_{t-1}+g_t^2\qquad \theta_t=\theta_{t-1}-\alpha\frac{g_t}{\sqrt{v_t+\epsilon}}$$
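
A minimal sketch of the Adagrad rule applied to a toy quadratic objective $J(\theta)=\tfrac{1}{2}\lVert\theta\rVert^2$ (so $g_t=\theta$); the learning rate and $\epsilon$ are assumed values.

```python
import numpy as np

alpha, eps = 0.1, 1e-8
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for t in range(100):
    g = theta                                   # gradient of the toy objective
    v = v + g ** 2                              # v_t = v_{t-1} + g_t^2 (running sum)
    theta = theta - alpha * g / np.sqrt(v + eps)
```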

RMSProp#

$$v_t=\gamma v_{t-1}+(1-\gamma)g_t^2 \qquad \theta_t=\theta_{t-1}-\alpha\frac{g_t}{\sqrt{v_t+\epsilon}}$$
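
The same toy setup with the RMSProp rule; only the accumulator line changes, replacing Adagrad's running sum with an exponential moving average. $\gamma$, the learning rate, and $\epsilon$ are assumed values.

```python
import numpy as np

alpha, gamma, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for t in range(200):
    g = theta                                   # gradient of the toy objective
    v = gamma * v + (1 - gamma) * g ** 2        # moving average of squared gradients
    theta = theta - alpha * g / np.sqrt(v + eps)
```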

Adam#

$$m_t=\beta_1 m_{t-1} + (1-\beta_1)g_t\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2\qquad \theta_t=\theta_{t-1}-\alpha\frac{m_t}{\sqrt{v_t+\epsilon}}$$

$\epsilon=10^{-9},\ \beta_1=0.9,\ \beta_2=0.999$

Bias correction (compensates for the zero initialization of $m_t$ and $v_t$ during the early warm-up steps):

$$\hat{m}_t=m_t/(1-\beta_1^t)\qquad \hat{v}_t=v_t/(1-\beta_2^t)$$

The parameters are then updated with the bias-corrected estimates: $\theta_t=\theta_{t-1}-\alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$.
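
Putting the pieces together, a minimal NumPy sketch of Adam with bias correction on the same toy quadratic objective; the step count is arbitrary, and the hyperparameters follow the defaults quoted above.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-9
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    g = theta                                   # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * g             # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)
```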