Deep learning optimizer

Optimizer#

Backpropagation#

The main idea of the backpropagation algorithm is to compute the gradient of the loss function with respect to the model parameters, and then update the parameters using gradient descent to minimize the loss function, gradually approaching the optimal solution.

The implementation process includes two key steps: forward propagation and backward propagation. Forward propagation calculates the output of the network by passing the network input through each layer, while backward propagation uses the chain rule to propagate the gradient of the loss function to each parameter in the network to update the parameters. This iterative process allows the neural network to gradually learn the mapping relationship between the input and output, thereby improving the predictive performance of the network.

The basic idea of the backpropagation algorithm is to propagate the loss backwards through the network, computing and accumulating gradients layer by layer from the output layer to the input layer. Specifically, the algorithm starts from the last layer of the network, calculates the error gradient at the output, and then propagates this gradient to the previous layer, iterating until it reaches the input layer. At each layer, following the chain rule, the gradient at the current layer is multiplied by that layer's local derivative (its weights, in the case of a linear layer) and then passed to the previous layer.
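
To make the chain-rule bookkeeping concrete, here is a minimal NumPy sketch of one forward and backward pass through a two-layer network; the layer sizes, sigmoid activation, and mean-squared-error loss are illustrative assumptions, not something specified above.

```python
import numpy as np

# Toy two-layer network: 3 inputs -> 5 hidden (sigmoid) -> 1 output, MSE loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features each
y = rng.normal(size=(4, 1))          # targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# Forward propagation: compute each layer's output.
z1 = x @ W1 + b1
h1 = 1.0 / (1.0 + np.exp(-z1))       # sigmoid activation
y_hat = h1 @ W2 + b2
loss = np.mean((y_hat - y) ** 2)     # shown for reference; differentiated by hand below

# Backward propagation: apply the chain rule from the output layer back to the input.
dy_hat = 2.0 * (y_hat - y) / y.shape[0]   # dL/dy_hat
dW2 = h1.T @ dy_hat                       # dL/dW2
db2 = dy_hat.sum(axis=0)
dh1 = dy_hat @ W2.T                       # gradient passed back to the previous layer
dz1 = dh1 * h1 * (1.0 - h1)               # through the sigmoid's local derivative
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)

# Gradient-descent parameter update.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```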

The steps of the backpropagation algorithm are as follows (a code sketch of this loop appears after the list):

  1. Forward propagation: Pass the input data through the network to calculate the output.
  2. Calculate the loss: Compare the output with the true labels to calculate the value of the loss function.
  3. Backward propagation: Starting from the output layer, calculate the gradient of the output layer based on the loss function, and then propagate the gradient forward to calculate the gradient of each layer.
  4. Parameter update: Use gradient descent or other optimization algorithms to update the network parameters based on the calculated gradients.
  5. Repeat steps 1-4 until the stopping condition is met (such as reaching the maximum number of iterations or convergence of the loss function).
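
As a sketch, these five steps map directly onto a typical PyTorch training loop; the model, data, loss, and step count below are placeholder assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

# Placeholder setup: a small regression model and random data stand in for a real task.
model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(128, 3), torch.randn(128, 1)

for step in range(100):
    y_hat = model(x)              # 1. forward propagation
    loss = loss_fn(y_hat, y)      # 2. calculate the loss
    optimizer.zero_grad()         # clear gradients from the previous step
    loss.backward()               # 3. backward propagation (chain rule)
    optimizer.step()              # 4. parameter update
    # 5. repeat until a stopping condition is met (a fixed step count here)
```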

Batch gradient descent (GD)#

$$\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}-\alpha \nabla_{\boldsymbol{\theta}} \frac{1}{N} \sum_{i=1}^{N} J\left(\boldsymbol{\theta} ; x^{(i)}\right)$$

$N$: all training samples (the full dataset)

Stochastic gradient descent (SGD)#

$$\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}-\alpha \nabla_{\boldsymbol{\theta}} \frac{1}{\text{BS}} \sum_{i=1}^{\text{BS}} J\left(\boldsymbol{\theta} ; x^{(i)}\right)$$

$\text{BS}$: mini-batch size
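
A small NumPy sketch of the difference between the two rules on a toy least-squares problem: batch GD averages the gradient over all $N$ samples, while SGD averages over a random mini-batch of size $\text{BS}$. The data, objective, and step size are illustrative assumptions.

```python
import numpy as np

# Toy least-squares problem: J(theta; x_i) = 0.5 * (x_i @ theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(3)
alpha, BS = 0.1, 32

def grad(theta, X, y):
    # Average gradient of the squared error over the given samples.
    return X.T @ (X @ theta - y) / len(y)

# Batch GD: one step uses the gradient averaged over all N samples.
theta = theta - alpha * grad(theta, X, y)

# SGD: one step uses the gradient averaged over a random mini-batch of size BS.
idx = rng.choice(len(y), size=BS, replace=False)
theta = theta - alpha * grad(theta, X[idx], y[idx])
```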

Adagrad#

$$v_t=v_{t-1}+g_t^2\qquad \theta_t=\theta_{t-1}-\alpha\frac{g_t}{\sqrt{v_t+\epsilon}}$$
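
A minimal sketch of the Adagrad rule applied to a toy quadratic objective $J(\theta)=\tfrac{1}{2}\lVert\theta\rVert^2$ (so $g_t=\theta$); the learning rate and $\epsilon$ are assumed values.

```python
import numpy as np

alpha, eps = 0.1, 1e-8
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for t in range(100):
    g = theta                                   # gradient of the toy objective
    v = v + g ** 2                              # v_t = v_{t-1} + g_t^2 (running sum)
    theta = theta - alpha * g / np.sqrt(v + eps)
```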

RMSProp#

$$v_t=\gamma v_{t-1}+(1-\gamma)g_t^2 \qquad \theta_t=\theta_{t-1}-\alpha\frac{g_t}{\sqrt{v_t+\epsilon}}$$
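
The same toy setup with the RMSProp rule; only the accumulator line changes, replacing Adagrad's running sum with an exponential moving average. $\gamma$, the learning rate, and $\epsilon$ are assumed values.

```python
import numpy as np

alpha, gamma, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for t in range(200):
    g = theta                                   # gradient of the toy objective
    v = gamma * v + (1 - gamma) * g ** 2        # moving average of squared gradients
    theta = theta - alpha * g / np.sqrt(v + eps)
```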

Adam#

$$m_t=\beta_1 m_{t-1} + (1-\beta_1)g_t\qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2\qquad \theta_t=\theta_{t-1}-\alpha\frac{m_t}{\sqrt{v_t+\epsilon}}$$

$\epsilon=10^{-9},\ \beta_1=0.9,\ \beta_2=0.999$

Bias correction (compensates for the zero initialization of $m_t$ and $v_t$ during the early warm-up steps):

$$\hat{m}_t=m_t/(1-\beta_1^t)\qquad \hat{v}_t=v_t/(1-\beta_2^t)$$

The parameters are then updated with the bias-corrected estimates: $\theta_t=\theta_{t-1}-\alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t+\epsilon}}$.
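
Putting the pieces together, a minimal NumPy sketch of Adam with bias correction on the same toy quadratic objective; the step count is arbitrary, and the hyperparameters follow the defaults quoted above.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-9
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    g = theta                                   # gradient of the toy objective
    m = beta1 * m + (1 - beta1) * g             # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)
```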