If you look carefully at PyTorch’s documentation of SGD, you will find that its implementation of Nesterov momentum differs in a few ways from the formulation found in the original paper. Most notably, PyTorch’s implementation evaluates the gradient at the current parameters, whereas the whole point of Nesterov momentum is to evaluate the gradient at shifted parameters. Unfortunately, discussion of these discrepancies on the web is scarce. In this post, we will examine and explain the differences between PyTorch’s implementation and the original formulation of Nesterov momentum. Ultimately, we will see that PyTorch’s implementation is not wrong, but rather an approximation, and speculate about the benefit of its implementation.
The original paper describes Nesterov momentum using the following update rules:
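v_{t+1} = μ v_t − ε ∇f(θ_t + μ v_t)    (1)
θ_{t+1} = θ_t + v_{t+1}    (2)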
where v_{t+1} and θ_{t+1} are the velocity vector and model parameters respectively at time t + 1, μ is the momentum factor, and ε is the learning rate. The note in PyTorch’s SGD documentation states that it uses the following update rules:
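v_{t+1} = μ v_t + g_{t+1}
θ_{t+1} = θ_t − ε v_{t+1}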
where g_{t+1} represents the gradient used to compute v_{t+1}. We can expand the update rule for θ_{t+1} to get:
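θ_{t+1} = θ_t − ε (μ v_t + g_{t+1}) = θ_t − ε μ v_t − ε g_{t+1}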
From this we can infer that:
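g_{t+1} = ∇f(θ_t − ε μ v_t)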
and the update rules become:
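v_{t+1} = μ v_t + ∇f(θ_t − ε μ v_t)    (3)
θ_{t+1} = θ_t − ε v_{t+1}    (4)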
These are the update rules that PyTorch uses in theory. I mentioned earlier that PyTorch actually evaluates the gradient at the current parameters instead of the shifted parameters. This can be seen by looking at the algorithm description in the PyTorch SGD documentation. We will examine this further later on.
Note that for both the original (1, 2) and PyTorch (3, 4) formulations, if v_0 = 0, then the first update to θ becomes:
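θ_1 = θ_0 − ε ∇f(θ_0)    (5)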
Although the PyTorch SGD documentation note states that the algorithm initializes the momentum buffer to the gradient at the first step, we will later show that this implies v_0 = 0.
There are two immediate differences when going from the original (1, 2) to the PyTorch (3, 4) formulation:
- The learning rate is moved outside of v_{t+1}.
- In the update rule for v_{t+1}, the term involving the gradient is added instead of subtracted, and in the update rule for θ_{t+1}, the term involving the velocity vector is subtracted instead of added. The difference in sign inside the gradient term is simply a consequence of this, as shown in the previous section.
To understand these differences, let’s first expand the update rules. As hinted at here, the effect of the first difference is more apparent if we consider learning rate schedules. So, we consider a generalization of the update rules where ε is no longer fixed but can vary over time, and denote ε_t as the learning rate at time step t. For brevity, let:
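g_i ≡ ∇f(θ_i + μ v_i) for the original formulation, and g_i ≡ ∇f(θ_i − ε_i μ v_i) for the PyTorch formulation,

so that g_i denotes the gradient evaluated at the shifted parameters at time step i in each formulation.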
Assuming v_0 = 0, the original formulation becomes:
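θ_{t+1} = θ_t − Σ_{i=0}^{t} μ^{t−i} ε_i g_i    (6)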
and the PyTorch formulation becomes:
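θ_{t+1} = θ_t − ε_t Σ_{i=0}^{t} μ^{t−i} g_i    (7)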
In the original formulation (6), if the learning rate were to change at time t, then only the magnitude of the term at i = t in the summation would be affected, and the magnitudes of all the other terms would remain the same. As a result, the immediate impact of the learning rate change is quite limited, and we would have to wait for the change to “trickle” down over subsequent time steps to have a stronger influence on the overall step size. In contrast, in the PyTorch formulation (7), if the learning rate were to change at time t, then the magnitude of the entire step would be affected immediately.
For v_0 = 0, it is clear from the expanded rules that the second difference ultimately has no effect; in either formulation, the step works out to a discounted sum of gradients that is subtracted from the current parameters.
Ignoring weight decay and dampening, by examining the SGD algorithm in PyTorch’s documentation, we can see that the implemented update rules are:
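v_{t+1} = μ v_t + ∇f(θ’_t)    (8)
θ’_{t+1} = θ’_t − ε (∇f(θ’_t) + μ v_{t+1})    (9)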
where θ’_t are the model parameters at time t and v_{t+1} is the momentum buffer at time t + 1.
We will refer to equations 3 and 4 as the PyTorch “note” formulation, and equations 8 and 9 as the PyTorch “implemented” formulation. We make a distinction between θ and θ’ for a reason that will become apparent soon. The most glaring difference from the note formulation is that the gradient is evaluated at the current parameters rather than the shifted parameters. From this alone, it would appear that the update rules the algorithm implements are not a proper implementation of Nesterov momentum.
We will now examine how the PyTorch algorithm ultimately approximates Nesterov momentum. Derivations for an older version of PyTorch can be found here from Ivo Danihelka, referenced in this GitHub issue. Derivations for the current version of PyTorch can be found here, which are a relatively straightforward adjustment of the earlier derivations. We provide a LaTeX rendering of these (re-derived) derivations here for the reader’s convenience. The implemented formulation is derived by a simple change of variables. Specifically, we let:
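θ’_t ≡ θ_t − ε μ v_t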
It immediately becomes clear that the note update rule for v_{t+1} (3) becomes equal to the implemented update rule for v_{t+1} (8) after the change of variables. We now want to derive an update rule for θ’_{t+1} in terms of θ’_t:
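θ’_{t+1} = θ_{t+1} − ε μ v_{t+1}
         = θ_t − ε v_{t+1} − ε μ v_{t+1}
         = θ’_t + ε μ v_t − ε (μ v_t + ∇f(θ’_t)) − ε μ v_{t+1}
         = θ’_t − ε (∇f(θ’_t) + μ v_{t+1})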
This is exactly the update rule we saw implemented in PyTorch (9). At a high level, the PyTorch implementation assumes the current parameters θ’_t are already the shifted version of the “actual” parameters θ_t. Hence, at each time step, the “actual” parameters θ_t are related to the current parameters θ’_t by:
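θ_t = θ’_t + ε μ v_t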
However, it appears from the source code that the PyTorch SGD implementation does not make any correction at the end of the algorithm to retrieve the final “actual” parameters, so the final output is technically an approximation of the “actual” parameters.
Finally, we now show that v_0 must be 0:
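Since the algorithm initializes the momentum buffer to the gradient at the first step, we have v_1 = ∇f(θ’_0). Comparing this with the implemented update rule for v_{t+1} (8),

v_1 = μ v_0 + ∇f(θ’_0) = ∇f(θ’_0)  ⟹  μ v_0 = 0  ⟹  v_0 = 0

(assuming μ ≠ 0).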
Furthermore, we are able to affirm that the primary replace to the “precise” parameters is identical first replace made within the authentic formulation when v_0 = 0:
We can see that this is equivalent to equation 5.
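Before moving on, the relationship between the implemented and note formulations can also be checked numerically. Below is a minimal sketch that runs torch.optim.SGD with nesterov=True alongside a manual implementation of the note formulation (3, 4) and verifies that recovering the “actual” parameters via θ_t = θ’_t + ε μ v_t reproduces the note-formulation parameters; the toy quadratic loss and the hyperparameter values are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
mu, eps = 0.9, 0.1  # arbitrary momentum factor and learning rate

def loss_fn(p):
    # Toy quadratic loss, chosen only for illustration.
    return (p ** 2).sum()

# PyTorch's implemented formulation (8, 9): maintains the shifted parameters theta'.
p_impl = torch.randn(3, requires_grad=True)
opt = torch.optim.SGD([p_impl], lr=eps, momentum=mu, nesterov=True)

# Manual "note" formulation (3, 4): maintains the "actual" parameters theta.
p_note = p_impl.detach().clone()
v = torch.zeros_like(p_note)  # v_0 = 0

for step in range(5):
    # One step of PyTorch's implemented update.
    opt.zero_grad()
    loss_fn(p_impl).backward()
    opt.step()

    # One step of the note formulation: gradient at the shifted parameters.
    shifted = (p_note - eps * mu * v).requires_grad_(True)
    loss_fn(shifted).backward()
    v = mu * v + shifted.grad   # equation 3
    p_note = p_note - eps * v   # equation 4

    # Recover the "actual" parameters from theta' via theta = theta' + eps * mu * v.
    buf = opt.state[p_impl]["momentum_buffer"]
    recovered = p_impl.detach() + eps * mu * buf
    print(step, torch.allclose(recovered, p_note, atol=1e-6))
```

Each printed line should read True, reflecting that the change of variables is exact at every step; the approximation only enters because the final θ’ is never corrected back to θ.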
Of course, the big remaining question is: why does PyTorch bother at all to reformulate Nesterov momentum from equations 3 and 4 to equations 8 and 9? One possible explanation is that the reformulation may provide some savings in the number of arithmetic operations required. To evaluate this possible explanation, let’s count the number of arithmetic operations. For the note formulation (3, 4), we have:
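- μ v_t, ε (μ v_t), and θ_t − ε μ v_t to form the shifted parameters: two multiplications and one subtraction
- μ v_t and μ v_t + ∇f(θ_t − ε μ v_t) to form v_{t+1}: one multiplication and one addition
- ε v_{t+1} and θ_t − ε v_{t+1} to form θ_{t+1}: one multiplication and one subtraction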
Here, there are a total of seven operations. For the implemented formulation (8, 9), we have:
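- μ v_t and μ v_t + ∇f(θ’_t) to form v_{t+1}: one multiplication and one addition
- μ v_{t+1} and ∇f(θ’_t) + μ v_{t+1}: one multiplication and one addition
- ε (∇f(θ’_t) + μ v_{t+1}) and θ’_t − ε (∇f(θ’_t) + μ v_{t+1}) to form θ’_{t+1}: one multiplication and one subtraction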
Here, there are a total of six operations. The second gradient in the PyTorch implementation simply reuses the stored result of the first gradient computation, so only one gradient computation is performed at each time step. So, one apparent benefit is that the PyTorch implementation saves one multiplication operation at each step.
In summary:
- The update rules stated in PyTorch’s SGD documentation note (3, 4) place the learning rate differently than the original Nesterov momentum update rules (1, 2). This allows learning rate schedules to have an immediate effect on the overall step size, whereas in the original formulation the effect of learning rate changes would “trickle” down over subsequent time steps.
- The update rules implemented in the PyTorch SGD algorithm (8, 9) are an approximation of the update rules stated in the documentation note (3, 4) after a simple change of variables. Although the “actual” parameters are easily recoverable from the current parameters at each time step, the PyTorch implementation does not make any such correction at the end of the algorithm, and so the final parameters technically remain an approximation of the “actual” final parameters.
- An apparent benefit of the PyTorch implementation is that it avoids an additional multiplication operation at each time step.
- “SGD.” SGD — PyTorch 2.0 Documentation, pytorch.org/docs/stable/generated/torch.optim.SGD.html. Accessed 2 Sept. 2023.
- Sutskever, Ilya, et al. “On the importance of initialization and momentum in deep learning.” International Conference on Machine Learning. PMLR, 2013.
- Danihelka, Ivo. “Nesterov’s Momentum Made Simple.” 25 Aug. 2012.
- Chintala, Soumith. “nesterov momentum is wrong in sgd · Issue #27 · torch/optim.” GitHub, 13 Oct. 2014, github.com/torch/optim/issues/27.
- Gross, Sam. “Add a note in the docs about the momentum formulation used in optim · Issue #1099 · pytorch/pytorch.” GitHub, 25 Mar. 2017, github.com/pytorch/pytorch/issues/1099#issuecomment-289190614.
- Zhao, Yilong. “fix Nesterov Momentum Bug · Pull Request #5920 · pytorch/pytorch.” GitHub, 21 Mar. 2018, github.com/pytorch/pytorch/pull/5920#issuecomment-375181908.