While looking through the implementation of SGD + Momentum in PyTorch, I noticed something slightly different from how other packages (and papers) describe it. For the time being, let's focus only on (classical) momentum and not the Nesterov version.

At the time of writing, the implementation reads:
```
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        buf = param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        buf.mul_(momentum).add_(1 - dampening, d_p)
    if nesterov:
        d_p = d_p.add(momentum, buf)
    else:
        d_p = buf

p.data.add_(-group['lr'], d_p)
```
Mathematically, if we denote the momentum buffer by `v` and assume that `dampening=0`, at every iteration the buffer is updated as `v = m*v + g` and the step is `Δx = lr * v`. Notice that the learning rate `lr` hits the momentum term `v` as well as the gradient. To me, this differs from classical momentum, and also from how other packages implement SGD+M.
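To spell out the effective step, here is a minimal scalar sketch of this rule in plain Python (names of my own choosing, assuming `dampening=0`):

```
def pytorch_style_step(x, g, v, lr, m):
    # Buffer update: the learning rate does not appear here.
    v = m * v + g
    # Step: lr scales the entire velocity; expanding gives
    # lr*m*v_old + lr*g, so lr multiplies the momentum term too.
    x = x - lr * v
    return x, v
```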
Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, and Neon.
# [Sutskever et al.](http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf)
The snippet of the relevant section is pasted below.
![Sutskever et al.](http://i.imgur.com/QJelodE.png)
Retaining the syntax from above, the algorithm updates `v` as `v = m*v - lr * g` with the step `Δx = v`. So the learning rate `lr` only hits the gradient. It does not (explicitly) influence the effect of the momentum term, which is in contrast with PyTorch's implementation.
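For comparison, the same kind of scalar sketch for the Sutskever rule (again with hypothetical names, classical momentum only):

```
def sutskever_style_step(x, g, v, lr, m):
    # Velocity update: lr scales only the gradient.
    v = m * v - lr * g
    # Step: the velocity is applied as-is.
    x = x + v
    return x, v
```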
# [Lasagne](https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L217)
Lasagne employs the same rule as suggested in Sutskever for momentum.
```
for param in params:
    value = param.get_value(borrow=True)
    velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
    x = momentum * velocity + updates[param]
    updates[velocity] = x - param
```
# [Keras](https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L141)
Same for Keras:
```
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))

    if self.nesterov:
        new_p = p + self.momentum * v - lr * g
    else:
        new_p = p + v
```
# [Neon](https://github.com/NervanaSystems/neon/blob/master/neon/optimizers/optimizer.py#L520)
and Neon.
```
velocity[:] = self.momentum_coef * velocity - lrate * grad

# Nesterov accelerated gradient (NAG) is implemented the same
# as in torch's "sgd.lua". It's a reformulation of Sutskever's
# NAG equation found in "On the importance of initialization
# and momentum in deep learning".
if self.nesterov:
    param[:] = param + self.momentum_coef * velocity -\
               lrate * grad
else:
    param[:] = param + velocity
```
Is the disparity real, or am I missing something?

The difference between the two implementations may not matter much in practice, especially when `lr` decreases over time. If my claim is true, could we update the references (I'm not sure which) or include a note about this difference in the SGD code (which I could write if needed)?
For a fixed learning rate, the two formulations are equivalent. The Torch formulation is preferred because the step size is directly proportional to the learning rate: if you decrease the learning rate, the step size decreases immediately rather than after some number of iterations, which is generally what you want.
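As a quick numerical check of that equivalence, a toy sketch on `f(x) = x**2` with made-up constants; the change of variables is `v_sutskever = -lr * v_pytorch`:

```
lr, m = 0.1, 0.9
x_pt = x_su = 1.0   # identical starting points
v_pt = v_su = 0.0
for _ in range(10):
    g = 2 * x_pt              # gradient of f(x) = x**2
    v_pt = m * v_pt + g       # PyTorch: v = m*v + g
    x_pt -= lr * v_pt         # step = lr * v

    g = 2 * x_su
    v_su = m * v_su - lr * g  # Sutskever: v = m*v - lr*g
    x_su += v_su              # step = v

print(x_pt, x_su)  # prints the same value twice for a fixed lr
```

If `lr` changed mid-run, the two trajectories would diverge, which is exactly the point made above.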
I agree. My only concern is that the reference for this method is the Sutskever paper, and given the lack of documentation explaining the difference, the current implementation could be a potential _"gotcha"_ for people moving to PyTorch from other frameworks.
@keskarnitish If you send a PR adding a note to the docs, I'm happy to merge it.