PyTorch: Add a note to the documentation about the momentum formula used in optim

Created on 25 Mar 2017  ·  3 comments  ·  Source: pytorch/pytorch

PyTorch์—์„œ SGD + Momentum์˜ ๊ตฌํ˜„์„ ์‚ดํŽด๋ณด์•˜๊ณ  ๋‹ค๋ฅธ ํŒจํ‚ค์ง€(๋ฐ ๋…ผ๋ฌธ)์—์„œ ์„ค๋ช…ํ•˜๋Š” ๋ฐฉ์‹๊ณผ ์•ฝ๊ฐ„ ๋‹ค๋ฅธ ์ ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹น๋ถ„๊ฐ„์€ Nesterov ๋ฒ„์ „์ด ์•„๋‹Œ (๊ณ ์ „์ ์ธ) ๋ชจ๋ฉ˜ํ…€์—๋งŒ ์ง‘์ค‘ํ•ฉ์‹œ๋‹ค.

At the time of writing, the implementation reads:

```
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        buf = param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        buf.mul_(momentum).add_(1 - dampening, d_p)
    if nesterov:
        d_p = d_p.add(momentum, buf)
    else:
        d_p = buf

p.data.add_(-group['lr'], d_p)
```
Mathematically, if we denote the momentum buffer by `v` and assume that `dampening=0`, at every iteration, the buffer is updated as `v = m*v + g` and the step is `โˆ†x = lr * v`. Notice that the learning rate `lr` hits the momentum term `v` as well as the gradient. To me, this is different from what classical momentum is, and also differs from how other packages implement SGD+M.
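For concreteness, here is a tiny standalone sketch of that update rule (illustrative only, not the actual optimizer code; it assumes `dampening=0` and uses a made-up quadratic objective):

```
# Toy objective f(x) = 0.5 * x**2, so the gradient at x is simply x.
lr, m = 0.1, 0.9
x, v = 3.0, 0.0            # parameter and momentum buffer

for _ in range(10):
    g = x                  # gradient at the current iterate
    v = m * v + g          # buffer update: v = m*v + g
    x = x - lr * v         # step: โˆ†x = lr * v, so lr scales buffer and gradient alike
```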

Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, Neon, etc.

## [Sutskever et al.](http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf)
The snippet of the relevant section is pasted below. 
![Sutskever et. al.](http://i.imgur.com/QJelodE.png)
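For reference, in case the image above does not render, the classical momentum update it shows is (my transcription, with ฮผ the momentum coefficient and ฮต the learning rate):

```
v_{t+1} = ฮผ * v_t - ฮต * โˆ‡f(ฮธ_t)
ฮธ_{t+1} = ฮธ_t + v_{t+1}
```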

Retaining the syntax from above, the algorithm updates `v` as `v = m*v - lr * g` with the step `โˆ†x = v`. So, the learning rate `lr` only hits the gradient. It does not (explicitly) influence the effect of the momentum term which is in contrast with PyTorch's implementation. 
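Using the same toy setup as the sketch above, the Sutskever-style step would look roughly like this (again just an illustration, not any package's real code):

```
# Same toy quadratic; here lr multiplies only the fresh gradient term.
lr, m = 0.1, 0.9
x, v = 3.0, 0.0            # parameter and velocity

for _ in range(10):
    g = x                  # gradient at the current iterate
    v = m * v - lr * g     # velocity update: v = m*v - lr*g
    x = x + v              # step: โˆ†x = v, no extra lr factor on the momentum term
```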

## [Lasagne](https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L217)

Lasagne employs the same rule as suggested in Sutskever for momentum. 

```
for param in params:
    value = param.get_value(borrow=True)
    velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
    x = momentum * velocity + updates[param]
    updates[velocity] = x - param
```
## [Keras](https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L141)

Same for Keras:
```
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))

    if self.nesterov:
        new_p = p + self.momentum * v - lr * g
    else:
        new_p = p + v
```
## [Neon](https://github.com/NervanaSystems/neon/blob/master/neon/optimizers/optimizer.py#L520)

and Neon:

```
velocity[:] = self.momentum_coef * velocity - lrate * grad

# Nesterov accelerated gradient (NAG) is implemented the same
# as in torch's "sgd.lua". It's a reformulation of Sutskever's
# NAG equation found in "On the importance of initialization
# and momentum in deep learning".
if self.nesterov:
    param[:] = param + self.momentum_coef * velocity -\
               lrate * grad
else:
    param[:] = param + velocity
```
Is the discrepancy real, or am I missing something important?

๋‘ ๊ตฌํ˜„ ๊ฐ„์˜ ์ฐจ์ด๋Š” ์ค‘์š”ํ•˜์ง€ ์•Š์œผ๋ฉฐ ํŠนํžˆ lr ๊ฐ€ ๋„์ค‘์— ์ค„์–ด๋“ค ๋•Œ ๊ทธ๋ ‡์Šต๋‹ˆ๋‹ค. ๋‚ด ์ฃผ์žฅ์ด ์‚ฌ์‹ค์ด๋ผ๋ฉด ์ฐธ์กฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๊ฑฐ๋‚˜(๋ฌด์—‡์ธ์ง€ ํ™•์‹คํ•˜์ง€ ์•Š์Œ) SGD ์ฝ”๋“œ์— ์œ„ ๋ฒ„์ „์„ ํฌํ•จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ(ํ•„์š”ํ•œ ๊ฒฝ์šฐ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Œ)?

medium priority (this tag is deprecated)

All 3 comments

For a fixed learning rate the two formulations are identical. The Torch formulation is chosen because the step size is directly proportional to the learning rate. This means that when you reduce the learning rate, the step size shrinks immediately rather than after some number of iterations, which is generally what you want.
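One way to convince yourself of the equivalence: with `dampening=0` and a fixed `lr`, the Torch buffer is just the Sutskever velocity rescaled by `-1/lr`, so the parameter iterates coincide. A small self-contained check (toy quadratic, illustrative names only):

```
import math

lr, m = 0.1, 0.9
x_t, v_t = 3.0, 0.0    # Torch-style buffer:       v = m*v + g,      step x -= lr*v
x_s, v_s = 3.0, 0.0    # Sutskever-style velocity: v = m*v - lr*g,   step x += v

for _ in range(50):
    v_t = m * v_t + x_t        # gradient of 0.5*x**2 at x is x
    x_t = x_t - lr * v_t
    v_s = m * v_s - lr * x_s
    x_s = x_s + v_s

assert math.isclose(x_t, x_s, abs_tol=1e-9)  # identical iterates when lr is held fixed
print(x_t, x_s)
```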

๋‚˜๋Š” ๋™์˜ํ•œ๋‹ค. ๋‚ด ์œ ์ผํ•œ ๊ด€์‹ฌ์‚ฌ๋Š” ์ด ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์ฐธ์กฐ๊ฐ€ Sutskever ๋…ผ๋ฌธ์ด๊ณ  ์ฐจ์ด์ ์„ ์„ค๋ช…ํ•˜๋Š” ๋ฌธ์„œ๊ฐ€ ์—†๋‹ค๋Š” ์ ์„ ๊ฐ์•ˆํ•  ๋•Œ ํ˜„์žฌ ๊ตฌํ˜„์ด ๋‹ค๋ฅธ ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ PyTorch๋กœ ์ด๋™ํ•˜๋Š” ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ์ž ์žฌ์ ์ธ _"๊ณค๊ฒฝ"์ด ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

@keskarnitish If you send a PR adding a note to the docs, I'd be happy to merge it.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰