While looking through the implementation of SGD + Momentum in PyTorch, I noticed something slightly different from how other packages (and papers) describe it. For the time being, let's focus only on (classical) momentum and not the Nesterov version.

At the time of writing, the implementation reads:
```
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        buf = param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        buf.mul_(momentum).add_(1 - dampening, d_p)
    if nesterov:
        d_p = d_p.add(momentum, buf)
    else:
        d_p = buf

p.data.add_(-group['lr'], d_p)
```
Mathematically, if we denote the momentum buffer by `v` and assume that `dampening=0`, at every iteration the buffer is updated as `v = m*v + g` and the step is `Δx = lr * v`. Notice that the learning rate `lr` hits the momentum term `v` as well as the gradient. To me, this differs from classical momentum, and also from how other packages implement SGD+M.
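To spell out the effective step, here is a minimal scalar sketch of this rule in plain Python (names of my own choosing, assuming `dampening=0`):

```
def pytorch_style_step(x, g, v, lr, m):
    # Buffer update: the learning rate does not appear here.
    v = m * v + g
    # Step: lr scales the entire velocity; expanding gives
    # lr*m*v_old + lr*g, so lr multiplies the momentum term too.
    x = x - lr * v
    return x, v
```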
Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, and Neon.
# [Sutskever et al.](http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf)
The snippet of the relevant section is pasted below.
![Sutskever et al.](http://i.imgur.com/QJelodE.png)
Retaining the syntax from above, the algorithm updates `v` as `v = m*v - lr * g` with the step `Δx = v`. So the learning rate `lr` only hits the gradient. It does not (explicitly) influence the effect of the momentum term, which is in contrast with PyTorch's implementation.
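For comparison, the same kind of scalar sketch for the Sutskever rule (again with hypothetical names, classical momentum only):

```
def sutskever_style_step(x, g, v, lr, m):
    # Velocity update: lr scales only the gradient.
    v = m * v - lr * g
    # Step: the velocity is applied as-is.
    x = x + v
    return x, v
```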
# [Lasagne](https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L217)
Lasagne employs the same rule as suggested in Sutskever for momentum.
```
for param in params:
    value = param.get_value(borrow=True)
    velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
    x = momentum * velocity + updates[param]
    updates[velocity] = x - param
```
# [Keras](https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L141)
Same for Keras:
```
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))

    if self.nesterov:
        new_p = p + self.momentum * v - lr * g
    else:
        new_p = p + v
```
# [Neon](https://github.com/NervanaSystems/neon/blob/master/neon/optimizers/optimizer.py#L520)
and Neon.
```
velocity[:] = self.momentum_coef * velocity - lrate * grad

# Nesterov accelerated gradient (NAG) is implemented the same
# as in torch's "sgd.lua". It's a reformulation of Sutskever's
# NAG equation found in "On the importance of initialization
# and momentum in deep learning".
if self.nesterov:
    param[:] = param + self.momentum_coef * velocity -\
               lrate * grad
else:
    param[:] = param + velocity
```
Is the disparity real, or am I missing something?

The difference between the two implementations may not matter much in practice, especially when `lr` decreases over time. If my claim is true, could we update the references (I'm not sure which) or include a note about this difference in the SGD code (which I could write if needed)?
For a fixed learning rate, the two formulations are equivalent. The Torch formulation is preferred because the step size is directly proportional to the learning rate: if you decrease the learning rate, the step size decreases immediately rather than after some number of iterations, which is generally what you want.
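As a quick numerical check of that equivalence, a toy sketch on `f(x) = x**2` with made-up constants; the change of variables is `v_sutskever = -lr * v_pytorch`:

```
lr, m = 0.1, 0.9
x_pt = x_su = 1.0   # identical starting points
v_pt = v_su = 0.0
for _ in range(10):
    g = 2 * x_pt              # gradient of f(x) = x**2
    v_pt = m * v_pt + g       # PyTorch: v = m*v + g
    x_pt -= lr * v_pt         # step = lr * v

    g = 2 * x_su
    v_su = m * v_su - lr * g  # Sutskever: v = m*v - lr*g
    x_su += v_su              # step = v

print(x_pt, x_su)  # prints the same value twice for a fixed lr
```

If `lr` changed mid-run, the two trajectories would diverge, which is exactly the point made above.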
I agree. My only concern is that the reference for this method is the Sutskever paper, and given the lack of documentation explaining the difference, the current implementation could be a potential _"gotcha"_ for people moving to PyTorch from other frameworks.
@keskarnitish If you send a PR adding a note to the docs, I'm happy to merge it.