I have been looking at the implementation of SGD + Momentum in PyTorch and noticed something a bit different from how other packages (and papers) describe it. For now, let's focus only on (classical) momentum and not Nesterov's version.

At the time of this writing, the implementation reads as follows:
```python
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        buf = param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        buf.mul_(momentum).add_(1 - dampening, d_p)
    if nesterov:
        d_p = d_p.add(momentum, buf)
    else:
        d_p = buf
p.data.add_(-group['lr'], d_p)
```
Mathematically, if we denote the momentum buffer by `v` and assume that `dampening=0`, at every iteration, the buffer is updated as `v = m*v + g` and the step is `∆x = lr * v`. Notice that the learning rate `lr` hits the momentum term `v` as well as the gradient. To me, this is different from what classical momentum is, and also differs from how other packages implement SGD+M.
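To make the rule above concrete, here is a minimal sketch of one update step (assuming `dampening=0`; the function and variable names are illustrative, not PyTorch's API):

```python
# One step of the PyTorch-style rule: v = m*v + g, then x -= lr*v.
# (Hypothetical helper for illustration; scalars stand in for tensors.)
def pytorch_sgdm_step(param, grad, buf, lr, momentum):
    buf = momentum * buf + grad   # v = m*v + g
    param = param - lr * buf      # step is lr * v
    return param, buf
```

Note that `lr` multiplies the entire buffer `buf`, so it scales the accumulated momentum as well as the fresh gradient.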
Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, Neon, etc.
## [Sutskever et al.](http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf)
The snippet of the relevant section is pasted below.
![Sutskever et. al.](http://i.imgur.com/QJelodE.png)
Retaining the syntax from above, the algorithm updates `v` as `v = m*v - lr * g` with the step `∆x = v`. So, the learning rate `lr` only hits the gradient. It does not (explicitly) influence the effect of the momentum term which is in contrast with PyTorch's implementation.
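For contrast, the Sutskever-style step can be sketched the same way (illustrative names, not code from the paper):

```python
# One step of the Sutskever-style rule: v = m*v - lr*g, then x += v.
# (Hypothetical helper for illustration; scalars stand in for tensors.)
def sutskever_sgdm_step(param, grad, vel, lr, momentum):
    vel = momentum * vel - lr * grad  # lr only scales the gradient
    param = param + vel               # step is v itself
    return param, vel
```

Here `lr` is folded into the velocity as the gradient enters it, so a later change of `lr` does not rescale the momentum already accumulated.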
## [Lasagne](https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L217)
Lasagne employs the same rule as suggested in Sutskever for momentum:

```python
for param in params:
    value = param.get_value(borrow=True)
    velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
    x = momentum * velocity + updates[param]
    updates[velocity] = x - param
    updates[param] = x
```
## [Keras](https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L141)
Same for Keras:
```python
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))

    if self.nesterov:
        new_p = p + self.momentum * v - lr * g
    else:
        new_p = p + v
```
## [Neon](https://github.com/NervanaSystems/neon/blob/master/neon/optimizers/optimizer.py#L520)
and Neon:
```python
velocity[:] = self.momentum_coef * velocity - lrate * grad

# Nesterov accelerated gradient (NAG) is implemented the same
# as in torch's "sgd.lua". It's a reformulation of Sutskever's
# NAG equation found in "On the importance of initialization
# and momentum in deep learning".
if self.nesterov:
    param[:] = param + self.momentum_coef * velocity -\
               lrate * grad
else:
    param[:] = param + velocity
```
Is this difference real, or am I missing something important? The difference between the two implementations is not trivial, especially when `lr` is decayed along the way. If my claim holds, perhaps we could update the references (I'm not sure what those would be) or include the version above in the SGD code (which I'd be happy to take on, if necessary)?
For a fixed learning rate, the two formulations are equivalent. The Torch formulation was chosen because the step size is directly proportional to the learning rate. This means that if you decrease the learning rate, the step size decreases immediately, rather than after some number of iterations, which is generally what you want.
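This point can be checked with a quick numerical sketch (the helper name and gradient sequence are hypothetical): with a fixed `lr` the two rules trace identical trajectories, but as soon as `lr` is dropped, the PyTorch-style step shrinks immediately while the Sutskever-style momentum term keeps its old scale for a while:

```python
# Run either update rule on a fixed gradient sequence with a
# per-step learning-rate schedule (toy scalars for illustration).
def run(rule, grads, lrs, momentum=0.9):
    x, v = 0.0, 0.0
    traj = []
    for g, lr in zip(grads, lrs):
        if rule == "pytorch":
            v = momentum * v + g      # lr rescales the whole buffer
            x = x - lr * v
        else:                         # "sutskever"
            v = momentum * v - lr * g # lr is baked into the velocity
            x = x + v
        traj.append(x)
    return traj

grads = [1.0] * 6
lrs_fixed = [0.1] * 6               # constant lr: trajectories coincide
lrs_drop = [0.1] * 3 + [0.01] * 3   # lr dropped at step 4: they diverge
```

With `lrs_fixed` both calls return the same trajectory; with `lrs_drop` the PyTorch-style iterate takes a much smaller fourth step than the Sutskever-style one.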
I agree. My only concern is that the current implementation may be a potential _"gotcha"_ for people migrating to PyTorch from other frameworks, given that the method's reference is the Sutskever paper and there is no documentation explaining the difference.
@keskarnitish If you send a PR adding a note to the docs, I'd be happy to merge it.