PyTorch: Add a note in the docs about the momentum formulation used in optim

Created on 25 Mar 2017 · 3 Comments · Source: pytorch/pytorch

I have been looking at the implementation of SGD + Momentum in PyTorch and noticed something a bit different from how other packages (and papers) describe it. For the moment, let's focus solely on (classical) momentum and not Nesterov's version.

At the time of writing, the implementation reads:

```
if momentum != 0:
    param_state = self.state[p]
    if 'momentum_buffer' not in param_state:
        buf = param_state['momentum_buffer'] = d_p.clone()
    else:
        buf = param_state['momentum_buffer']
        buf.mul_(momentum).add_(1 - dampening, d_p)
    if nesterov:
        d_p = d_p.add(momentum, buf)
    else:
        d_p = buf

p.data.add_(-group['lr'], d_p)
```

Mathematically, if we denote the momentum buffer by `v` and assume that `dampening=0`, the buffer is updated at every iteration as `v = m*v + g` and the step is `∆x = -lr * v`. Notice that the learning rate `lr` scales the momentum term `v` as well as the gradient. To me, this differs from classical momentum and from how other packages implement SGD+M.
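For reference, the snippet can be transcribed with the dampening term kept. This is only a restatement of the code for the non-Nesterov branch; the symbols are my own shorthand (μ for `momentum`, τ for `dampening`, η for `lr`, g_t for `d_p` at iteration t):

```latex
\begin{aligned}
v_{1} &= g_{1} && \text{(first step: the buffer is a copy of the gradient)} \\
v_{t} &= \mu\, v_{t-1} + (1 - \tau)\, g_{t} && (t > 1) \\
x_{t} &= x_{t-1} - \eta\, v_{t}
\end{aligned}
```

Note that, read this way, dampening is not applied on the very first step.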

Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, Neon, etc.

# [Sutskever et al.](http://www.jmlr.org/proceedings/papers/v28/sutskever13.pdf)
The snippet of the relevant section is pasted below.
![Sutskever et al.](http://i.imgur.com/QJelodE.png)

Retaining the syntax from above, the algorithm updates `v` as `v = m*v - lr * g` with the step `∆x = v`. So, the learning rate `lr` only scales the gradient. It does not (explicitly) scale the momentum term, which is in contrast with PyTorch's implementation.
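To make the two rules concrete, here is a minimal sketch in plain Python (no framework code). The function names, the toy objective `f(x) = 0.5 * x**2`, and the hyperparameter values are illustrative assumptions, and `dampening` is taken to be 0. With a constant `lr`, the two buffers differ only by a factor of `-lr`, so the parameter trajectories should agree up to rounding.

```python
def pytorch_style_step(x, v, grad, lr, momentum):
    """lr scales the whole buffer: v = m*v + g, then x = x - lr*v."""
    v = momentum * v + grad
    x = x - lr * v
    return x, v


def sutskever_style_step(x, v, grad, lr, momentum):
    """lr scales only the gradient: v = m*v - lr*g, then x = x + v."""
    v = momentum * v - lr * grad
    x = x + v
    return x, v


def run(step, lr, momentum=0.9, x0=1.0, n_steps=20):
    """Minimise the toy objective f(x) = 0.5 * x**2, whose gradient is x."""
    x, v = x0, 0.0
    trajectory = []
    for _ in range(n_steps):
        grad = x  # gradient of 0.5 * x**2
        x, v = step(x, v, grad, lr, momentum)
        trajectory.append(x)
    return trajectory


xs_pytorch = run(pytorch_style_step, lr=0.1)
xs_sutskever = run(sutskever_style_step, lr=0.1)

# Largest difference between the two trajectories; with a fixed lr this
# should be at the level of floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(xs_pytorch, xs_sutskever))
print(max_gap)
```

Starting the buffer at zero matches the quoted PyTorch snippet when `dampening=0`, since the first update then gives `v = g`.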

# [Lasagne](https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L217)

Lasagne employs the same rule as suggested in Sutskever for momentum.

    for param in params:
        value = param.get_value(borrow=True)
        velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        x = momentum * velocity + updates[param]
        updates[velocity] = x - param

# [Keras](https://github.com/fchollet/keras/blob/master/keras/optimizers.py#L141)

Same for Keras:

    for p, g, m in zip(params, grads, moments):
        v = self.momentum * m - lr * g  # velocity
        self.updates.append(K.update(m, v))

        if self.nesterov:
            new_p = p + self.momentum * v - lr * g
        else:
            new_p = p + v

# [Neon](https://github.com/NervanaSystems/neon/blob/master/neon/optimizers/optimizer.py#L520)

And for Neon:

            velocity[:] = self.momentum_coef * velocity - lrate * grad

            # Nesterov accelerated gradient (NAG) is implemented the same
            # as in torch's "sgd.lua". It's a reformulation of Sutskever's
            # NAG equation found in "On the importance of initialization
            # and momentum in deep learning".
            if self.nesterov:
                param[:] = param + self.momentum_coef * velocity -\
                           lrate * grad
            else:
                param[:] = param + velocity

Is the disparity real, or am I missing something important?

The difference between the two implementations is not insignificant, especially when `lr` is reduced along the way. If my claim is true, maybe we could update the reference (I'm not sure what that would be) or include the above version in the SGD code (I can take this up if necessary).
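As a rough illustration of the learning-rate point, the sketch below runs both rules on the toy objective `f(x) = 0.5 * x**2` and cuts `lr` by 100x after 10 iterations. The schedule, the objective, and the helper name `step_sizes` are hypothetical choices for this example, with `dampening=0`. It compares the size of the first step after the cut with the last step before it.

```python
def step_sizes(style, lrs, momentum=0.9, x0=1.0):
    """Return |delta x| per iteration on f(x) = 0.5 * x**2 (gradient = x)."""
    x, v = x0, 0.0
    sizes = []
    for lr in lrs:
        grad = x
        if style == "pytorch":
            # lr multiplies the whole buffer at step time
            v = momentum * v + grad
            delta = -lr * v
        else:
            # "sutskever": lr is folded into the buffer when the gradient is added
            v = momentum * v - lr * grad
            delta = v
        x += delta
        sizes.append(abs(delta))
    return sizes


# Hypothetical schedule: lr drops from 1e-2 to 1e-4 after 10 iterations.
lrs = [0.01] * 10 + [0.0001] * 5

pt = step_sizes("pytorch", lrs)
su = step_sizes("sutskever", lrs)

# Ratio of the first step after the drop to the last step before it.
ratio_pt = pt[10] / pt[9]  # about 0.01: the step shrinks with lr right away
ratio_su = su[10] / su[9]  # about 0.9: the old velocity still dominates
print(ratio_pt, ratio_su)
```

In this toy run the two rules take identical steps up to the cut. Afterwards the PyTorch-style step scales down with `lr` immediately, while the Sutskever-style step only decays at roughly the momentum rate per iteration.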


keskarnitish

👍4


For a fixed learning rate, the two formulations are equivalent. The Torch formulation is chosen because the step size is directly proportional to the learning rate. This means that if you decrease the learning rate, the step size decreases immediately, and not after some number of iterations, which is generally what you want.

colesbury on 25 Mar 2017

👍3


I agree. My only concern was that, since the reference for the method is the Sutskever paper and there is no documentation explaining the difference, the current implementation could be a potential _"gotcha"_ for people moving to PyTorch from other frameworks.

keskarnitish on 25 Mar 2017

👍1

@keskarnitish if you send a PR adding a note to the docs, I am happy to merge it.

soumith on 5 Apr 2017

👍1
