Pytorch: optimizer load_state_dict() problem?

Created on 22 Sep 2017  ·  23 Comments  ·  Source: pytorch/pytorch

Hi, I encountered this bug:

    optimizer.step()
    exp_avg.mul_(beta1).add_(1 - beta1, grad)

TypeError: add_ received an invalid combination of arguments - got (float, torch.cuda.FloatTensor), but expected one of:
 * (float value)
 * (torch.FloatTensor other)
 * (torch.SparseFloatTensor other)
 * (float value, torch.FloatTensor other)
      didn't match because some of the arguments have invalid types: (float, torch.cuda.FloatTensor)
 * (float value, torch.SparseFloatTensor other)
      didn't match because some of the arguments have invalid types: (float, torch.cuda.FloatTensor)

The code skeleton looks like this:

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()

optimizer = optim.Adam()
optimizer.load_state_dict(checkpoint['optimizer'])

...
#  In train loop
for epoch in range(...):
  ...
  optimizer.step()
     -> BUG <-

It seems the loaded param_groups are torch.cuda.FloatTensor. I've tried a workaround that moves optimizer.param_groups to the CPU, but the same error persists.


Most helpful comment

@apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right?

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

All 23 comments

Could you provide a full script to reproduce the problem?

Maybe you can try something like this:
optimizer.step()
exp_avg.mul_(beta1).add_(1 - beta1, grad.cpu())

Sorry, I missed the reply email.

I am afraid I am unable to provide a reproducer right now. It is part of my work on the OpenNMT-py project: https://github.com/OpenNMT/OpenNMT-py, where I am trying to use lr_scheduler for learning-rate updates. I encountered this problem when testing the case of resuming a suspended training run, so I factored out the code skeleton above.

I've tried several methods, including the trick @hefeicyp suggests, but the error still occurs.

Per my analysis, it happens because the previous training was done on the GPU, so when optimizer.state_dict() was saved, the stored state tensors were CUDA tensors. When resuming, load_state_dict() loads this saved state as-is, and there is no way to move it to match the model (the model, being an nn.Module, can be moved between devices easily, but torch.optim optimizers seem to lack this ability?), so the devices no longer match and the problem emerges.
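For what it's worth, here is a hypothetical minimal sketch of the kind of device mismatch that produces the TypeError at the top of this thread. The Linear model and shapes are mine, and it assumes a PyTorch of this era (before load_state_dict() learned to cast the loaded state to the parameters' device) plus a checkpoint whose optimizer state holds CPU tensors:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Train briefly on the CPU so Adam's exp_avg buffers are CPU tensors.
    cpu_model = nn.Linear(4, 2)
    cpu_opt = optim.Adam(cpu_model.parameters())
    cpu_model(torch.randn(3, 4)).sum().backward()
    cpu_opt.step()
    checkpoint = {'model': cpu_model.state_dict(), 'optimizer': cpu_opt.state_dict()}

    # Resume on the GPU: parameters and grads are CUDA, loaded state is not.
    model = nn.Linear(4, 2)
    model.load_state_dict(checkpoint['model'])
    model.cuda()
    optimizer = optim.Adam(model.parameters())
    optimizer.load_state_dict(checkpoint['optimizer'])  # exp_avg stays on the CPU
    model(torch.randn(3, 4).cuda()).sum().backward()
    optimizer.step()  # CPU exp_avg meets CUDA grad -> TypeError in add_()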

Try moving the optimizer state to GPU memory manually after loading it from the checkpoint.

optimizer = optim.Adam()
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

I agree that having an optimizer.cuda() method for this operation would be nice.
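As a rough sketch of what such a method could look like (optimizer_to is a hypothetical name, not part of the torch.optim API, and tensor.to() assumes a later PyTorch release):

    import torch

    def optimizer_to(optimizer, device):
        # Walk the per-parameter state and move every tensor to the target
        # device, mirroring what Module.to() does for parameters and buffers.
        for state in optimizer.state.values():
            for k, v in state.items():
                if torch.is_tensor(v):
                    state[k] = v.to(device)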

@dogancan, thanks. My work is suspended due to other problems; when it resumes, I will try your method.

I'm afraid @dogancan's solution won't work. It will make the error go away, but your optimizer will no longer be training the model. You should recreate optimizers after casting modules to a different type or device, and you can use load_state_dict to restore the state from a previous copy. This currently doesn't work, but we should fix it (by copying the data from the state dict instead of using the tensors directly; this allows for cross-device or cross-type updates).

@apaszke, yep, your method is what I currently use, and it works. But I will wait for upstream to fix this problem. Thanks for your great work!

@apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right?

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

ah, right. That should work 😊

Except that you should use torch.is_tensor(v) instead of isinstance(v, torch.Tensor).

I had a similar problem. When I save the optimizer state from a GPU other than GPU 0 and then load it, everything still lands on GPU 0. Specifying map_location in torch.load() didn't work either. @dogancan's solution solves this, though.

Hi guys, I have a very similar problem to the one in this thread; here's my code:

model = inceptionresnetv2(num_classes=config['tr_classes'])
model = torch.nn.DataParallel(model).cuda()
model.load_state_dict(checkpoint['md_state_dict'])
optimizer = torch.optim.Adam(model.parameters(), lr=config['tr_lr'], weight_decay=config['tr_weightdecay'])
optimizer.load_state_dict(checkpoint['md_optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.cuda()

And then once I resume, I get KeyErrors from my optimizer:

---> 40         optimizer.step()
     41 
     42         config['am_batch_time'].update(time.time() - end)
~/.conda/envs/env_pytorch/lib/python3.5/site-packages/torch/optim/adam.py in step(self, closure)
     44                     continue
     45                 grad = p.grad.data
---> 46                 state = self.state[p]
     47 
     48                 # State initialization
KeyError: Parameter containing:
(0 ,0 ,.,.) = 
 -1.6336e-01 -5.6482e-01 -4.2228e-02
...
[torch.cuda.FloatTensor of size 32x3x3x3 (GPU 0)]

Do you guys know how to fix this issue? BTW, I am using 8 GPUs; could the issue be related to that?

@CodArs-van were you able to solve your issue with multiple GPUs?

@rafaelvalle Thanks for asking. Yes, I was. It turns out the issue was that I was using an old version of PyTorch; after updating it, everything works like a charm!

Just a comment: this problem is caused by

    def load_state_dict(self, state_dict):
        ...
        # deepcopy, to be consistent with module API
        state_dict = deepcopy(state_dict)
        ...

The deepcopy causes all state tensors to end up on GPU 0, so moving the optimizer state to the specific GPU afterwards fixes the problem.
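A hedged sketch of that fix: since the keys of optimizer.state are the Parameter objects themselves, each parameter can tell you where its state belongs (param.device assumes a PyTorch version that has the device attribute):

    for param, state in optimizer.state.items():
        for k, v in state.items():
            if torch.is_tensor(v):
                # Follow each parameter to its own device (e.g. cuda:3)
                # instead of letting deepcopy strand the state on cuda:0.
                state[k] = v.to(param.device)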

Hi @lzcn, how do you know the specific GPU location of different tensors in advance?

Would a feature where torch.save() always saves an automatically generated CPU copy be feasible?
At resume time, torch.load() would then use the current device (or any better strategy).
At the moment it seems we need a lot of boilerplate code to ensure saving and loading are consistent across devices for models/optimizers/schedulers/etc.
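A sketch of that convention, using a hypothetical cpu_state_dict() helper (the name and the recursive walk are mine) that copies every tensor in a possibly nested state dict to the CPU before saving:

    import torch

    def cpu_state_dict(obj):
        # Recursively copy tensors to the CPU so the checkpoint is
        # device-agnostic; pick the device with map_location at resume time.
        if torch.is_tensor(obj):
            return obj.cpu()
        if isinstance(obj, dict):
            return {k: cpu_state_dict(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [cpu_state_dict(v) for v in obj]
        return obj

    torch.save({'model': cpu_state_dict(model.state_dict()),
                'optimizer': cpu_state_dict(optimizer.state_dict())},
               'checkpoint.pt')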

I met a similar problem. Following @dogancan's solution, I recreated the Adam optimizer (without needing an optimizer.cuda()) after reloading the model, calling model.cuda(), and wrapping it in DataParallel(model).

Thanks, it works!

> @apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right? […]

@apaszke
Hi, you said that every time we move the model to another device we should rebuild the optimizer. But if we move the model to another device and then move it back, should we rebuild the optimizer again?
Here is some example code:

model = Model()
model.cuda()
optimizer = optim.Adam(model.parameters())

for d, gt in trn_dataloader:
    # train
    ... 
    optimizer.step()
    model.cpu() # move to cpu
    # eval or do other things
    ...
    model.cuda()  # but finally, move back

Does the optimizer run as expected?

Also, if we do model.to(model.device), should we rebuild the optimizer?
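One hedged data point, for the PyTorch versions of this era: Module.cpu()/.cuda() move parameter data in place, so the Parameter objects the optimizer holds stay the same ones; the open question is only whether the optimizer's state tensors are on the right device at step() time. Continuing from the snippet above:

    # Sanity-check sketch: the optimizer keeps references to the same
    # Parameter objects across a cpu()/cuda() round trip, because
    # Module._apply() replaces the data in place rather than the Parameter.
    p = next(model.parameters())
    model.cpu()
    model.cuda()
    assert p is next(model.parameters())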

> @apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right? […]

@apaszke Is there a problem if you switch the order to something like this?

```python
model = Model()
model.to('cuda')
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
model.load_state_dict(checkpoint['model'])
```

Meaning moving the model to 'cuda' but only loading its state dict from the checkpoint after loading the optimizer's state dict first?

To sum up: the optimizer's state is loaded onto the same device as the model. You must move the model to the GPU first and only then load the optimizer's state, so that both the model and the optimizer's state end up on the GPU.

Instead of moving the optimizer state to CUDA after loading it on the CPU, you can load the checkpoint directly onto the CUDA device:

model.to(device)

ckpt = torch.load(<model_path>, map_location=device)

model.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])

del ckpt
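For the multi-GPU case discussed earlier, map_location also accepts an explicit remapping; a small sketch, with 'checkpoint.pt' as a placeholder path:

    # Remap storages saved from cuda:1 onto cuda:0 at load time.
    ckpt = torch.load('checkpoint.pt', map_location={'cuda:1': 'cuda:0'})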