Pytorch: optimizer load_state_dict() problem?

Created on 22 Sep 2017  ·  23 Comments  ·  Source: pytorch/pytorch

Hi, I encountered this bug:

    optimizer.step()
    exp_avg.mul_(beta1).add_(1 - beta1, grad)

TypeError: add_ received an invalid combination of arguments - got (float, torch.cuda.FloatTensor), but expected one of:
 * (float value)
 * (torch.FloatTensor other)
 * (torch.SparseFloatTensor other)
 * (float value, torch.FloatTensor other)
      didn't match because some of the arguments have invalid types: (float, torch.cuda.FloatTensor)
 * (float value, torch.SparseFloatTensor other)
      didn't match because some of the arguments have invalid types: (float, torch.cuda.FloatTensor)

The code skeleton looks like this:

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()

optimizer = optim.Adam()
optimizer.load_state_dict(checkpoint['optimizer'])

...
#  In train loop
for epoch in range(...):
  ...
  optimizer.step()
     -> BUG <-

It seems the loaded param_groups are torch.cuda.FloatTensor. I've tried a workaround that moves optimizer.param_groups to the CPU, but the same error persists.


Most helpful comment

@apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right?

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

All 23 comments

Could you provide a full script to reproduce the problem?

Maybe you can try something like this:
optimizer.step()
exp_avg.mul_(beta1).add_(1 - beta1, grad.cpu())

Sorry, I missed the reply email.

I am afraid I am unable to provide a reproducer right now. It is part of my work on the OpenNMT-py project: https://github.com/OpenNMT/OpenNMT-py, where I am trying to use lr_scheduler for learning-rate updates. I encountered this problem when testing the case of resuming a suspended training run, so I factored out the code skeleton above.

I've tried several methods, including the trick @hefeicyp suggests, but the error still occurs.

Per my analysis, it happens because the previous training was done on the GPU, so when optimizer.state_dict() was saved, the stored state tensors were CUDA tensors. When resuming, load_state_dict() loads this saved state as-is, and there is no way to move it to match the model (the model, being an nn.Module, can be moved between devices easily, but torch.optim optimizers seem to lack this ability?), so the devices no longer match and the problem emerges.
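For what it's worth, here is a hypothetical minimal sketch of the kind of device mismatch that produces the TypeError at the top of this thread. The Linear model and shapes are mine, and it assumes a PyTorch of this era (before load_state_dict() learned to cast the loaded state to the parameters' device) plus a checkpoint whose optimizer state holds CPU tensors:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Train briefly on the CPU so Adam's exp_avg buffers are CPU tensors.
    cpu_model = nn.Linear(4, 2)
    cpu_opt = optim.Adam(cpu_model.parameters())
    cpu_model(torch.randn(3, 4)).sum().backward()
    cpu_opt.step()
    checkpoint = {'model': cpu_model.state_dict(), 'optimizer': cpu_opt.state_dict()}

    # Resume on the GPU: parameters and grads are CUDA, loaded state is not.
    model = nn.Linear(4, 2)
    model.load_state_dict(checkpoint['model'])
    model.cuda()
    optimizer = optim.Adam(model.parameters())
    optimizer.load_state_dict(checkpoint['optimizer'])  # exp_avg stays on the CPU
    model(torch.randn(3, 4).cuda()).sum().backward()
    optimizer.step()  # CPU exp_avg meets CUDA grad -> TypeError in add_()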

Try moving the optimizer state to GPU memory manually after loading it from the checkpoint.

optimizer = optim.Adam()
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

I agree that having an optimizer.cuda() method for this operation would be nice.
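As a rough sketch of what such a method could look like (optimizer_to is a hypothetical name, not part of the torch.optim API, and tensor.to() assumes a later PyTorch release):

    import torch

    def optimizer_to(optimizer, device):
        # Walk the per-parameter state and move every tensor to the target
        # device, mirroring what Module.to() does for parameters and buffers.
        for state in optimizer.state.values():
            for k, v in state.items():
                if torch.is_tensor(v):
                    state[k] = v.to(device)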

@dogancan, thanks. My work is suspended due to other problems; when it resumes, I will try your method.

I'm afraid @dogancan's solution won't work. It will make the error go away, but your optimizer will no longer be training the model. You should recreate optimizers after casting modules to a different type or device, and you can use load_state_dict to restore the state from a previous copy. This currently doesn't work, but we should fix it (by copying the data from the state dict instead of using the tensors directly; this allows for cross-device or cross-type updates).

@apaszke, yep, your method is what I currently use, and it works. But I will wait for upstream to fix this problem. Thanks for your great work!

@apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right?

model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

ah, right. That should work 😊

Except that you should use torch.is_tensor(v) instead of isinstance(v, torch.Tensor).

I had a similar problem. When I save the optimizer state from a GPU other than GPU 0 and then load it, everything still lands on GPU 0. Specifying map_location in torch.load() didn't work either. @dogancan's solution solves this, though.

Hi guys, I have a very similar problem to the one in this thread; here's my code:

model = inceptionresnetv2(num_classes=config['tr_classes'])
model = torch.nn.DataParallel(model).cuda()
model.load_state_dict(checkpoint['md_state_dict'])
optimizer = torch.optim.Adam(model.parameters(), lr=config['tr_lr'], weight_decay=config['tr_weightdecay'])
optimizer.load_state_dict(checkpoint['md_optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.cuda()

And then once I resume, I get KeyErrors from my optimizer:

---> 40         optimizer.step()
     41 
     42         config['am_batch_time'].update(time.time() - end)
~/.conda/envs/env_pytorch/lib/python3.5/site-packages/torch/optim/adam.py in step(self, closure)
     44                     continue
     45                 grad = p.grad.data
---> 46                 state = self.state[p]
     47 
     48                 # State initialization
KeyError: Parameter containing:
(0 ,0 ,.,.) = 
 -1.6336e-01 -5.6482e-01 -4.2228e-02
...
[torch.cuda.FloatTensor of size 32x3x3x3 (GPU 0)]

Do you guys know how to fix this issue? BTW, I am using 8 GPUs; could the issue be related to that?

@CodArs-van were you able to solve your issue with multiple GPUs?

@rafaelvalle Thanks for asking. Yes, I was. It turns out the issue was that I was using an old version of PyTorch; after updating it, everything works like a charm!

Just a comment: this problem is caused by

    def load_state_dict(self, state_dict):
        ...
        # deepcopy, to be consistent with module API
        state_dict = deepcopy(state_dict)
        ...

The deepcopy causes all state tensors to end up on GPU 0, so moving the optimizer state to the specific GPU afterwards fixes the problem.
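A hedged sketch of that fix: since the keys of optimizer.state are the Parameter objects themselves, each parameter can tell you where its state belongs (param.device assumes a PyTorch version that has the device attribute):

    for param, state in optimizer.state.items():
        for k, v in state.items():
            if torch.is_tensor(v):
                # Follow each parameter to its own device (e.g. cuda:3)
                # instead of letting deepcopy strand the state on cuda:0.
                state[k] = v.to(param.device)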

Hi @lzcn, how do you know the specific GPU location of different tensors in advance?

Would a feature where torch.save() always saves an automatically generated CPU copy be feasible?
At resume time, torch.load() would then use the current device (or any better strategy).
At the moment it seems we need a lot of boilerplate code to ensure saving and loading are consistent across devices for models/optimizers/schedulers/etc.
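A sketch of that convention, using a hypothetical cpu_state_dict() helper (the name and the recursive walk are mine) that copies every tensor in a possibly nested state dict to the CPU before saving:

    import torch

    def cpu_state_dict(obj):
        # Recursively copy tensors to the CPU so the checkpoint is
        # device-agnostic; pick the device with map_location at resume time.
        if torch.is_tensor(obj):
            return obj.cpu()
        if isinstance(obj, dict):
            return {k: cpu_state_dict(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [cpu_state_dict(v) for v in obj]
        return obj

    torch.save({'model': cpu_state_dict(model.state_dict()),
                'optimizer': cpu_state_dict(optimizer.state_dict())},
               'checkpoint.pt')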

I met a similar problem. Following @dogancan's solution, I recreated the Adam optimizer (without needing an optimizer.cuda()) after reloading the model, calling model.cuda(), and wrapping it in DataParallel(model).

Thanks, it works!

> @apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right? […]

@apaszke
Hi, you said that every time we move the model to another device we should rebuild the optimizer. But if we move the model to another device and then move it back, should we rebuild the optimizer again?
Here is some example code:

model = Model()
model.cuda()
optimizer = optim.Adam(model.parameters())

for d, gt in trn_dataloader:
    # train
    ... 
    optimizer.step()
    model.cpu() # move to cpu
    # eval or do other things
    ...
    model.cuda()  # but finally, move back

Does the optimizer run as expected?

Also, if we do model.to(model.device), should we rebuild the optimizer?
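One hedged data point, for the PyTorch versions of this era: Module.cpu()/.cuda() move parameter data in place, so the Parameter objects the optimizer holds stay the same ones; the open question is only whether the optimizer's state tensors are on the right device at step() time. Continuing from the snippet above:

    # Sanity-check sketch: the optimizer keeps references to the same
    # Parameter objects across a cpu()/cuda() round trip, because
    # Module._apply() replaces the data in place rather than the Parameter.
    p = next(model.parameters())
    model.cpu()
    model.cuda()
    assert p is next(model.parameters())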

> @apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right? […]

@apaszke Is there a problem if you switch the order to something like this?

```python
model = Model()
model.to('cuda')
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
model.load_state_dict(checkpoint['model'])
```

Meaning moving the model to 'cuda' but only loading its state dict from the checkpoint after loading the optimizer's state dict first?

To sum up: the optimizer's state is loaded onto the same device as the model. You must move the model to the GPU first and only then load the optimizer's state, so that both the model and the optimizer's state end up on the GPU.

Instead of moving the optimizer state to CUDA after loading it on the CPU, you can load the checkpoint directly onto the CUDA device:

model.to(device)

ckpt = torch.load(<model_path>, map_location=device)

model.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])

del ckpt
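For the multi-GPU case discussed earlier, map_location also accepts an explicit remapping; a small sketch, with 'checkpoint.pt' as a placeholder path:

    # Remap storages saved from cuda:1 onto cuda:0 at load time.
    ckpt = torch.load('checkpoint.pt', map_location={'cuda:1': 'cuda:0'})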