PyTorch: Variable input size training is slow

Created on 17 Sep 2017  ·  3 comments  ·  Source: pytorch/pytorch

I have a model modified from resnet50, with the last avgpool & fc removed.
I found that if I constantly change the input size during training, the speed is slow.

Minimal code:

import time
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import numpy as np
from torch.autograd import Variable

# ... remove avgpool & fc from resnet50 here
net = resnet50()
net.cuda()
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True

for i in range(10):
    # random spatial size in [400, 600); or fix h = w = 600
    h = np.random.randint(400, 600)
    w = np.random.randint(400, 600)
    x = Variable(torch.randn(1, 3, h, w)).cuda()

    t1 = time.time()
    y = net(x)  # forward pass only
    t2 = time.time()
    print(t2 - t1)

1. If I fix the input size to [600, 600], here is what I get on my machine with 8 Nvidia P40 GPUs:
3.14512705803
0.11568403244
0.0255229473114
0.0228650569916
0.0235478878021
0.0225219726562
0.0436158180237
0.0222969055176
0.0223350524902
0.0227248668671
2. If I instead vary the input size randomly within [400, 600], I get:
3.12573313713
0.670918941498
2.32590889931
2.3486700058
2.31507301331
0.593285083771
0.68169093132
2.34181690216
0.597991943359
1.74615192413

I also trained with CPU only, and both cases run at normal speed. So I think the cause might be related to CUDA overhead. Any ideas how to fix this?
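
(Note: CUDA kernels launch asynchronously, so bracketing net(x) with time.time() may not measure exactly what you expect. A more careful measurement synchronizes the device before reading the clock; a minimal sketch, assuming a CUDA device is available, with timed_forward being a helper name introduced here for illustration:)

import time
import torch

def timed_forward(net, x):
    # Wait for any pending GPU work so t1 marks a quiet starting point.
    torch.cuda.synchronize()
    t1 = time.time()
    y = net(x)
    # Wait for the forward pass itself to finish before reading t2;
    # without this, t2 - t1 mostly measures kernel launch time.
    torch.cuda.synchronize()
    t2 = time.time()
    return y, t2 - t1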

All 3 comments

Do you set cudnn.benchmark=True anywhere in your code? That is probably the culprit.

As @fmassa says here: https://discuss.pytorch.org/t/pytorch-performance/3079/7?u=smth

In benchmark mode, for each input size, cuDNN performs a bunch of computations to infer the fastest algorithm for that specific case, and caches the result. This brings some overhead, and if your input dimensions change all the time, using benchmark mode will actually slow things down because of this overhead.
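
In other words, benchmark mode only pays off when the input shape stays constant, so the autotuning cost is paid once and amortized. A minimal sketch of the takeaway (fixed_input_size is a hypothetical flag for your training run, not part of any API):

import torch.backends.cudnn as cudnn

fixed_input_size = False  # hypothetical flag: True if every batch has the same (h, w)

if fixed_input_size:
    # Autotune once per shape; the cached fastest algorithm is reused
    # on every subsequent forward pass with that shape.
    cudnn.benchmark = True
else:
    # Default mode: pick algorithms heuristically, so a new (h, w)
    # does not trigger a fresh autotuning run.
    cudnn.benchmark = False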

Cool. I commented that line out, and both cases run fine now.
Thanks.
