Caffe: How to train imagenet with reduced memory and batch size?

Created on 21 May 2014  ·  23 Comments  ·  Source: BVLC/caffe

Hi, thank you very much for this valuable library!

The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04

With the default training configuration file for the ImageNet data set, train_net.bin errors out with "out of memory". So I changed the batch_size to 64 (128 is also not valid). Then it works!
The following is the output of train_net.bin:
[image: output of train_net.bin]

And the results are as follows after 2000 iterations:
[image: test scores after 2000 iterations]

It seems the testing scores do not change. As indicated in https://github.com/BVLC/caffe/issues/218, @sguada said that the batch_size and the learning rate are linked. I have set the batch_size to 64, so maybe the learning rate should also be modified. Could anyone give any advice on this subject, please?

question

Most helpful comment

@research2010 Did you change the batch_size in the validation prototxt as well? That would also help you reduce memory usage.
Are you using the latest dev branch? Since #355, training and testing share the data blobs, which saves quite a bit of memory.

Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, there is some room for variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).

What you should change are the stepsize and max_iter, so as to keep the same learning schedule. If you divide the batch_size by X, then you should multiply both of them by X.

Pay attention to the loss: if it doesn't go below 6.9 (roughly ln(1000), i.e. random guessing over the 1000 classes) after 10k-20k iterations, then your training is not learning anything.

All 23 comments

@research2010 Did you change the batch_size in the validation prototxt as well? That would also help you reduce memory usage.
Are you using the latest dev branch? Since #355, training and testing share the data blobs, which saves quite a bit of memory.

Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, there is some room for variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).

What you should change are the stepsize and max_iter, so as to keep the same learning schedule. If you divide the batch_size by X, then you should multiply both of them by X.

Pay attention to the loss: if it doesn't go below 6.9 (roughly ln(1000), i.e. random guessing over the 1000 classes) after 10k-20k iterations, then your training is not learning anything.
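
To make the scaling concrete, here is a sketch of an adjusted solver, using the thread's own case of dropping the training batch_size from 256 to 64 (X = 4). The "original" values below are the commonly cited ImageNet reference settings and are an assumption; check the solver prototxt in your own checkout before copying anything.

```
# imagenet_solver.prototxt, hypothetically adjusted for batch_size: 64.
# Assumed reference values (batch_size: 256):
#   base_lr: 0.01   lr_policy: "step"   gamma: 0.1
#   stepsize: 100000   max_iter: 450000
base_lr: 0.02        # 0.01 * sqrt(4) by the sqrt rule; 0.04 with Alex's linear rule
lr_policy: "step"
gamma: 0.1
stepsize: 400000     # 100000 * 4, so the lr drops after the same number of epochs
max_iter: 1800000    # 450000 * 4, so the total number of epochs is unchanged
momentum: 0.9
weight_decay: 0.0005
```

These happen to be the same values that show up later in this thread for the batch_size 64 runs.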

@sguada , Thank you very much for your kind comments and suggestions.

I use the "git clone https://github.com/BVLC/caffe.git" to checkout the latest version at 2014-05-20. So maybe it isn't the dev branch, but it seems to have been patched by https://github.com/BVLC/caffe/pull/355/commits. I'll check the dev branch and rerun the experiments.

Recently I have been using the GPU to run other experiments, so I couldn't report the results in time. I'll give feedback as soon as the experiments on the ImageNet data set restart.

@sguada @kloudkl , thank you very much for replying!

I have been running the imagenet example again. And some results are as follows:

  1. When I use caffe-0.9 and the latest dev branch and use train_imagenet.sh to train the model, the test score doesn't seem to decrease. As suggested by @sguada, I modified the configuration as follows:
    (1) in imagenet_train.prototxt, the batch_size is 128,
    (2) in imagenet_val.prototxt, the batch_size is 16,
    (3) in imagenet_solver.prototxt, the learning rate is 0.014142, the stepsize is 200000 and the max_iter is 900000,
    and after 20k iterations the test score is still 6.9.
  2. When I use the latest dev branch and use train_alexnet.sh to train the model, it works fine! The modifications are as follows:
    (1) in alexnet_train.prototxt, the batch_size is 64,
    (2) in alexnet_val.prototxt, the batch_size is 32,
    (3) in alexnet_solver.prototxt, the learning rate is 0.02, the stepsize is 400000 and the max_iter is 1800000,
    and after only 4k iterations:

[image: test output after 4k iterations]

But when I use 128 as the training batch_size and 16 as the val batch_size, training with alexnet errors out with "out of memory".
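
One related detail when shrinking the validation batch_size as in the lists above: the solver's test_iter sets how many validation batches are run per test pass, so test_iter × batch_size should still cover the 50,000-image validation set. The reference pairing of batch_size 50 with test_iter 1000 is an assumption here; a minimal sketch:

```
# Solver fields controlling the test pass (values are illustrative).
# Assumed reference pairing: val batch_size 50  ->  test_iter: 1000  (50 * 1000 = 50000)
# With val batch_size 16 as above:              ->  test_iter: 3125  (16 * 3125 = 50000)
# With val batch_size 32 as above:              ->  test_iter: 1563  (32 * 1563 ~= 50000)
test_iter: 3125
```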

It seems that training with alexnet works fine. I'm not sure what the problem with training caffenet is.
The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04
  3. cuda 6.0
    and make runtest passes, with only a warning that 2 tests are disabled.

And the two nets are:

[image: caffenet_alexnet]

Try setting the bias to 0.1 in all the layers

Sergio

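For anyone following along, the bias change Sergio suggests is made in the layer definitions of the train/val prototxt. A minimal sketch of the relevant fragment, modeled on one of the convolution layers; the surrounding layer wrapper is omitted because its exact syntax depends on the Caffe version in use:

```
# Lower the constant bias initializer from 1 to 0.1 in every layer
# whose bias_filler previously used value: 1.
convolution_param {
  num_output: 256
  kernel_size: 5
  group: 2
  weight_filler {
    type: "gaussian"
    std: 0.01
  }
  bias_filler {
    type: "constant"
    value: 0.1   # was 1
  }
}
```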

@sguada , OK, thank you!

I will try that after the training of the alexnet model is done.
It takes 2 hours per 7k iterations, so the total time will be about 21 days for all 1800000 iterations!
I hope the computer and the graphics card hold up!
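
For the record, the 21-day figure is just the measured throughput extrapolated to the full run:

$$
1\,800\,000\ \text{iterations} \times \frac{2\ \text{h}}{7\,000\ \text{iterations}} \approx 514\ \text{h} \approx 21.4\ \text{days}
$$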

@sguada , I'm sorry, I mistyped your name as "sergeyk"; I have corrected it.

@sguada , oh, I just forgot that we can resume the training procedure. That's very convenient!
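
For context on resuming: the solver periodically writes a weights snapshot plus a .solverstate file, and training can be restarted from the latter. The field values below are typical reference-style settings, not taken from this thread, and the exact resume command depends on the Caffe version:

```
# Snapshot settings in the solver prototxt (illustrative values).
snapshot: 10000                          # write a snapshot every 10k iterations
snapshot_prefix: "caffe_imagenet_train"  # yields e.g. ..._iter_10000.solverstate
# To resume, point the training tool at the saved .solverstate file
# (an optional extra argument to train_net.bin in this era, or
# --snapshot=... with the newer `caffe train` binary).
```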

Hi @sguada , I got some results after changing the bias filler value from 1 to 0.1. But they are very different from the results published in https://github.com/BVLC/caffe/pull/33:

[image: caffenet_trainloss_vs_iters]

[image: caffenet_test_accuracy_vs_iters]

It looks good to me. Given your reduced batch size you will need to train for many more iterations, probably 1 million. And reduce the lr when necessary.


Sergio

@sguada , Thanks for your kind comments.

I've been running the training of caffenet for about one week, and the results below are similar to, but a little different from, those you presented in https://github.com/BVLC/caffe/pull/33. With the reduced batch it indeed needs more iterations, as you said. In this round of training I just set max_iter to 900000, i.e. 90 epochs. It indeed needs more parameter adjustment; "to train these models is more of an art than a science", as Matthew Zeiler put it in http://www.wired.com/2014/07/clarifai/. Thank you very much for sharing your valuable experience and parameter-tuning results.

[image: caffenet_test_accuracy_vs_iters (2)]

[image: caffenet_trainloss_vs_iters (2)]
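
As a sanity check on the "90 epochs" figure: with batch_size 128 and the roughly 1.28 million ILSVRC-2012 training images (the set size is my assumption, not stated in the thread),

$$
\text{epochs} \approx \frac{\text{max\_iter} \times \text{batch\_size}}{\#\text{training images}} = \frac{900\,000 \times 128}{1\,281\,167} \approx 90
$$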

Finally, the training shows similar behavior to that in https://github.com/BVLC/caffe/pull/33, and the testing accuracy is ~56%, about 1% lower than in https://github.com/BVLC/caffe/pull/33 and about 3.9% lower than in Alex's 2012 paper.
It took about 14 days for ~660000 iterations, and ~90 s per 5120 images, which is much slower than the 26 s of a K20.

The configuration is:
Ubuntu 12.04
GTX 750 Ti (2G)
CUDA 6.0
Driver 331.44

[image: caffenet_test_accuracy_vs_iters]

[image: caffenet_trainloss_vs_iters]

Good to hear you got it working with the proper tuning!

@shelhamer , thanks for your comments!
Finally, the training took 17 days. However, there are only about twenty 17-day stretches in a year. With the limited hardware, I didn't try further parameter adjustment. Many thanks to @sguada and everyone who shared their experience of parameter tuning in https://github.com/BVLC/caffe/pull/33; it helped me a lot!

[image: caffenet_test_accuracy_vs_iters]

[image: caffenet_trainloss_vs_iters]

It takes about 3 hours and 20 minutes to train the first 10000 iterations of the BVLC_reference_caffenet model with cuDNN, versus about 4 hours and 40 minutes for the run above.
It is suggested to train with cuDNN switched on.

@research2010 Hello, I see the accuracy curve you plotted has a "second increase phase" at iteration 200000.
How did you do it? My training has been running for one month but it does not increase any more since it hit the first "bottleneck".
Thanks

@research2010
hey, you commented on Jul 12, 2014 with the two pictures of caffenet and alexnet. Did you parse the prototxt files and print them out via graphviz, or how did you produce these two images?

Sorry to chime in so late on a closed issue, but I'm trying to understand the same thing that WoooHaa commented about. What is the cause of the "bottlenecks" and how are they overcome? It seems dangerously easy to wait so long and think that training has converged to an optimal value when it hasn't yet.

thats the "step" a change in the learning rate. So when there is a failure it changes the weights with a stronger effect. When u would start with that higher learning rate from the beginning, your program would start to bounce and would never get better so you have to start with a lower learning rate and increase it when your system reaches saturation. In the plots you can see that he set his step value to 200 000 because you see these changes at 200 000, 400 000 and 600 000.

Thank you for the response! Just to clarify... I usually start with a higher learning rate and decrease it over time. But what you are saying is to actually _increase_ the learning rate later on during training?

https://github.com/BVLC/caffe/issues/430#issuecomment-48795443
oh, you are right.. you have to drop the learning rate
http://caffe.berkeleyvision.org/tutorial/solver.html
I ran my own test on the learning rate 4 months ago and got confused...

Ah gotcha, it all makes sense now. Thank you!
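
To tie the last few comments together: with the "step" policy from the solver tutorial linked above, the learning rate is multiplied by gamma every stepsize iterations, which is what produces the jumps at 200k, 400k and 600k in the plots. A minimal sketch, reusing values mentioned earlier in this thread (gamma: 0.1 is the usual reference choice and an assumption here):

```
# "step" policy: lr = base_lr * gamma ^ floor(iter / stepsize)
base_lr: 0.014142   # value used earlier in this thread for batch_size 128
lr_policy: "step"
gamma: 0.1          # each step multiplies the learning rate by 0.1
stepsize: 200000    # so the drops land at 200k, 400k and 600k iterations
```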

