Caffe: How to train imagenet with reduced memory and batch size?

Created on 21 May 2014  ·  23 Comments  ·  Source: BVLC/caffe

Hi, thank you very much for this valuable library!

The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04

With the default training configuration file for the ImageNet data set, train_net.bin errors out with "out of memory". So I changed the batch_size to 64 (128 is also not valid). Then it works!
The following is the output of train_net.bin:
[image: output of train_net.bin]

And the results are as follows after 2000 iterations:
[image: test scores after 2000 iterations]

It seems the testing scores do not change. As indicated in https://github.com/BVLC/caffe/issues/218, @sguada said that the batch_size and the learning rate are linked. I have set the batch_size to 64, so maybe the learning rate should also be modified. Could anyone give any advice on this subject, please?

question

Most helpful comment

@research2010 Did you change the batch_size in the validation prototxt as well? That would also help you reduce memory usage.
Are you using the latest dev branch? Since #355, training and testing share the data blobs, which saves quite a bit of memory.

Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, there is some room for variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).

What you should change are the stepsize and max_iter, so as to keep the same learning schedule. If you divide the batch_size by X, then you should multiply both of them by X.

Pay attention to the loss: if it doesn't go below 6.9 (roughly ln(1000), i.e. random guessing over the 1000 classes) after 10k-20k iterations, then your training is not learning anything.

All 23 comments

@research2010 Did you change the batch_size in the validation prototxt as well? That would also help you reduce memory usage.
Are you using the latest dev branch? Since #355, training and testing share the data blobs, which saves quite a bit of memory.

Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, there is some room for variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).

What you should change are the stepsize and max_iter, so as to keep the same learning schedule. If you divide the batch_size by X, then you should multiply both of them by X.

Pay attention to the loss: if it doesn't go below 6.9 (roughly ln(1000), i.e. random guessing over the 1000 classes) after 10k-20k iterations, then your training is not learning anything.
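
To make the scaling concrete, here is a sketch of an adjusted solver, using the thread's own case of dropping the training batch_size from 256 to 64 (X = 4). The "original" values below are the commonly cited ImageNet reference settings and are an assumption; check the solver prototxt in your own checkout before copying anything.

```
# imagenet_solver.prototxt, hypothetically adjusted for batch_size: 64.
# Assumed reference values (batch_size: 256):
#   base_lr: 0.01   lr_policy: "step"   gamma: 0.1
#   stepsize: 100000   max_iter: 450000
base_lr: 0.02        # 0.01 * sqrt(4) by the sqrt rule; 0.04 with Alex's linear rule
lr_policy: "step"
gamma: 0.1
stepsize: 400000     # 100000 * 4, so the lr drops after the same number of epochs
max_iter: 1800000    # 450000 * 4, so the total number of epochs is unchanged
momentum: 0.9
weight_decay: 0.0005
```

These happen to be the same values that show up later in this thread for the batch_size 64 runs.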

@sguada , Thank you very much for your kind comments and suggestions.

I use the "git clone https://github.com/BVLC/caffe.git" to checkout the latest version at 2014-05-20. So maybe it isn't the dev branch, but it seems to have been patched by https://github.com/BVLC/caffe/pull/355/commits. I'll check the dev branch and rerun the experiments.

Recently I have been using the GPU to run other experiments, so I couldn't report the results in time. I'll give feedback as soon as the experiments on the ImageNet data set restart.

@sguada @kloudkl , thank you very much for replying!

I have been running the imagenet example again. And some results are as follows:

  1. When I use caffe-0.9 and the latest dev branch and use train_imagenet.sh to train the model, the test score doesn't seem to decrease. As suggested by @sguada, I modified the configuration as follows:
    (1) in imagenet_train.prototxt, the batch_size is 128,
    (2) in imagenet_val.prototxt, the batch_size is 16,
    (3) in imagenet_solver.prototxt, the learning rate is 0.014142, the stepsize is 200000 and the max_iter is 900000,
    and after 20k iterations the test score is still 6.9.
  2. When I use the latest dev branch and use train_alexnet.sh to train the model, it works fine! The modifications are as follows:
    (1) in alexnet_train.prototxt, the batch_size is 64,
    (2) in alexnet_val.prototxt, the batch_size is 32,
    (3) in alexnet_solver.prototxt, the learning rate is 0.02, the stepsize is 400000 and the max_iter is 1800000,
    and after only 4k iterations:

[image: test output after 4k iterations]

But when I use 128 as the training batch_size and 16 as the val batch_size, training with alexnet errors out with "out of memory".
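
One related detail when shrinking the validation batch_size as in the lists above: the solver's test_iter sets how many validation batches are run per test pass, so test_iter × batch_size should still cover the 50,000-image validation set. The reference pairing of batch_size 50 with test_iter 1000 is an assumption here; a minimal sketch:

```
# Solver fields controlling the test pass (values are illustrative).
# Assumed reference pairing: val batch_size 50  ->  test_iter: 1000  (50 * 1000 = 50000)
# With val batch_size 16 as above:              ->  test_iter: 3125  (16 * 3125 = 50000)
# With val batch_size 32 as above:              ->  test_iter: 1563  (32 * 1563 ~= 50000)
test_iter: 3125
```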

It seems that training with alexnet works fine. I'm not sure what the problem with training caffenet is.
The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04
  3. cuda 6.0
    and make runtest passes, with only a warning that 2 tests are disabled.

And the two nets are:

[image: caffenet_alexnet]

Try setting the bias to 0.1 in all the layers

Sergio

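For anyone following along, the bias change Sergio suggests is made in the layer definitions of the train/val prototxt. A minimal sketch of the relevant fragment, modeled on one of the convolution layers; the surrounding layer wrapper is omitted because its exact syntax depends on the Caffe version in use:

```
# Lower the constant bias initializer from 1 to 0.1 in every layer
# whose bias_filler previously used value: 1.
convolution_param {
  num_output: 256
  kernel_size: 5
  group: 2
  weight_filler {
    type: "gaussian"
    std: 0.01
  }
  bias_filler {
    type: "constant"
    value: 0.1   # was 1
  }
}
```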

@sguada , OK, thank you!

I will try that after the training of the alexnet model is done.
It takes 2 hours per 7k iterations, so the total time will be about 21 days for all 1800000 iterations!
I hope the computer and the graphics card hold up!
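
For the record, the 21-day figure is just the measured throughput extrapolated to the full run:

$$
1\,800\,000\ \text{iterations} \times \frac{2\ \text{h}}{7\,000\ \text{iterations}} \approx 514\ \text{h} \approx 21.4\ \text{days}
$$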

@sguada , I'm sorry, I mistyped your name as "sergeyk"; I have corrected it.

@sguada , oh, I just forgot that we can resume the training procedure. That's very convenient!
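
For context on resuming: the solver periodically writes a weights snapshot plus a .solverstate file, and training can be restarted from the latter. The field values below are typical reference-style settings, not taken from this thread, and the exact resume command depends on the Caffe version:

```
# Snapshot settings in the solver prototxt (illustrative values).
snapshot: 10000                          # write a snapshot every 10k iterations
snapshot_prefix: "caffe_imagenet_train"  # yields e.g. ..._iter_10000.solverstate
# To resume, point the training tool at the saved .solverstate file
# (an optional extra argument to train_net.bin in this era, or
# --snapshot=... with the newer `caffe train` binary).
```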

Hi @sguada , I got some results after changing the bias filler value from 1 to 0.1. But they are very different from the results published in https://github.com/BVLC/caffe/pull/33:

[image: caffenet_trainloss_vs_iters]

[image: caffenet_test_accuracy_vs_iters]

It looks good to me. Given your reduced batch size you will need to train for many more iterations, probably 1 million. And reduce the lr when necessary.


Sergio

@sguada , Thanks for your kind comments.

I've been running the training of caffenet for about one week, and the results below are similar to, but a little different from, those you presented in https://github.com/BVLC/caffe/pull/33. With the reduced batch it indeed needs more iterations, as you said. In this round of training I just set max_iter to 900000, i.e. 90 epochs. It indeed needs more parameter adjustment; "to train these models is more of an art than a science", as Matthew Zeiler put it in http://www.wired.com/2014/07/clarifai/. Thank you very much for sharing your valuable experience and parameter-tuning results.

[image: caffenet_test_accuracy_vs_iters (2)]

[image: caffenet_trainloss_vs_iters (2)]
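
As a sanity check on the "90 epochs" figure: with batch_size 128 and the roughly 1.28 million ILSVRC-2012 training images (the set size is my assumption, not stated in the thread),

$$
\text{epochs} \approx \frac{\text{max\_iter} \times \text{batch\_size}}{\#\text{training images}} = \frac{900\,000 \times 128}{1\,281\,167} \approx 90
$$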

Finally, the training shows similar behavior to that in https://github.com/BVLC/caffe/pull/33, and the testing accuracy is ~56%, about 1% lower than in https://github.com/BVLC/caffe/pull/33 and about 3.9% lower than in Alex's 2012 paper.
It took about 14 days for ~660000 iterations, and ~90 s per 5120 images, which is much slower than the 26 s of a K20.

The configuration is:
Ubuntu 12.04
GTX 750 Ti (2G)
CUDA 6.0
Driver 331.44

[image: caffenet_test_accuracy_vs_iters]

[image: caffenet_trainloss_vs_iters]

Good to hear you got it working with the proper tuning!

@shelhamer , thanks for your comments!
Finally, the training took 17 days. However, there are only about twenty 17-day stretches in a year. With the limited hardware, I didn't try further parameter adjustment. Many thanks to @sguada and everyone who shared their experience of parameter tuning in https://github.com/BVLC/caffe/pull/33; it helped me a lot!

[image: caffenet_test_accuracy_vs_iters]

[image: caffenet_trainloss_vs_iters]

It takes about 3 hours and 20 minutes to train the first 10000 iterations of the BVLC_reference_caffenet model with cuDNN, versus about 4 hours and 40 minutes for the run above.
It is suggested to train with cuDNN switched on.

@research2010 Hello, I see the accuracy curve you plotted has a "second increase phase" at iteration 200000.
How did you do it? My training has been running for one month but it does not increase any more since it hit the first "bottleneck".
Thanks

@research2010
hey, you commented on Jul 12, 2014 with the two pictures of caffenet and alexnet. Did you parse the prototxt files and print them out via graphviz, or how did you produce these two images?

Sorry to chime in so late on a closed issue, but I'm trying to understand the same thing that WoooHaa commented about. What is the cause of the "bottlenecks" and how are they overcome? It seems dangerously easy to wait so long and think that training has converged to an optimal value when it hasn't yet.

thats the "step" a change in the learning rate. So when there is a failure it changes the weights with a stronger effect. When u would start with that higher learning rate from the beginning, your program would start to bounce and would never get better so you have to start with a lower learning rate and increase it when your system reaches saturation. In the plots you can see that he set his step value to 200 000 because you see these changes at 200 000, 400 000 and 600 000.

Thank you for the response! Just to clarify... I usually start with a higher learning rate and decrease it over time. But what you are saying is to actually _increase_ the learning rate later on during training?

https://github.com/BVLC/caffe/issues/430#issuecomment-48795443
oh, you are right.. you have to drop the learning rate
http://caffe.berkeleyvision.org/tutorial/solver.html
I ran my own test on the learning rate 4 months ago and got confused...

Ah gotcha, it all makes sense now. Thank you!
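
To tie the last few comments together: with the "step" policy from the solver tutorial linked above, the learning rate is multiplied by gamma every stepsize iterations, which is what produces the jumps at 200k, 400k and 600k in the plots. A minimal sketch, reusing values mentioned earlier in this thread (gamma: 0.1 is the usual reference choice and an assumption here):

```
# "step" policy: lr = base_lr * gamma ^ floor(iter / stepsize)
base_lr: 0.014142   # value used earlier in this thread for batch_size 128
lr_policy: "step"
gamma: 0.1          # each step multiplies the learning rate by 0.1
stepsize: 200000    # so the drops land at 200k, 400k and 600k iterations
```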

