Hi, thank you very much for this valuable library!
The hardware and software environments are as follows:
When using the default train configuration file for the ImageNet data set, train_net.bin errors with "out of memory". So I changed the batch_size to 64 (128 was also not valid). Then it works!
The following is the output of train_net.bin:
And the results are as follows after 2000 iterations:
It seems the testing scores are not changing. As indicated in https://github.com/BVLC/caffe/issues/218, @sguada said that the batch_size and the learning rate are linked. I have set the batch_size to 64, so maybe the learning rate should also be modified. Could anyone give any advice on this subject, please?
@research2010 Did you change the batch_size in the validation.prototxt? That would also help you reduce the memory usage.
Are you using the latest dev branch? Since #355, training and testing share the data blobs, which saves quite a bit of memory.
Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, it allows some variability. Originally base_lr = 0.01 with batch_size=128; we have also used batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X, you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).
What you should change are the stepsize and max_iter, accordingly, to keep the same learning schedule. If you divide the batch_size by X, then you should multiply those by X.
Pay attention to the loss: if it doesn't go below 6.9 (which is basically random guessing, since ln(1000) ≈ 6.91) after 10k-20k iterations, then your training is not learning anything.
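The scaling rules above can be sketched in a few lines of Python (the helper name and signature are mine, not a Caffe API):

```python
import math

def scale_hyperparams(base_lr, stepsize, max_iter, old_batch, new_batch, rule="sqrt"):
    """Rescale solver settings when the batch_size shrinks by a factor X.

    rule="sqrt": raise base_lr by sqrt(X) (the rule above);
    rule="linear": raise it by X (Alex's rule, arxiv.org/abs/1404.5997).
    stepsize and max_iter are stretched by X to keep the same schedule.
    """
    x = old_batch / new_batch
    factor = math.sqrt(x) if rule == "sqrt" else x
    return base_lr * factor, int(stepsize * x), int(max_iter * x)

# Halving the batch from 128 to 64 with the sqrt rule:
lr, step, iters = scale_hyperparams(0.01, 100000, 450000, 128, 64)
print(round(lr, 6), step, iters)  # 0.014142 200000 900000

# The "random guessing" loss for 1000 classes is ln(1000):
print(round(math.log(1000), 2))  # 6.91
```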
@sguada , Thank you very much for your kind comments and suggestions.
I used "git clone https://github.com/BVLC/caffe.git" to check out the latest version on 2014-05-20. So maybe it isn't the dev branch, but it seems to have been patched by https://github.com/BVLC/caffe/pull/355/commits. I'll check the dev branch and rerun the experiments.
Recently I have been using the GPU card to run other experiments, so I couldn't give the results in time. I'll give feedback as soon as the experiments on the ImageNet data set restart.
@sguada @kloudkl , thank you very much for replying!
I have been running the ImageNet example again, and some results are as follows:
- With caffe-0.9 and the latest dev branch, using train_imagenet.sh to train the model, the test score doesn't decrease. As suggested by @sguada, I modified the configuration: in imagenet_train.prototxt the batch_size is 128; in imagenet_val.prototxt the batch_size is 16; in imagenet_solver.prototxt the learning rate is 0.014142, the stepsize is 200000 and the max_iter is 900000.
- With the dev branch, using train_alexnet.sh to train the model, it works fine! The modifications are: in alexnet_train.prototxt the batch_size is 64; in alexnet_val.prototxt the batch_size is 32; in alexnet_solver.prototxt the learning rate is 0.02, the stepsize is 400000 and the max_iter is 1800000.
- But when I use 128 as the training batch_size and 16 as the val batch_size, training with AlexNet errors with out of memory.
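For what it's worth, both sets of numbers are consistent with the sqrt rule applied to a batch-256 baseline with base_lr 0.01, stepsize 100000 and max_iter 450000 (my assumption about the stock solver values):

```python
import math

# Assumed stock solver baseline (batch_size 256):
base_lr, stepsize, max_iter, base_batch = 0.01, 100000, 450000, 256

for batch in (128, 64):
    x = base_batch / batch  # factor by which the batch shrank
    print(batch, round(base_lr * math.sqrt(x), 6), int(stepsize * x), int(max_iter * x))
# 128 0.014142 200000 900000
# 64 0.02 400000 1800000
```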
It seems that training with AlexNet works fine. I'm not sure what the problem with training CaffeNet is.
The hardware and software environments are as follows:
make runtest is fine, but it just outputs "2 tests are disabled" as a warning. And the two nets are:
Try setting the bias to 0.1 in all the layers
Sergio
[image: caffenet_alexnet]
https://cloud.githubusercontent.com/assets/1638818/3560107/7e88751c-095a-11e4-9ac5-9f95fc9c7b17.jpg
@sguada , OK, thank you!
I will try that after the training of the alexnet model is done.
It takes 2 hours for 7k iterations, so the total time will be 21 days for all 1800000 iterations!
I hope the computer and the graphics card stay safe!
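The 21-day estimate is just an extrapolation of the measured pace:

```python
hours_per_iteration = 2.0 / 7000            # 2 hours per 7k iterations
total_days = 1800000 * hours_per_iteration / 24
print(round(total_days, 1))  # 21.4
```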
@sguada , I'm sorry, I just made a mistake typing your name as "sergeyk" and I have corrected it.
@sguada , oh, I just forgot that we can resume the training procedure. That's very convenient!
Hi @sguada , I got some results when I replaced the 1 with 0.1 in the bias fillers. But they are very different from the results published in https://github.com/BVLC/caffe/pull/33:
It looks good to me. Given your reduced batch, you will need to train for many more iterations, probably 1 million. And reduce the lr when necessary.
[image: caffenet_trainloss_vs_iters_]
https://cloud.githubusercontent.com/assets/1638818/3563469/7dfc6264-0a38-11e4-9c6d-2fa822a769a7.gif
[image: caffenet_test_accuracy_vs_iters_]
https://cloud.githubusercontent.com/assets/1638818/3563470/8d591086-0a38-11e4-96c4-2917b361d2d4.gif
Sergio
@sguada , Thanks for your kind comments.
I've been running the training of CaffeNet for about one week, and the results below are similar to, but a little different from, those you presented in https://github.com/BVLC/caffe/pull/33. Because of the reduced batch, it indeed needs more iterations, as you said. In this round of training, I just set max_iter to 900000 for 90 epochs. It indeed needs more parameter adjustment; "To train these models is more of an art than a science," as Matthew Zeiler indicated in http://www.wired.com/2014/07/clarifai/. Thank you very much for sharing your valuable experience and results of parameter adjustment.
Finally, the training shows behavior similar to that in https://github.com/BVLC/caffe/pull/33, and the testing accuracy is ~56%, ~1% lower than in https://github.com/BVLC/caffe/pull/33 and ~3.9% lower than in Alex's 2012 paper.
It took about 14 days for ~660000 iterations, at ~90 s per 5120 images, which is much slower than the ~26 s of a K20.
The configuration is:
Ubuntu 12.04
GTX 750 Ti (2G)
CUDA 6.0
Driver 331.44
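Assuming both timings above cover the same 5120 images, the throughputs compare as:

```python
images = 5120
print(round(images / 90), "images/s on the GTX 750 Ti")  # 57
print(round(images / 26), "images/s on the K20")         # 197
```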
Good to hear you got it working with the proper tuning!
@shelhamer , thanks for your comments!
Finally, it took 17 days for the training. However, there are only about twenty 17-day periods in a year! With the limited hardware, I didn't try further parameter adjustment. Many thanks to @sguada and the guys who shared their experience of parameter tuning in https://github.com/BVLC/caffe/pull/33; it helped me a lot!
It takes about 3 hours and 20 minutes to train the first 10000 iterations of the BVLC_reference_caffenet model with cuDNN, versus about 4 hours and 40 minutes above without it. It is suggested to train with the cuDNN switch on.
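Those two timings amount to roughly a 1.4x speedup from cuDNN:

```python
with_cudnn = 3 * 60 + 20      # minutes for 10000 iterations, with cuDNN
without_cudnn = 4 * 60 + 40   # minutes without
print(round(without_cudnn / with_cudnn, 2))  # 1.4
```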
@research2010 Hello, I see the accuracy curve you plotted has a "second increase phase" at iteration 200000.
How did you do it? My training has been running for one month, but it does not increase any more since it hit the first "bottleneck".
Thanks
@research2010
Hey, you commented on Jul 12, 2014 with the two pictures of CaffeNet and AlexNet. Did you parse the prototxt file and print them out via graphviz? Or how did you produce these two images?
Sorry to chime in so late on a closed issue -- but I'm trying to understand the same thing that WoooHaa commented about. What is the cause of the "bottlenecks" and how are these overcome? It seems dangerously easy to wait so long and think that training has converged to an optimal value, when it hasn't yet.
That's the "step": a change in the learning rate. So when progress stalls, it changes the weights with a stronger effect. If you started with that higher learning rate from the beginning, your training would bounce and would never get better, so you have to start with a lower learning rate and increase it when your system reaches saturation. In the plots you can see that he set his step value to 200000, because you see these changes at 200000, 400000 and 600000.
Thank you for the response! Just to clarify... I usually start with a higher learning rate and decrease it over time. But what you say is to actually _increase_ the learning rate later on during training?
https://github.com/BVLC/caffe/issues/430#issuecomment-48795443
Oh, you are right... you have to drop the learning rate:
http://caffe.berkeleyvision.org/tutorial/solver.html
I made my own tests on the learning rate 4 months ago and got confused...
Ah gotcha, it all makes sense now. Thank you!
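For reference, the step policy described in the solver tutorial drops the learning rate by a factor of gamma every stepsize iterations; a small sketch (gamma = 0.1 is my assumption for the usual default):

```python
def step_lr(base_lr, gamma, stepsize, it):
    """Caffe-style "step" policy: lr = base_lr * gamma ^ floor(it / stepsize)."""
    return base_lr * gamma ** (it // stepsize)

# With stepsize 200000 the rate drops tenfold at exactly the kinks
# visible in the plots above (200k, 400k, 600k):
for it in (0, 200000, 400000, 600000):
    print(it, round(step_lr(0.01, 0.1, 200000, it), 7))
```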