Detectron: context_gpu.cu causing memory issues

Created on 5 Mar 2018 · 4 Comments · Source: facebookresearch/Detectron

This problem occurs very randomly for me. The network (in this case RetinaNet) is training just fine when, at a random iteration, the logging from context_gpu.cu kicks in and GPU memory usage climbs until training is halted with an out-of-memory error.

We're using Ubuntu 16.04 with Pascal GPUs. It happens on several machines, with different numbers of GPUs (1-4), and when training different network architectures.

Any thoughts?

json_stats: {"eta": "17:31:05", "fl_fpn3": 0.000720, "fl_fpn4": 0.001108, "fl_fpn5": 0.004955, "fl_fpn6": 0.020188, "fl_fpn7": 0.019255, "iter": 137400, "loss": 0.174941, "lr": 0.001250, "mb_qsize": 64, "mem": 7110, "retnet_bg_num": 31416503.000000, "retnet_fg_num": 63.500000, "retnet_loss_bbox_fpn3": 0.000000, "retnet_loss_bbox_fpn4": 0.000000, "retnet_loss_bbox_fpn5": 0.000000, "retnet_loss_bbox_fpn6": 0.011040, "retnet_loss_bbox_fpn7": 0.004171, "time": 0.614669}
json_stats: {"eta": "17:30:52", "fl_fpn3": 0.000606, "fl_fpn4": 0.001287, "fl_fpn5": 0.004547, "fl_fpn6": 0.027491, "fl_fpn7": 0.009270, "iter": 137420, "loss": 0.137118, "lr": 0.001250, "mb_qsize": 64, "mem": 7110, "retnet_bg_num": 31416766.000000, "retnet_fg_num": 65.000000, "retnet_loss_bbox_fpn3": 0.000000, "retnet_loss_bbox_fpn4": 0.000000, "retnet_loss_bbox_fpn5": 0.000000, "retnet_loss_bbox_fpn6": 0.015492, "retnet_loss_bbox_fpn7": 0.005272, "time": 0.614670}
I0225 23:55:34.962450 20807 context_gpu.cu:321] GPU 0: 7179 MB
I0225 23:55:34.962478 20807 context_gpu.cu:325] Total: 7179 MB
I0225 23:55:34.972862 20810 context_gpu.cu:321] GPU 0: 7323 MB
I0225 23:55:34.972884 20810 context_gpu.cu:325] Total: 7323 MB
I0225 23:55:34.987242 20807 context_gpu.cu:321] GPU 0: 7467 MB
I0225 23:55:34.987257 20807 context_gpu.cu:325] Total: 7467 MB
I0225 23:55:35.004983 20807 context_gpu.cu:321] GPU 0: 7611 MB
I0225 23:55:35.005004 20807 context_gpu.cu:325] Total: 7611 MB
I0225 23:55:35.019520 20807 context_gpu.cu:321] GPU 0: 7755 MB
I0225 23:55:35.019529 20807 context_gpu.cu:325] Total: 7755 MB
I0225 23:55:35.033624 20807 context_gpu.cu:321] GPU 0: 7899 MB
I0225 23:55:35.033632 20807 context_gpu.cu:325] Total: 7899 MB
I0225 23:55:35.048848 20808 context_gpu.cu:321] GPU 0: 8043 MB
I0225 23:55:35.048869 20808 context_gpu.cu:325] Total: 8043 MB
I0225 23:55:35.065871 20807 context_gpu.cu:321] GPU 0: 8187 MB
I0225 23:55:35.065881 20807 context_gpu.cu:325] Total: 8187 MB
I0225 23:55:35.082967 20807 context_gpu.cu:321] GPU 0: 8331 MB
I0225 23:55:35.082975 20807 context_gpu.cu:325] Total: 8331 MB
I0225 23:55:35.102628 20810 context_gpu.cu:321] GPU 0: 8467 MB
I0225 23:55:35.102646 20810 context_gpu.cu:325] Total: 8467 MB
I0225 23:55:35.123090 20807 context_gpu.cu:321] GPU 0: 8607 MB
I0225 23:55:35.123100 20807 context_gpu.cu:325] Total: 8607 MB
I0225 23:55:35.145066 20807 context_gpu.cu:321] GPU 0: 8739 MB
I0225 23:55:35.145074 20807 context_gpu.cu:325] Total: 8739 MB
I0225 23:55:35.166004 20807 context_gpu.cu:321] GPU 0: 8871 MB
I0225 23:55:35.166013 20807 context_gpu.cu:325] Total: 8871 MB
I0225 23:55:35.187448 20807 context_gpu.cu:321] GPU 0: 9003 MB
I0225 23:55:35.187456 20807 context_gpu.cu:325] Total: 9003 MB
I0225 23:55:35.208040 20807 context_gpu.cu:321] GPU 0: 9135 MB
I0225 23:55:35.208050 20807 context_gpu.cu:325] Total: 9135 MB
I0225 23:55:35.229956 20807 context_gpu.cu:321] GPU 0: 9267 MB
I0225 23:55:35.229964 20807 context_gpu.cu:325] Total: 9267 MB
I0225 23:55:35.251646 20807 context_gpu.cu:321] GPU 0: 9399 MB
I0225 23:55:35.251655 20807 context_gpu.cu:325] Total: 9399 MB
I0225 23:55:35.273802 20807 context_gpu.cu:321] GPU 0: 9531 MB
I0225 23:55:35.273811 20807 context_gpu.cu:325] Total: 9531 MB
I0225 23:55:35.294629 20807 context_gpu.cu:321] GPU 0: 9660 MB
I0225 23:55:35.294638 20807 context_gpu.cu:325] Total: 9660 MB
I0225 23:55:35.320922 20808 context_gpu.cu:321] GPU 0: 9795 MB
I0225 23:55:35.320945 20808 context_gpu.cu:325] Total: 9795 MB
I0225 23:55:35.346731 20809 context_gpu.cu:321] GPU 0: 9934 MB
I0225 23:55:35.346740 20809 context_gpu.cu:325] Total: 9934 MB
I0225 23:55:35.430550 20807 context_gpu.cu:321] GPU 0: 10068 MB
I0225 23:55:35.430560 20807 context_gpu.cu:325] Total: 10068 MB
I0225 23:55:35.566123 20809 context_gpu.cu:321] GPU 0: 10200 MB
I0225 23:55:35.566140 20809 context_gpu.cu:325] Total: 10200 MB
I0225 23:55:35.958365 20807 context_gpu.cu:321] GPU 0: 10332 MB
I0225 23:55:35.958379 20807 context_gpu.cu:325] Total: 10332 MB
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.cu:343] error == cudaSuccess. 2 vs 0. Error at: /home/user/workspace/caffe2/caffe2/core/context_gpu.cu:343: out of memory Error from operator: 
input: "gpu_0/res2_2_sum" input: "gpu_0/res3_0_branch1_w" input: "gpu_0/__m14_shared" output: "gpu_0/res3_0_branch1_w_grad" output: "gpu_0/__m13_shared" name: "" type: "ConvGradient" arg { name: "no_bias" i: 1 } arg { name: "stride" i: 2 } arg { name: "exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "kernel" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
*** Aborted at 1519599336 (unix time) try "date -d @1519599336" if you are using GNU date ***
PC: @     0x7fd60edf8428 gsignal
*** SIGABRT (@0x3ea0000510d) received by PID 20749 (TID 0x7fd4b7fff700) from PID 20749; stack trace: ***
    @     0x7fd60f19e390 (unknown)
    @     0x7fd60edf8428 gsignal
    @     0x7fd60edfa02a abort
    @     0x7fd60c66484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd60c6626b6 (unknown)
    @     0x7fd60c662701 std::terminate()
    @     0x7fd60c68dd38 (unknown)
    @     0x7fd60f1946ba start_thread
I0225 23:55:36.066072 20807 context_gpu.cu:321] GPU 0: 10524 MB
I0225 23:55:36.066085 20807 context_gpu.cu:325] Total: 10524 MB
    @     0x7fd60eeca41d clone
    @                0x0 (unknown)

All 4 comments

I had the same kind of issue (except it was not random). I solved it by lowering the amount of memory required, by modifying the config.yaml.
In my case, I changed the MAX_SIZE parameter in TRAIN from 1333 (baselines) to 833.
I think you could also lower SCALES and BATCH_SIZE.

Hope it helps.
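
For anyone following along, here is a minimal sketch of making that change programmatically rather than by hand, assuming PyYAML is available; the config paths are placeholders and the TRAIN.MAX_SIZE / TRAIN.SCALES keys are the ones mentioned above, so check them against your own baseline yaml:

# Sketch: write a lower-memory copy of a Detectron config (paths are placeholders).
import yaml

SRC = "configs/retinanet_R-50-FPN_1x.yaml"          # hypothetical baseline config
DST = "configs/retinanet_R-50-FPN_1x_lowmem.yaml"   # hypothetical output path

with open(SRC) as f:
    cfg = yaml.safe_load(f)

train = cfg.setdefault("TRAIN", {})
train["MAX_SIZE"] = 833     # baseline is 1333; smaller images mean less GPU memory
train["SCALES"] = "(500,)"  # Detectron yaml stores tuples as strings like "(800,)"

with open(DST, "w") as f:
    yaml.dump(cfg, f, default_flow_style=False)
print("wrote", DST)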

@francoto thanks for the input. Indeed, reducing the batch size could mitigate this problem, but it will also impact model performance. Also, the batch easily fits in GPU memory; it is only at some random point during training (usually after ~16k iterations) that memory usage suddenly increases and training crashes, which is strange behavior. I haven't yet had time to look into what triggers the context_gpu logging to kick in, though.
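
If it helps to pin down where the growth starts, something like the sketch below (just polling nvidia-smi once a second into a CSV; the file name and interval are arbitrary) could be run next to the training job and correlated with the iteration numbers in the json_stats lines:

# Sketch: append per-GPU memory usage to a CSV once a second (assumes nvidia-smi is on PATH).
import datetime
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

with open("gpu_mem_log.csv", "a") as log:
    while True:
        stamp = datetime.datetime.now().isoformat()
        out = subprocess.check_output(QUERY).decode()
        for line in out.strip().splitlines():
            # one row per GPU: timestamp, index, used MiB, total MiB
            log.write("%s,%s\n" % (stamp, line.replace(" ", "")))
        log.flush()
        time.sleep(1.0)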

The problem occurs for me when I do:
~/detectron$ CUDA_VISIBLE_DEVICES=0 python2 tools/train_net.py --cfg configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml OUTPUT_DIR ~/tmp/detectron-output

Found Detectron ops lib: /home/intern/usr/local/lib/libcaffe2_detectron_ops_gpu.so
Found Detectron ops lib: /home/intern/usr/local/lib/libcaffe2_detectron_ops_gpu.so
E0504 22:55:48.136441 8525 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0504 22:55:48.136483 8525 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0504 22:55:48.136489 8525 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO train_net.py: 95: Called with args:
INFO train_net.py: 96: Namespace(cfg_file='configs/04_2018_gn_baselines/scratch_e2e_mask_rcnn_R-50-FPN_3x_gn.yaml', multi_gpu_testing=False, opts=['OUTPUT_DIR', '/home/intern/tmp/detectron-output'], skip_test=False)
INFO train_net.py: 102: Training with config:
INFO train_net.py: 103: {'BBOX_XFORM_CLIP': 4.135166556742356,

...

INFO train.py: 131: Building model: generalized_rcnn
WARNING cnn.py: 25: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory
I0504 22:55:51.732862 8525 memonger.cc:236] Remapping 140 using 24 shared blobs.
...
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:156] . Encountered CUDA error: invalid device ordinal
*** Aborted at 1525445757 (unix time) try "date -d @1525445757" if you are using GNU date ***
PC: @ 0x7fa26afec428 gsignal
*** SIGABRT (@0x3f40000214d) received by PID 8525 (TID 0x7fa26c4da740) from PID 8525; stack trace: ***
@ 0x7fa26baa2390 (unknown)
@ 0x7fa26afec428 gsignal
@ 0x7fa26afee02a abort
@ 0x7fa26ad12b39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7fa26ad111fb __cxxabiv1::__terminate()
@ 0x7fa26ad10640 __cxa_call_terminate
@ 0x7fa26ad10e6f __gxx_personality_v0
@ 0x7fa26aa77564 _Unwind_RaiseException_Phase2
@ 0x7fa26aa7781d _Unwind_RaiseException
@ 0x7fa26ad11409 __cxa_throw
@ 0x7fa25379a109 caffe2::CUDAContext::~CUDAContext()
@ 0x7fa253939412 caffe2::Operator<>::~Operator()
@ 0x7fa2539e1bee caffe2::FillerOp<>::~FillerOp()
@ 0x7fa2539e58f6 caffe2::XavierFillOp<>::~XavierFillOp()
@ 0x7fa2539e5926 caffe2::XavierFillOp<>::~XavierFillOp()
@ 0x7fa252801809 std::vector<>::~vector()
@ 0x7fa2527fffcf caffe2::SimpleNet::SimpleNet()
@ 0x7fa2527cb1a6 caffe2::CreateNet()
@ 0x7fa2527cb8fd caffe2::CreateNet()
@ 0x7fa252835532 caffe2::Workspace::RunNetOnce()
@ 0x7fa25525e1ba _ZZN6caffe26python16addGlobalMethodsERN8pybind116moduleEENKUlRKNS1_5bytesEE28_clES6_.isra.2767.constprop.2859
@ 0x7fa25525e455 _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKNS_5bytesEE28_bJS8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESQ_
@ 0x7fa25528b24d pybind11::cpp_function::dispatcher()
@ 0x7fa26bd8f9c0 PyEval_EvalFrameEx
@ 0x7fa26bd92519 PyEval_EvalCodeEx
@ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
@ 0x7fa26bd92519 PyEval_EvalCodeEx
@ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
@ 0x7fa26bd92519 PyEval_EvalCodeEx
@ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
@ 0x7fa26bd92519 PyEval_EvalCodeEx
@ 0x7fa26bd8f4b2 PyEval_EvalFrameEx
Aborted (core dumped)

I will add my environment info later.

Same problem here, but I have 100 GB of free disk space... is there separate memory used by my GPU? Should I get a better GPU?
Btw, I already trained the model once and it didn't give me this error.
