Detectron: Running out of memory on a 4GB card

Created on 24 Jan 2018  ·  24 Comments  ·  Source: facebookresearch/Detectron

I'm trying to run Faster R-CNN on an Nvidia GTX 1050 Ti, but I'm running out of memory. nvidia-smi says that about 170MB are already in use, but does Faster R-CNN really need 3.8GB of VRAM to process an image?

I tried Mask R-CNN too (the model in the Getting Started tutorial) and got through about 4 images (5 if I closed my browser) before it crashed.

Is this a bug or does it really just need more than 4GB of memory?

INFO infer_simple.py: 111: Processing demo/18124840932_e42b3e377c_k.jpg -> /home/px046/prog/Detectron/output/18124840932_e42b3e377c_k.jpg.pdf
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at blob.h:94] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> .
Offending Blob name: gpu_0/conv_rpn_w.
Error from operator: 
input: "gpu_0/res4_5_sum" input: "gpu_0/conv_rpn_w" input: "gpu_0/conv_rpn_b" output: "gpu_0/conv_rpn" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
*** Aborted at 1516787658 (unix time) try "date -d @1516787658" if you are using GNU date ***
PC: @     0x7f08de455428 gsignal
*** SIGABRT (@0x3e800000932) received by PID 2354 (TID 0x7f087cda9700) from PID 2354; stack trace: ***
    @     0x7f08de4554b0 (unknown)
    @     0x7f08de455428 gsignal
    @     0x7f08de45702a abort
    @     0x7f08d187db39 __gnu_cxx::__verbose_terminate_handler()
    @     0x7f08d187c1fb __cxxabiv1::__terminate()
    @     0x7f08d187c234 std::terminate()
    @     0x7f08d1897c8a execute_native_thread_routine_compat
    @     0x7f08def016ba start_thread
    @     0x7f08de52741d clone
    @                0x0 (unknown)
Aborted (core dumped)

enhancement


All 24 comments

Hi @Omegastick, the memory requirements of the Faster R-CNN algorithm vary depending on a number of factors, including the backbone network architecture and the test image scales used.

For example, you can run Faster R-CNN with the default ResNet-50 config using:

python2 tools/infer_simple.py \
  --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml \
  --output-dir /tmp/detectron-visualizations \
  --image-ext jpg \
  --wts https://s3-us-west-2.amazonaws.com/detectron/35857389/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml.01_37_22.KSeq0b5q/output/train/coco_2014_train%3Acoco_2014_valminusminival/generalized_rcnn/model_final.pkl \
  demo

which should not require more than 3GB to run on the demo images.
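
Since memory usage also depends on the test image scales mentioned above, lowering the test scale is another way to fit a smaller card. A minimal sketch, assuming the detectron.core.config module is importable and the override is applied before the model is built; the values below are only examples:

from detectron.core.config import merge_cfg_from_list

# Example values only: a smaller test scale means smaller feature maps and
# therefore less GPU memory, at some cost in accuracy.
merge_cfg_from_list(['TEST.SCALE', '600', 'TEST.MAX_SIZE', '1000'])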

One additional note: the current implementation uses memory optimizations during training, but not during inference. In the case of inference, it is possible to substantially reduce memory usage since intermediate activations are not needed once they are consumed. We will consider adding inference-only memory optimization in the future.

@Omegastick Tested on my machine: both Faster R-CNN (ResNet-101) and Mask R-CNN (ResNet-101) use around 4GB of GPU memory.

@ir413 Thanks, the model you linked works great (running at 2.5GB VRAM usage) on my machine.

It would be cool if inference didn't need a GPU at all.

How can I run Mask R-CNN with a 2GB GPU? Can anyone help me?

Is this problem due to the implementation of Caffe2 or of Detectron? Which files in Detectron should I look at in order to solve it?

@rbgirshick

In the case of inference, it is possible to substantially reduce memory usage since intermediate activations are not needed once they are consumed. We will consider adding inference-only memory optimization in the future.

Is there already something implemented in PyTorch/Caffe2? If so, where do we need to dig?

@gadcam This has been on my todo list for a long time, but unfortunately its priority has been decreasing instead of increasing :/. I think that caffe2.python.memonger.release_blobs_when_used (https://github.com/pytorch/pytorch/blob/master/caffe2/python/memonger.py#L229) should implement most of what we need (see the sketch after the list below). However, there are some non-trivial issues that need to be addressed:

  • For some networks (e.g. Mask R-CNN) multiple nets are used at inference time and therefore not all activations can be freed by reasoning over only one graph (because they may be needed by another graph, e.g., the mask head net).
  • This function requires the use of a caching memory manager, which we have not tested, so there could be issues with simply turning that on.
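
A minimal sketch of what such a call could look like (untested; model is assumed to be a Detectron model object whose main inference net is model.net, and keeping only the net's external outputs alive is just a starting point):

from caffe2.python import memonger

# Blobs that later nets (e.g. the mask head net) still read must not be freed.
dont_free_blobs = set(model.net.Proto().external_output)

# Returns a copy of the net proto with "Free" ops inserted after the last
# op that consumes each remaining activation.
optimized_proto = memonger.release_blobs_when_used(
    model.net.Proto(), dont_free_blobs, selector_fun=None)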

@rbgirshick Thank you for your detailed explanation!

So as I understand it, for us release_blobs_when_used acts as a converter from a regular Proto to a "memory optimized" one.

For some networks (e.g. Mask R-CNN) multiple nets are used at inference time and therefore not all activations can be freed by reasoning over only one graph (because they may be needed by another graph, e.g., the mask head net).

In other words, we have to fill dont_free_blobs with the blobs used by the second stage?

This function requires the use of a caching memory manager, which we have not tested, so there could be issues with simply turning that on.

So if we want to test it, we would need to set FLAGS_caffe2_cuda_memory_pool to cub (or thc), but can we do this from Python?
One of the very few references to it I could find is here: https://github.com/pytorch/pytorch/blob/6223bfdb1d3273a57b58b2a04c25c6114eaf3911/caffe2/core/context_gpu.cu#L190

@gadcam

So as I understand it, for us release_blobs_when_used acts as a converter from a regular Proto to a "memory optimized" one.

Yes, that's correct. It analyzes the computation graph, determines when each blob will no longer be used, and then inserts a memory freeing op.
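
As a quick sanity check (a sketch; optimized_proto stands for the proto returned by release_blobs_when_used), you can list which blob each inserted op releases:

# Each inserted op has type "Free"; its first input is the blob it releases.
for op in optimized_proto.op:
    if op.type == "Free":
        print("Free inserted for blob:", op.input[0])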

In other words, we have to fill dont_free_blobs with the blobs used by the second stage?

Yes, with the caveat that I'm not sure how well used and/or tested this function is; from grepping the code it seems that it's not really used elsewhere. So keep in mind that it might not work as expected.

So if we want to test it, we would need to set FLAGS_caffe2_cuda_memory_pool to cub (or thc), but can we do this from Python?

Yes. I think the newly added thc memory manager is more efficient. We needed to use it instead of cub for a recent (though different) use case.

@rbgirshick You are right, it looks like a risky path!

Yes. I think the newly added thc memory manager is more efficient. We needed to use it instead of cub for a recent (though different) use case.

What I meant is: do you know where I can find documentation on how to do this, or do you have an example? (I am really sorry to insist on this one; maybe I missed something, but I could not find any documentation on it.)

@gadcam regarding documentation, not that I'm aware of. Sorry!

@asaadaldien I am really sorry to bother you, but you seem to be one of the few people who advised to

MAKE SURE caffe2_cuda_memory_pool is set

when we use memonger or data_parallel_model (for reference, it was here).
Do you have any hint on how to ensure we have a caching memory manager enabled? (using Caffe2 from Python)

@gadcam You can enable the cub caching allocator by passing cub to the caffe2_cuda_memory_pool flag, e.g.:

from caffe2.python import workspace

# argv-style arguments: program name first, then flags.
workspace.GlobalInit(['caffe2', '--caffe2_cuda_memory_pool=cub'])

However, this is required only when using the dynamic memory memonger.

@asaadaldien
It would have taken me a lot of time to figure this out, as there is no documentation about GlobalInit.
Thank you very much for your help! Now I am able to begin some experiments.

I have a simple solution to this problem.
You could set 'P2'~'P5' and 'rois' as output blobs, not only intermediate blobs; then they will not be optimized away when memory optimization is used (see the sketch below).
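
A sketch of that suggestion (untested); the blob names below are only placeholders for the P2-P5 and RoI blobs and will differ depending on the config:

# Placeholder names; substitute the FPN level and RoI blobs your model
# actually produces.
keep_blobs = [
    'gpu_0/fpn_level_P2', 'gpu_0/fpn_level_P3',
    'gpu_0/fpn_level_P4', 'gpu_0/fpn_level_P5',
    'gpu_0/rois',
]
# Blobs registered as external outputs are not freed or recycled by the
# memory optimizer.
model.net.Proto().external_output.extend(keep_blobs)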

This does not seem to work for me.
The model I tested is e2e_keypoint_rcnn_R-50-FPN_s1x.yaml.
I tried applying it to the model.net part.

I used infer_simple.py for the tests.

workspace.GlobalInit(['caffe2', '--caffe2_log_level=0', '--caffe2_cuda_memory_pool=thc']) 

and

import copy

from caffe2.python.memonger import release_blobs_when_used

# Keep the net's external outputs; everything else is a candidate for freeing.
dont_free_blobs = set(model.net.Proto().external_output)
expect_frees = set(i for op in model.net.Proto().op for i in op.input)
expect_frees -= dont_free_blobs

opti_net = release_blobs_when_used(model.net.Proto(), dont_free_blobs, selector_fun=None)
# Append the optimized op list (which contains the inserted Free ops).
model.net.Proto().op.extend(copy.deepcopy(opti_net.op))

test_release_blobs_when_used(model.net.Proto(), expect_frees)

where test_release_blobs_when_used is inspired by https://github.com/pytorch/pytorch/blob/bf58bb5e59fa64fb49d77467f3466c6bc0cc76c5/caffe2/python/memonger_test.py#L731

def test_release_blobs_when_used(with_frees, expect_frees):
    found_frees = set()
    for op in with_frees.op:
        if op.type == "Free":
            print("Free op:", op)
            assert(not op.input[0] in found_frees)  # no double frees
            found_frees.add(op.input[0])
        else:
            # Check that a freed blob is not used anymore
            for inp in op.input:
                assert(not inp in found_frees)
            for outp in op.output:
                assert(not outp in found_frees)

    try:
        assert(expect_frees == found_frees)
    except AssertionError:
        print("Found - Expect frees Nb=", len(found_frees - expect_frees), found_frees - expect_frees, "\n\n\n")
        print("Expect - Found frees Nb=", len(expect_frees - found_frees), expect_frees - found_frees, "\n\n\n")
        # assert(False)

Please note that dont_free_blobs is not set to a correct value!

This function tells me that no unexpected blob would be freed and that some expected frees are missing
(which is normal because dont_free_blobs is not correct).
So I continue and run the model.

And... nothing happens. I checked using the save_graph function: the Free ops are indeed there, at the right places.

The memory usage of this line, for my sample inputs, is 1910 MB +/- 5 MB:
https://github.com/facebookresearch/Detectron/blob/6c5835862888e784e861824e0ad6ac93dd01d8f5/detectron/core/test.py#L158

But something really surprising happens if I set the memory manager to cub:

workspace.GlobalInit(['caffe2', '--caffe2_log_level=0', '--caffe2_cuda_memory_pool=cub']) 

The memory usage of the RunNet line increases by something like 3 GB!! (using either the regular code or the custom one with the Free ops)

I fail to understand what is going on...

As described in #507, I am also facing an out-of-memory error when starting inference on a Jetson TX1.
The solution described in this thread,

python2 tools/infer_simple.py \
  --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml \
  --output-dir /tmp/detectron-visualizations \
  --image-ext jpg \
  --wts https://s3-us-west-2.amazonaws.com/detectron/35857389/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml.01_37_22.KSeq0b5q/output/train/coco_2014_train%3Acoco_2014_valminusminival/generalized_rcnn/model_final.pkl \
  demo

also does not work: I still run out of memory, although I have a total of 4 GB of RAM available (though CPU and GPU memory are shared on the Jetson).
Is there an even smaller model that I could try?
As @Omegastick described, it should only take about 2.5 GB of memory, but it still does not seem to fit on the Jetson. Any other suggestions I could try?

@johannathiemich I have the same problem. There aren't any errors, but the process gets killed. Have you solved the problem? I am using a Jetson TX1, too.

@ll884856 Yes, actually I did. I ended up exchanging the backbone with a SqueezeNet and training the network again. But keep in mind that the performance is much worse than with the original ResNet backbone.
What you could also try before exchanging the backbone is turning off the FPN; that might help as well. It will also reduce accuracy, although I would hope the decrease is not as bad.
If you would like, I can give you my implementation and weights for the SqueezeNet version. I am currently working on my bachelor's thesis on this topic.

@johannathiemich Thank you for your reply! In fact, I have only just entered this field and I'm not yet very clear on the architecture of Mask R-CNN. If you could give me your implementation and weights, it would help me a lot to understand and implement Mask R-CNN. My e-mail is [email protected]
Thank you!

Yeah, you can run Mask R-CNN on the CPU, just not with Detectron; see:
https://vimeo.com/277180815

I have a similar problem, so if anyone here can help me, I would really appreciate it: https://github.com/facebookresearch/detectron2/issues/1539. I really don't understand why this is happening: I need 9.3 GB of RAM to predict a batch of 25 images on the CPU, even after wrapping the call in torch.no_grad().

