Detectron: Various errors when training scales=320

Created on 5 May 2018 · 22 Comments · Source: facebookresearch/Detectron

Expected results

Training runs correctly at any reasonable input size.

Actual results

Training runs correctly for some iterations, then crashes at a random point. I disabled dataset shuffling by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and ran on VOC twice; the program crashed at a different iteration each time.

What's more, the error messages differ from run to run. Sometimes it is

*** Error in `python': double free or corruption (out): 0x00007f42fc228790 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f46092137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f460921c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f460922053c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f4600ca7def]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f4600cab032]
python(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python(PyEval_EvalCodeEx+0x255)[0x4c2705]
python[0x4de69e]
python(PyObject_Call+0x43)[0x4b0c93]
python[0x4f452e]
python(PyObject_Call+0x43)[0x4b0c93]
python(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83d40)[0x7f45fe003d40]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x854c1)[0x7f45fe0054c1]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4ca1b)[0x7f45fdfcca1b]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x98dd8)[0x7f45fe018dd8]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95155)[0x7f45fe015155]
/usr/local/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f45f5818c5a]
/usr/local/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x305)[0x7f45f5817a15]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f4603171c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f460956d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f46092a341d]
======= Memory map: ========
00400000-006e9000 r-xp 00000000 08:05 15099550902                        /usr/bin/python2.7
008e8000-008ea000 r--p 002e8000 08:05 15099550902                        /usr/bin/python2.7
008ea000-00961000 rw-p 002ea000 08:05 15099550902                        /usr/bin/python2.7
00961000-00984000 rw-p 00000000 00:00 0 
02372000-1bcfe000 rw-p 00000000 00:00 0                                  [heap]
200000000-200200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0 
200400000-200404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0 
200600000-200a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0 
201800000-201804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0 
201a00000-201e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201e00000-202c00000 ---p 00000000 00:00 0 
202c00000-202c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
202c04000-202e00000 ---p 00000000 00:00 0 
202e00000-203200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
203200000-204000000 ---p 00000000 00:00 0 
204000000-204004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204004000-204200000 ---p 00000000 00:00 0 
204200000-204600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204600000-205400000 ---p 00000000 00:00 0 
205400000-205404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0 
205600000-205a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205a00000-206800000 ---p 00000000 00:00 0 
206800000-206804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206804000-206a00000 ---p 00000000 00:00 0 
206a00000-206e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206e00000-207c00000 ---p 00000000 00:00 0 
207c00000-207c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
207c04000-207e00000 ---p 00000000 00:00 0 
207e00000-208200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
208200000-209000000 ---p 00000000 00:00 0 
209000000-209004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209004000-209200000 ---p 00000000 00:00 0 
209200000-209600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209600000-20a400000 ---p 00000000 00:00 0 
20a400000-20a404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20a404000-20a600000 ---p 00000000 00:00 0 
20a600000-20aa00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa00000-20aa04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa04000-20ac00000 ---p 00000000 00:00 0 
20ac00000-20b000000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20b000000-20b004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl

and sometimes it is

*** Aborted at 1525523656 (unix time) try "date -d @1525523656" if you are using GNU date ***
PC: @     0x7f7c0376048a (unknown)
*** SIGSEGV (@0x0) received by PID 89364 (TID 0x7f79559fb700) from PID 0; stack trace: ***
    @     0x7f7c03abd390 (unknown)
    @     0x7f7c0376048a (unknown)
    @     0x7f7c03763cde (unknown)
    @     0x7f7c03766184 __libc_malloc
    @     0x7f7b7f400a36 (unknown)
    @     0x7f7b7f979634 (unknown)
    @     0x7f7b7fa10d34 (unknown)
    @     0x7f7b7fa131b7 (unknown)
    @     0x7f7b7f409ecb (unknown)
    @     0x7f7b7f40a40c cudnnConvolutionBackwardFilter
    @     0x7f7bc570ac3b _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIfffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
    @     0x7f7bc571337c caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @     0x7f7bc56fead0 caffe2::CudnnConvGradientOp::RunOnDevice()
    @     0x7f7bc568694b caffe2::Operator<>::Run()
    @     0x7f7bf797ec5a caffe2::DAGNet::RunAt()
    @     0x7f7bf797da15 caffe2::DAGNetBase::WorkerFunction()
    @     0x7f7bfd6b7c80 (unknown)
    @     0x7f7c03ab36ba start_thread
    @     0x7f7c037e941d clone
    @                0x0 (unknown)

Detailed steps to reproduce

In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also set FPN.RPN_ANCHOR_START_SIZE to 16 and FPN.ROI_CANONICAL_SCALE to 90.

I have tested on COCO and VOC; both fail.
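
For reference, here are the overrides described above as a minimal sketch against Detectron's Python config API (the base config path is a placeholder; module paths and key locations may differ between Detectron versions):

    # Hypothetical sketch of the config overrides from this report; the base
    # config file name is a placeholder, not an actual Detectron config.
    from detectron.core.config import assert_and_infer_cfg, cfg, merge_cfg_from_file

    merge_cfg_from_file('configs/your_fpn_config.yaml')  # placeholder path
    cfg.TRAIN.SCALES = (320,)
    cfg.TRAIN.MAX_SIZE = 500
    cfg.FPN.RPN_ANCHOR_START_SIZE = 16
    cfg.FPN.ROI_CANONICAL_SCALE = 90  # default 224 scaled by 320/800 (~89.6)
    assert_and_infer_cfg()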

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: gcc 5.4.0
  • CUDA version: 9.1
  • cuDNN version: 7
  • NVIDIA driver version: 387.26
  • GPU models (for all devices if they are not all the same): P40 x 4
  • PYTHONPATH environment variable: null
  • python --version output: Python 2.7.12

Most helpful comment

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils.
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx).
  • Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • Run 'make' in the Detectron root to compile the Cython modules again.

For me, training now works fine for 20k iterations, and inference no longer seg-faults either (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other TRAIN/TEST.SCALE settings. In my experience, the problem was not the image scale itself but the objects, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure whether I'll find time to open a PR soon.

All 22 comments

I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only during training.

What's more, when I reduced TEST.RPN_PRE_NMS_TOP_N from 1000 to 100, the test sometimes runs fine (~1 in 5 times), and sometimes models (including my own model and the faster-rcnn-R-50-FPN_1x model from the model zoo) crash at test time when I run test_net.py. The error messages also vary.

I am quite confused about ROI_CANONICAL_SCALE: 90. How do I get this value?

@moyans You can use grep -rn <detectron directory> -e "ROI_CANONICAL_SCALE" --include=*.py to get all lines containing "ROI_CANONICAL_SCALE" in .py files

@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number.

@moyans I scaled the default value proportionally with the training scale: 224 * 320 / 800 = 89.6, rounded to 90.

Did you find a solution to your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since if I set TRAIN/TEST.RPN_NMS_THRESH: 0.0 the model trains and evaluates (but with worse results, of course).
Another workaround was to use a larger TRAIN/TEST.SCALE (e.g. 800, with TRAIN/TEST.MAX_SIZE: 1333).

I also tried to debug this, but I still couldn't fully figure out the problem. Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that you sometimes divide by zero inside the NMS (when the two boxes being compared both have area 0). That case should be avoided by setting TRAIN/TEST.RPN_MIN_SIZE > 0, but it seems this is not the only problem.
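
To illustrate where that division happens, here is a simplified sketch of the classic greedy NMS in the style of py-faster-rcnn (not the exact Detectron Cython code; the box layout and the +1 box-size convention are assumptions matching that repo):

    # Simplified greedy NMS sketch; dets is an (N, 5) float array of
    # [x1, y1, x2, y2, score]. The IoU division below becomes 0/0 when the
    # two boxes being compared are both degenerate (zero area), which
    # RPN_MIN_SIZE > 0 is meant to prevent.
    import numpy as np

    def nms_sketch(dets, thresh):
        x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
        scores = dets[:, 4]
        areas = (x2 - x1 + 1) * (y2 - y1 + 1)
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the highest-scoring box with all remaining boxes.
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            w = np.maximum(0.0, xx2 - xx1 + 1)
            h = np.maximum(0.0, yy2 - yy1 + 1)
            inter = w * h
            # Denominator can be zero for degenerate boxes -> divide by zero.
            ovr = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[np.where(ovr <= thresh)[0] + 1]
        return keep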

Could you please try to switch off the RPN_NMS (set RPN_NMS_THRESH to 0.0) and see if it works then?

Maybe the number of anchors/proposals is also too small when NMS is applied to regressed boxes generated from very small feature maps (due to the reduced input image size).

@pfollmann Thanks for the information! I may try it tomorrow. Does the bug still exist even with TRAIN/TEST.RPN_MIN_SIZE > 0?

Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations of the kind you described above.

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils.
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx); a sketch of this step follows the list.
  • Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • Run 'make' in the Detectron root to compile the Cython modules again.
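
For reference, a rough sketch of what the setup.py registration could look like, mirroring the existing cython_nms/cython_bbox entries (the exact entry names, paths, and compile flags are assumptions and may differ between Detectron versions):

    # Hypothetical registration of utils/argsort.pyx alongside Detectron's
    # existing Cython extensions; adapt names and paths to your checkout.
    import numpy as np
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    ext_modules = [
        Extension(
            name='utils.argsort',
            sources=['utils/argsort.pyx'],
            extra_compile_args=['-Wno-cpp'],
            include_dirs=[np.get_include()],
        ),
    ]

    setup(name='Detectron', ext_modules=cythonize(ext_modules))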

For me, training now works fine for 20k iterations, and inference no longer seg-faults either (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other TRAIN/TEST.SCALE settings. In my experience, the problem was not the image scale itself but the objects, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure whether I'll find time to open a PR soon.

@pfollmann Wow, so cool! I will try it soon. Thanks!

@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master.

Looking forward to your PR :) You might want to fetch master and modify the corresponding files there; the steps on master are no different from those you pointed out above.

@pfollmann Thanks, you saved my day!

Hi, I ran into this problem when using the above method. Do you have any idea how to solve it? Thank you very much.

import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
import utils.argsort as argsort
ImportError: No module named utils.argsort

@shenghsiaowong I think you should use import detectron.utils.argsort as argsort because the project structure changed after @pfollmann posted his solution.

I have changed it, but it still does not work. I know this is a small issue, but I have no idea what is wrong:
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
#import detectron.utils.argsort as argsort
ImportError: No module named utils.argsort

What does this mean? Thank you.
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 28, in init detectron.utils.cython_nms
import detectron.utils.argsort as argsort
ImportError: dynamic module does not define init function (initargsort)

@shenghsiaowong Sorry, I haven't run into that error. @pfollmann, do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :)

@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice another error happened: Error in 'python': free() invalid next size (fast)
Could you give any advice about it?

@shenghsiaowong
Try pasting "import detectron.utils.argsort as argsort" after the existing imports at the top of cython_nms.pyx, i.e.:

    cimport cython
    import numpy as np
    cimport numpy as np

    import detectron.utils.argsort as argsort

@karenyun I am hitting the same problem here. Did you figure it out?

@karenyun @StepOITD I met the same problem and solved it as follows:
(1) change the file detectron/utils/cython_nms.pyx as pfollmann suggested;
(2) put the new argsort lines after the definition of ndets. In other words, change the original code snippet:

    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

to

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order) 

Hope it helps.
