Detectron: Various errors when training scales=320

Created on 5 May 2018 · 22 Comments · Source: facebookresearch/Detectron

Expected results

Training runs correctly at any reasonable input size.

Actual results

Training runs correctly for some iterations, then crashes at a random point. I disabled dataset shuffling by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and ran on VOC twice; the program crashed at a different iteration each time.

What's more, the error messages differ from run to run. Sometimes it is

*** Error in `python': double free or corruption (out): 0x00007f42fc228790 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f46092137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f460921c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f460922053c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f4600ca7def]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f4600cab032]
python(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python(PyEval_EvalCodeEx+0x255)[0x4c2705]
python[0x4de69e]
python(PyObject_Call+0x43)[0x4b0c93]
python[0x4f452e]
python(PyObject_Call+0x43)[0x4b0c93]
python(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83d40)[0x7f45fe003d40]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x854c1)[0x7f45fe0054c1]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4ca1b)[0x7f45fdfcca1b]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x98dd8)[0x7f45fe018dd8]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95155)[0x7f45fe015155]
/usr/local/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f45f5818c5a]
/usr/local/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x305)[0x7f45f5817a15]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f4603171c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f460956d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f46092a341d]
======= Memory map: ========
00400000-006e9000 r-xp 00000000 08:05 15099550902                        /usr/bin/python2.7
008e8000-008ea000 r--p 002e8000 08:05 15099550902                        /usr/bin/python2.7
008ea000-00961000 rw-p 002ea000 08:05 15099550902                        /usr/bin/python2.7
00961000-00984000 rw-p 00000000 00:00 0 
02372000-1bcfe000 rw-p 00000000 00:00 0                                  [heap]
200000000-200200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200200000-200400000 ---p 00000000 00:00 0 
200400000-200404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200404000-200600000 ---p 00000000 00:00 0 
200600000-200a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
200a00000-201800000 ---p 00000000 00:00 0 
201800000-201804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201804000-201a00000 ---p 00000000 00:00 0 
201a00000-201e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
201e00000-202c00000 ---p 00000000 00:00 0 
202c00000-202c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
202c04000-202e00000 ---p 00000000 00:00 0 
202e00000-203200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
203200000-204000000 ---p 00000000 00:00 0 
204000000-204004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204004000-204200000 ---p 00000000 00:00 0 
204200000-204600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
204600000-205400000 ---p 00000000 00:00 0 
205400000-205404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205404000-205600000 ---p 00000000 00:00 0 
205600000-205a00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
205a00000-206800000 ---p 00000000 00:00 0 
206800000-206804000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206804000-206a00000 ---p 00000000 00:00 0 
206a00000-206e00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
206e00000-207c00000 ---p 00000000 00:00 0 
207c00000-207c04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
207c04000-207e00000 ---p 00000000 00:00 0 
207e00000-208200000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
208200000-209000000 ---p 00000000 00:00 0 
209000000-209004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209004000-209200000 ---p 00000000 00:00 0 
209200000-209600000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
209600000-20a400000 ---p 00000000 00:00 0 
20a400000-20a404000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20a404000-20a600000 ---p 00000000 00:00 0 
20a600000-20aa00000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa00000-20aa04000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20aa04000-20ac00000 ---p 00000000 00:00 0 
20ac00000-20b000000 rw-s 00000000 00:05 154858                           /dev/nvidiactl
20b000000-20b004000 rw-s 00000000 00:05 154858                           /dev/nvidiactl

and sometimes it is

*** Aborted at 1525523656 (unix time) try "date -d @1525523656" if you are using GNU date ***
PC: @     0x7f7c0376048a (unknown)
*** SIGSEGV (@0x0) received by PID 89364 (TID 0x7f79559fb700) from PID 0; stack trace: ***
    @     0x7f7c03abd390 (unknown)
    @     0x7f7c0376048a (unknown)
    @     0x7f7c03763cde (unknown)
    @     0x7f7c03766184 __libc_malloc
    @     0x7f7b7f400a36 (unknown)
    @     0x7f7b7f979634 (unknown)
    @     0x7f7b7fa10d34 (unknown)
    @     0x7f7b7fa131b7 (unknown)
    @     0x7f7b7f409ecb (unknown)
    @     0x7f7b7f40a40c cudnnConvolutionBackwardFilter
    @     0x7f7bc570ac3b _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIfffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
    @     0x7f7bc571337c caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @     0x7f7bc56fead0 caffe2::CudnnConvGradientOp::RunOnDevice()
    @     0x7f7bc568694b caffe2::Operator<>::Run()
    @     0x7f7bf797ec5a caffe2::DAGNet::RunAt()
    @     0x7f7bf797da15 caffe2::DAGNetBase::WorkerFunction()
    @     0x7f7bfd6b7c80 (unknown)
    @     0x7f7c03ab36ba start_thread
    @     0x7f7c037e941d clone
    @                0x0 (unknown)

Detailed steps to reproduce

In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also set FPN.RPN_ANCHOR_START_SIZE to 16 and FPN.ROI_CANONICAL_SCALE to 90.

I have tested on COCO and VOC; both fail.
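
For reference, here are the overrides described above as a minimal sketch against Detectron's Python config API (the base config path is a placeholder; module paths and key locations may differ between Detectron versions):

    # Hypothetical sketch of the config overrides from this report; the base
    # config file name is a placeholder, not an actual Detectron config.
    from detectron.core.config import assert_and_infer_cfg, cfg, merge_cfg_from_file

    merge_cfg_from_file('configs/your_fpn_config.yaml')  # placeholder path
    cfg.TRAIN.SCALES = (320,)
    cfg.TRAIN.MAX_SIZE = 500
    cfg.FPN.RPN_ANCHOR_START_SIZE = 16
    cfg.FPN.ROI_CANONICAL_SCALE = 90  # default 224 scaled by 320/800 (~89.6)
    assert_and_infer_cfg()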

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: gcc 5.4.0
  • CUDA version: 9.1
  • cuDNN version: 7
  • NVIDIA driver version: 387.26
  • GPU models (for all devices if they are not all the same): P40 x 4
  • PYTHONPATH environment variable: null
  • python --version output: Python 2.7.12

Most helpful comment

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils.
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx).
  • Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • Run 'make' in the Detectron root to compile the Cython modules again.

For me, training now works fine for 20k iterations, and inference no longer seg-faults either (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other TRAIN/TEST.SCALE settings. In my experience, the problem was not the image scale itself but the objects, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure whether I'll find time to open a PR soon.

All 22 comments

I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only during training.

What's more, when I reduced TEST.RPN_PRE_NMS_TOP_N from 1000 to 100, the test sometimes runs fine (~1 in 5 times), and sometimes models (including my own model and the faster-rcnn-R-50-FPN_1x model from the model zoo) crash at test time when I run test_net.py. The error messages also vary.

I am quite confused about ROI_CANONICAL_SCALE: 90. How do I get this value?

@moyans You can use grep -rn <detectron directory> -e "ROI_CANONICAL_SCALE" --include=*.py to get all lines containing "ROI_CANONICAL_SCALE" in .py files

@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number.

@moyans I scaled the default value proportionally with the training scale: 224 * 320 / 800 = 89.6, rounded to 90.

Did you find a solution to your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since if I set TRAIN/TEST.RPN_NMS_THRESH: 0.0 the model trains and evaluates (but with worse results, of course).
Another workaround was to use a larger TRAIN/TEST.SCALE (e.g. 800, with TRAIN/TEST.MAX_SIZE: 1333).

I also tried to debug this, but I still couldn't fully figure out the problem. Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that you sometimes divide by zero inside the NMS (when the two boxes being compared both have area 0). That case should be avoided by setting TRAIN/TEST.RPN_MIN_SIZE > 0, but it seems this is not the only problem.
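
To illustrate where that division happens, here is a simplified sketch of the classic greedy NMS in the style of py-faster-rcnn (not the exact Detectron Cython code; the box layout and the +1 box-size convention are assumptions matching that repo):

    # Simplified greedy NMS sketch; dets is an (N, 5) float array of
    # [x1, y1, x2, y2, score]. The IoU division below becomes 0/0 when the
    # two boxes being compared are both degenerate (zero area), which
    # RPN_MIN_SIZE > 0 is meant to prevent.
    import numpy as np

    def nms_sketch(dets, thresh):
        x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
        scores = dets[:, 4]
        areas = (x2 - x1 + 1) * (y2 - y1 + 1)
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # Intersection of the highest-scoring box with all remaining boxes.
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            w = np.maximum(0.0, xx2 - xx1 + 1)
            h = np.maximum(0.0, yy2 - yy1 + 1)
            inter = w * h
            # Denominator can be zero for degenerate boxes -> divide by zero.
            ovr = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[np.where(ovr <= thresh)[0] + 1]
        return keep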

Could you please try to switch off the RPN_NMS (set RPN_NMS_THRESH to 0.0) and see if it works then?

Maybe the number of anchors/proposals is also too small when NMS is applied to regressed boxes generated from very small feature maps (due to the reduced input image size).

@pfollmann Thanks for the information! I may try it tomorrow. Does the bug still exist even with TRAIN/TEST.RPN_MIN_SIZE > 0?

Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations of the kind you described above.

I think I found the problem: it is in detectron/utils/cython_nms.pyx:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy.argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:

  • Place argsort.pyx in detectron/utils.
  • Change line 13 in argsort.pyx to
    ctypedef cnp.float32_t FLOAT_t
  • Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx); a sketch of this step follows the list.
  • Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
    import utils.argsort as argsort
    ...
    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order)
  • Run 'make' in the Detectron root to compile the Cython modules again.
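
For reference, a rough sketch of what the setup.py registration could look like, mirroring the existing cython_nms/cython_bbox entries (the exact entry names, paths, and compile flags are assumptions and may differ between Detectron versions):

    # Hypothetical registration of utils/argsort.pyx alongside Detectron's
    # existing Cython extensions; adapt names and paths to your checkout.
    import numpy as np
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    ext_modules = [
        Extension(
            name='utils.argsort',
            sources=['utils/argsort.pyx'],
            extra_compile_args=['-Wno-cpp'],
            include_dirs=[np.get_include()],
        ),
    ]

    setup(name='Detectron', ext_modules=cythonize(ext_modules))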

For me, training now works fine for 20k iterations, and inference no longer seg-faults either (with a little speed-up as a bonus ;-) )

The open question is still why cython_nms.pyx worked fine for other TRAIN/TEST.SCALE settings. In my experience, the problem was not the image scale itself but the objects, which become very small when images are rescaled to small sizes.

Hope that helps!

PS: My current Detectron version has diverged quite far from master, so I'm not sure whether I'll find time to open a PR soon.

@pfollmann Wow, so cool! I will try it soon. Thanks!

@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master.

Looking forward to your PR :) You might want to fetch master and modify the corresponding files there; the steps on master are no different from those you pointed out above.

@pfollmann Thanks, you saved my day!

Hi, I ran into this problem when using the above method. Do you have any idea how to solve it? Thank you very much.

import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
import utils.argsort as argsort
ImportError: No module named utils.argsort

@shenghsiaowong I think you should use import detectron.utils.argsort as argsort because the project structure changed after @pfollmann posted his solution.

I have changed it, but it still does not work. I know this is a small issue, but I have no idea what is wrong:
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
#import detectron.utils.argsort as argsort
ImportError: No module named utils.argsort

What does this mean? Thank you.
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 28, in init detectron.utils.cython_nms
import detectron.utils.argsort as argsort
ImportError: dynamic module does not define init function (initargsort)

@shenghsiaowong Sorry, I haven't run into that error. @pfollmann, do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :)

@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice another error happened: Error in 'python': free() invalid next size (fast)
Could you give any advice about it?

@shenghsiaowong
Try pasting "import detectron.utils.argsort as argsort" after the existing imports at the top of cython_nms.pyx, i.e.:

    cimport cython
    import numpy as np
    cimport numpy as np

    import detectron.utils.argsort as argsort

@karenyun I am hitting the same problem here. Did you figure it out?

@karenyun @StepOITD I met the same problem and solved it as follows:
(1) change the file detectron/utils/cython_nms.pyx as pfollmann suggested;
(2) put the new argsort lines after the definition of ndets. In other words, change the original code snippet:

    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

to

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = \
            np.zeros((ndets), dtype=np.int)

    cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
    argsort.argsort(-scores, order) 

Hope it helps.
