Training should run correctly at any valid image size.
Training runs correctly for some iterations, then ends at a random time. I disabled dataset shuffling by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and ran on VOC twice; the program crashed at a different iteration each time. What's more, the error messages differ between runs. Sometimes it is:
*** Error in `python': double free or corruption (out): 0x00007f42fc228790 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f46092137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f460921c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f460922053c]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x1edef)[0x7f4600ca7def]
/usr/local/lib/python2.7/dist-packages/numpy/core/multiarray.so(+0x22032)[0x7f4600cab032]
python(PyEval_EvalFrameEx+0x6162)[0x4ca0d2]
python(PyEval_EvalFrameEx+0x5e0f)[0x4c9d7f]
python(PyEval_EvalCodeEx+0x255)[0x4c2705]
python[0x4de69e]
python(PyObject_Call+0x43)[0x4b0c93]
python[0x4f452e]
python(PyObject_Call+0x43)[0x4b0c93]
python(PyEval_CallObjectWithKeywords+0x30)[0x4ce540]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x83d40)[0x7f45fe003d40]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x854c1)[0x7f45fe0054c1]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x4ca1b)[0x7f45fdfcca1b]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x98dd8)[0x7f45fe018dd8]
/usr/local/lib/python2.7/dist-packages/caffe2/python/caffe2_pybind11_state_gpu.so(+0x95155)[0x7f45fe015155]
/usr/local/lib/libcaffe2.so(_ZN6caffe26DAGNet5RunAtEiRKSt6vectorIiSaIiEE+0x5a)[0x7f45f5818c5a]
/usr/local/lib/libcaffe2.so(_ZN6caffe210DAGNetBase14WorkerFunctionEv+0x305)[0x7f45f5817a15]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f4603171c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f460956d6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f46092a341d]
======= Memory map: ========
(memory map omitted: dozens of repetitive /dev/nvidiactl and library mappings)
and sometimes it is:
*** Aborted at 1525523656 (unix time) try "date -d @1525523656" if you are using GNU date ***
PC: @ 0x7f7c0376048a (unknown)
*** SIGSEGV (@0x0) received by PID 89364 (TID 0x7f79559fb700) from PID 0; stack trace: ***
@ 0x7f7c03abd390 (unknown)
@ 0x7f7c0376048a (unknown)
@ 0x7f7c03763cde (unknown)
@ 0x7f7c03766184 __libc_malloc
@ 0x7f7b7f400a36 (unknown)
@ 0x7f7b7f979634 (unknown)
@ 0x7f7b7fa10d34 (unknown)
@ 0x7f7b7fa131b7 (unknown)
@ 0x7f7b7f409ecb (unknown)
@ 0x7f7b7f40a40c cudnnConvolutionBackwardFilter
@ 0x7f7bc570ac3b _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIfffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
@ 0x7f7bc571337c caffe2::CudnnConvGradientOp::DoRunWithType<>()
@ 0x7f7bc56fead0 caffe2::CudnnConvGradientOp::RunOnDevice()
@ 0x7f7bc568694b caffe2::Operator<>::Run()
@ 0x7f7bf797ec5a caffe2::DAGNet::RunAt()
@ 0x7f7bf797da15 caffe2::DAGNetBase::WorkerFunction()
@ 0x7f7bfd6b7c80 (unknown)
@ 0x7f7c03ab36ba start_thread
@ 0x7f7c037e941d clone
@ 0x0 (unknown)
In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also modified FPN.RPN_ANCHOR_START_SIZE to 16 and ROI_CANONICAL_SCALE to 90.
I have tested on both COCO and VOC; both fail.
PYTHONPATH environment variable: null
python --version output: Python 2.7.12
I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only during training.
What's more, when I reduced TEST.RPN_PRE_NMS_TOP_N from 1000 to 100, the test sometimes (~1 in 5 times) ran fine, and sometimes the models (including my own model and the faster-rcnn-R-50-FPN_1x model from the model zoo) crashed at test time when I ran test_net.py. The error messages also vary.
I am quite confused about ROI_CANONICAL_SCALE: 90. How do I get that value?
@moyans You can use grep -rn <detectron directory> -e "ROI_CANONICAL_SCALE" --include=*.py to get all lines containing "ROI_CANONICAL_SCALE" in .py files.
@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number.
@moyans I calculated it as 224 * 320 / 800, i.e. I scaled the default canonical scale (224) by the same factor as the input scale (800 down to 320).
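The arithmetic behind that number can be sketched as a one-liner. The helper name below is mine, not something from Detectron; the defaults (224 and 800) are the stock canonical scale and training scale discussed above:

```python
def scaled_canonical_scale(new_scale, default_canonical=224, default_scale=800):
    """Keep FPN's canonical ROI scale proportional to the input image scale."""
    return int(round(default_canonical * new_scale / float(default_scale)))

print(scaled_canonical_scale(320))  # 224 * 320 / 800 = 89.6 -> 90
```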
Did you find a solution for your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since if I set TRAIN/TEST.RPN_NMS_THRESH: 0.0 the model trains and evaluates (but with worse results, of course).
Another workaround was to use a larger TRAIN/TEST.SCALE (e.g. 800, with TRAIN/TEST.MAX_SIZE: 1333).
I also tried to debug this, but I still couldn't figure out the problem... Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that sometimes there is a division by zero inside the NMS (when the two boxes both have area 0). That case should be fixed by setting TRAIN/TEST.RPN_MIN_SIZE > 0. But it seems that this is not the only problem...
Could you please try switching off the RPN NMS (set RPN_NMS_THRESH to 0.0) and see if it works then?
Maybe the problem could also be that the number of anchors/proposals is too small when we apply NMS to regressed boxes generated on very small feature maps (due to the reduced input image size).
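To make the degenerate-box case concrete, here is a minimal pure-Python sketch of the IoU step used inside NMS (py-faster-rcnn's +1 box convention); it illustrates the failure mode and is not Detectron's actual Cython code:

```python
def pairwise_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] form, using the +1 size convention.

    A badly regressed box with x2 < x1 (or y2 < y1) has area 0. If both boxes
    are degenerate, inter and union are both 0 and the division blows up,
    which is why setting TRAIN/TEST.RPN_MIN_SIZE > 0 filters such boxes first.
    """
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]) + 1)
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]) + 1)
    inter = w * h
    area = lambda box: (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
    union = area(a) + area(b) - inter
    return inter / union  # 0/0 here is the crash described above

print(pairwise_iou([0, 0, 9, 9], [0, 0, 9, 9]))  # identical boxes -> 1.0
```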
@pfollmann Thanks for your information! I may try it tomorrow. Does the bug still exist even though TRAIN/TEST.RPN_MIN_SIZE > 0?
Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations, in the style you describe above.
I think that I found the problem: It is in detectron/utils/cython_nms.pyx:
cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]
The numpy.argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work the following changes are necessary:
- Place argsort.pyx in detectron/utils
- Change line 13 in argsort.pyx to ctypedef cnp.float32_t FLOAT_t
- Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx)
- Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:
import utils.argsort as argsort
...
cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
argsort.argsort(-scores, order)
- Run make in the Detectron directory to recompile the Cython modules.
For me, training now runs fine for 20k iterations and inference has had no more seg-faults (plus a little speed-up ;-) ).
The open question is still why cython_nms.pyx worked fine for other TRAIN/TEST.SCALE configurations. In my experience the problem was not the image scale itself, but objects becoming very small when images are rescaled to small sizes.
Hope that helps!
PS: My current Detectron version is quite far from master, so I'm not sure I'll find time to open a PR soon.
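For anyone applying the patch, the property it relies on is that sorting by negated scores in ascending order yields the same descending order as scores.argsort()[::-1] (up to tie-breaking). A quick NumPy sanity check of that equivalence, independent of the Cython module:

```python
import numpy as np

scores = np.array([0.9, 0.1, 0.75, 0.3], dtype=np.float32)

old_order = scores.argsort()[::-1]  # the original cython_nms.pyx expression
new_order = np.argsort(-scores)     # what the patched argsort call computes

# With distinct scores the two index orders are identical,
# and both sort the scores into descending order.
assert (old_order == new_order).all()
assert (scores[new_order][:-1] >= scores[new_order][1:]).all()
```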
@pfollmann Wow! So cool! I will try it soon. Thanks!
@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master.
Looking forward to your PR :) You might want to fetch master and modify the corresponding files there; the steps on master are no different from the ones you described above.
@pfollmann Thanks, you saved my day!
Hi, I ran into this problem when using the above method. Do you have any idea how to solve it? Thank you very much:
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
import utils.argsort as argsort
ImportError: No module named utils.argsort
@shenghsiaowong I think you should use import detectron.utils.argsort as argsort, because the project structure changed after @pfollmann posted his solution.
I have changed it, but it still does not work. I know this is a small issue, but I have no idea:
File "detectron/utils/cython_nms.pyx", line 27, in init detectron.utils.cython_nms
#import detectron.utils.argsort as argsort
ImportError: No module named utils.argsort
What is the meaning of this? Thank you:
import detectron.utils.cython_nms as cython_nms
File "detectron/utils/cython_nms.pyx", line 28, in init detectron.utils.cython_nms
import detectron.utils.argsort as argsort
ImportError: dynamic module does not define init function (initargsort)
@shenghsiaowong Sorry, I haven't run into this error. @pfollmann, do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :)
@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice another error happened: Error in 'python': free(): invalid next size (fast). Could you give any advice about it?
@shenghsiaowong Try pasting import detectron.utils.argsort as argsort after:
cimport cython
import numpy as np
cimport numpy as np
@karenyun I am running into the same problem; did you figure it out?
@karenyun @StepOITD I met the same problem and solved it as follows:
(1) change the file detectron/utils/cython_nms.pyx as pfollmann suggested;
(2) put the following statements after the definition of ndets. In other words, change the original code snippet:
cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]
cdef int ndets = dets.shape[0]
cdef np.ndarray[np.int_t, ndim=1] suppressed = \
np.zeros((ndets), dtype=np.int)
to
cdef int ndets = dets.shape[0]
cdef np.ndarray[np.int_t, ndim=1] suppressed = \
np.zeros((ndets), dtype=np.int)
cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
argsort.argsort(-scores, order)
Hope it helps.
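In plain NumPy terms, the reordering above just makes sure ndets is known before the order array is allocated and filled. A rough Python sketch of the same control flow, with np.argsort standing in for the Cython argsort module (so this is an illustration, not the actual .pyx code):

```python
import numpy as np

def nms_prologue(dets, scores):
    """Mimic the fixed cython_nms.pyx prologue: compute ndets first,
    then allocate and fill the descending-score order array."""
    ndets = dets.shape[0]
    suppressed = np.zeros((ndets,), dtype=int)
    order = np.empty((ndets,), dtype=np.intp)
    order[:] = np.argsort(-scores)  # stand-in for argsort.argsort(-scores, order)
    return suppressed, order

dets = np.zeros((4, 5), dtype=np.float32)
scores = np.array([0.5, 0.9, 0.1, 0.7], dtype=np.float32)
print(nms_prologue(dets, scores)[1])  # indices of scores in descending order
```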