Detectron: Multi-GPU training throws an illegal memory access

Created on 2018-01-25  ·  64 comments  ·  Source: facebookresearch/Detectron

When I train with one GPU there is no problem, but when I use two or four GPUs the problem appears. Log output:

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516866180 (unix time) try "date -d @1516866180" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
PC: @     0x7ff67559f428 gsignal
terminate called recursively
terminate called recursively
E0125 07:43:00.745853 55683 pybind_state.h:422] Exception encountered running PythonOp function: RuntimeError: [enforce fail at context_gpu.h:307] error == cudaSuccess. 77 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/core/context_gpu.h:307: an illegal memory access was encountered

At:
/mnt/hzhida/facebook/detectron/lib/ops/generate_proposals.py(101): forward
*** SIGABRT (@0x3e80000d84f) received by PID 55375 (TID 0x7ff453fff700) from PID 55375; stack trace: ***
terminate called recursively
    @     0x7ff675945390 (unknown)
    @     0x7ff67559f428 gsignal
    @     0x7ff6755a102a abort
    @     0x7ff66f37e84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7ff66f37c6b6 (unknown)
    @     0x7ff66f37c701 std::terminate()
    @     0x7ff66f3a7d38 (unknown)
    @     0x7ff67593b6ba start_thread
    @     0x7ff67567141d clone
    @                0x0 (unknown)
Aborted (core dumped)

upstream bug

Most helpful comment

Thanks guys. This confirms my assumption that the illegal memory access comes from the Add op not properly handling cross-device communication when peer access is not enabled. Will issue a fix.

All 64 comments

I hit the same error. The difference is that when I use one or two GPUs there is no problem, but when training Mask RCNN (mask_rcnn_R-101-FPN) or RetinaNet (retinanet_R-101-FPN) with 4 GPUs, the same problem appears.

I ran into the same problem when training the tutorial Res50 network with two or more GPUs.

Same problem here when specifying GPU ids (i.e. different from the lowest ids, e.g. "1,3,5,7" for 4 GPUs). If the lowest GPU ids are specified, training runs fine.

@jwnsu: we are working on a fix so that training still works when CUDA_VISIBLE_DEVICES does not use the lowest ids. Thanks for the report and diagnosis.

@jwnsu, @coolbrain, @tshizys, @lwher: we are unable to reproduce this issue on our side.

Could each of you provide more information that might reveal a common pattern?

In particular:

  • Operating system: ?
  • Compiler version: ?
  • CUDA version: ?
  • cuDNN version: ?
  • NVIDIA driver version: ?
  • GPU models (for all devices, if they are not all the same): ?
  • Anything else that seems relevant: ?

Here is what we see when training, e.g., with GPU ids 1,3,5,7:

CUDA_VISIBLE_DEVICES=1,3,5,7 python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml OUTPUT_DIR /tmp/dbg-cvd-train TRAIN.DATASETS "('coco_2014_minival',)" NUM_GPUS 4

Every 0.1s: nvidia-smi                                                                                                                                                                                                                                                                                                                             Fri Jan 26 09:09:26 2018

Fri Jan 26 09:09:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:07:00.0     Off |                  Off |
|  0%   42C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:08:00.0     Off |                  Off |
|  0%   51C    P0   144W / 250W |   7214MiB / 12209MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40           On   | 0000:09:00.0     Off |                  Off |
|  0%   38C    P8    19W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 0000:0A:00.0     Off |                  Off |
|  0%   52C    P0   220W / 250W |   7502MiB / 12209MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40           On   | 0000:0B:00.0     Off |                  Off |
|  0%   40C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40           On   | 0000:0C:00.0     Off |                  Off |
|  0%   60C    P0    85W / 250W |   7081MiB / 12209MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40           On   | 0000:0D:00.0     Off |                  Off |
|  0%   40C    P8    20W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40           On   | 0000:0E:00.0     Off |                  Off |
|  0%   56C    P0    81W / 250W |   7494MiB / 12209MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7210MiB |
|    3   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7498MiB |
|    5   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7077MiB |
|    7   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7490MiB |
+-----------------------------------------------------------------------------+

OS: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

OS: CentOS Linux release 7.1.1503
Compiler version: gcc version 4.8.2
CUDA version: CUDA 8.0
cuDNN version: cuDNN 6.0.21
NVIDIA driver version: 375.26
GPU models: 4x GeForce GTX TITAN X (12G)

nvidia-smi:
image

When using 4 GPUs (0,1,2,3) to train Mask RCNN (e2e_mask_rcnn_R-101-FPN), RetinaNet (retinanet_R-101-FPN) or Faster RCNN (e2e_faster_rcnn_R-50-FPN), the error is "context_gpu.h:307: an illegal memory access was encountered" or "context_gpu.h:170. Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/retnet_cls_pred_fpn3_b_grad" input: "gpu_2/retnet_cls_pred_fpn3_b_grad" output: "gpu_0/retnet_cls_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }".

But using one GPU or two GPUs (0,1 or 2,3), it trains normally.
Thanks.

@jwnsu: looking at your error more closely ("invalid device ordinal"), it seems you are trying to train with a config set up for 8 GPUs while restricting the process to only 4 (via CUDA_VISIBLE_DEVICES). The "invalid device ordinal" error occurs because it tries to create ops on devices the process does not have access to.
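In other words, NUM_GPUS in the config (or on the command line) has to match the number of devices the process can actually see. A sketch of a consistent invocation, where the config path and device list are only examples borrowed from elsewhere in this thread:

# 4-GPU config paired with exactly four visible devices; adjust both to your setup.
CUDA_VISIBLE_DEVICES=1,2,4,5 python2 tools/train_net.py \
  --cfg configs/12_2017_baselines/retinanet_R-101-FPN_1x_4gpus.yaml \
  OUTPUT_DIR /tmp/output NUM_GPUS 4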

@coolbrain, @tshizys: thanks for the details. What happens if you use two GPUs with ids {0,2}, {0,3}, {1,2}, or {1,3}?

@rbgirshick you are right, the wrong config file (set up for 8 GPUs) was picked for yesterday's attempt. Just tried again with the correct config file (4 GPUs; gpu ids "1,2,4,5" fail, while "0,1,2,3" works fine without error), and the error now looks similar to what others are seeing:

I0127 09:06:48.220716 10872 context_gpu.cu:325] Total: 20748 MB
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/retnet_bbox_pred_fpn3_b_grad" input: "gpu_2/retnet_bbox_pred_fpn3_b_grad" output: "gpu_0/retnet_bbox_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_2/retnet_cls_conv_n3_fpn3" input: "gpu_2/__m13_shared" output: "gpu_2/__m13_shared" name: "" type: "ReluGradient" arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
*** Aborted at 1517072808 (unix time) try "date -d @1517072808" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
PC: @     0x7fd71f6bd428 gsignal
*** SIGABRT (@0x3e900002a18) received by PID 10776 (TID 0x7fd548e3d700) from PID 10776; stack trace: ***
    @     0x7fd71fa63390 (unknown)
    @     0x7fd71f6bd428 gsignal
    @     0x7fd71f6bf02a abort
    @     0x7fd71b51c84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd71b51a6b6 (unknown)
    @     0x7fd71b51a701 std::terminate()
    @     0x7fd71b545d38 (unknown)
    @     0x7fd71fa596ba start_thread
    @     0x7fd71f78f41d clone
    @                0x0 (unknown)
./itrain4.sh: line 9: 10776 Aborted                 (core dumped) python2 tools/train_net.py --multi-gpu-testing --cfg configs/iret-rn50-fpn-voc.yaml OUTPUT_DIR ./output

@coolbrain, @tshizys: one shot in the dark is to switch the all-reduce implementation to NCCL by passing USE_NCCL True to train_net.py, as in:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_2gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR /tmp/output USE_NCCL True

This will require Caffe2 to have been built with the nccl ops -- I'm not sure if that is done by default or if it will require some work to rebuild Caffe2 with nccl support.
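One quick way to check whether the nccl ops are present is to look for the registered operator; this is only a sketch and assumes core.IsOperator is available in your Caffe2 build:

from caffe2.python import core
# If Caffe2 was compiled with NCCL support, the NCCLAllreduce op should be registered.
print(core.IsOperator('NCCLAllreduce'))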

@rbgirshick, the error persists when using two GPUs, i.e. {0,2}, {0,3}, {1,2}, {1,3}. For example, here are the details when using {0,3} and training RetinaNet (retinanet_R-101-FPN):

F0128 12:09:08.461153  4938 context_gpu.cu:387] Error at: /home/yszhu/local/caffe2/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
terminate called recursively
terminate called recursively
*** Aborted at 1517112548 (unix time) try "date -d @1517112548" if you are using GNU date ***
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/fpn_6_relu" input: "gpu_0/fpn_7_w" input: "gpu_0/__m23_shared" output: "gpu_0/fpn_7_w_grad" output: "gpu_0/fpn_7_b_grad" output: "gpu_0/__m22_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
    @     0x7f2bdf712772 google::LogMessage::Fail()
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e8000012b7) received by PID 4791 (TID 0x7f2a6effd700) from PID 4791; stack trace: ***
    @     0x7f2bdf7126ce google::LogMessage::SendToLog()
    @     0x7f2c2670e130 (unknown)
    @     0x7f2bdf71204c google::LogMessage::Flush()
    @     0x7f2c25c6a5d7 __GI_raise
    @     0x7f2bdf71556d google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2c25c6bcc8 __GI_abort
    @     0x7f2c1b1b1965 __gnu_cxx::__verbose_terminate_handler()
    @     0x7f2bdfdd1180 caffe2::CUDAContext::Delete()
    @     0x7f2c1b1af946 (unknown)
    @     0x7f2be27f42d9 std::_Sp_counted_base<>::_M_release()
    @     0x7f2c1b1af973 std::terminate()
    @     0x7f2c1b2062c5 (unknown)
    @     0x7f2bdfd377d1 caffe2::Tensor<>::ResizeLike<>()
    @     0x7f2c26706df5 start_thread
    @     0x7f2bdfd6e3e2 _ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIffffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT_
    @     0x7f2c25d2b1ad __clone
    @     0x7f2bdfd707e1 caffe2::CudnnConvGradientOp::DoRunWithType<>()
    @                0x0 (unknown)

image

The form of the error is not exactly the same every time, but it is always "Encountered CUDA error: an illegal memory access was encountered".

I also rebuilt caffe2 with nccl-1.3.5 (following https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud#null__troubleshooting):

image

and switched the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as follows:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/12_2017_baselines/retinanet_R-101-FPN_1x_4gpus.yaml \
  OUTPUT_DIR results_retinanet_R-101-FPN_1x_4gpus_model USE_NCCL True

For either four GPUs {0,1,2,3} or any two of {0,2}, {0,3}, {1,2}, {1,3}, the error disappears ^-^.
@rbgirshick, thanks very much.

Hi, I turned on the nccl op to train the tutorial network and the error above disappeared. However, the program hangs after loading data and keeps occupying 100% CPU.

.......
I0129 03:25:13.106998 118074 context_gpu.cu:321] GPU 0:2175 MB
I0129 03:25:13.107028 118074 context_gpu.cu:321] GPU 1:2078 MB
I0129 03:25:13.107045 118074 context_gpu.cu:321] GPU 2:2266 MB
I0129 03:25:13.107059 118074 context_gpu.cu:321] GPU 3:1860 MB
I0129 03:25:13.107072 118074 context_gpu.cu:325] Total: 8381 MB
I0129 03:25:13.122316 118079 context_gpu.cu:321] GPU 0:2195 MB
I0129 03:25:13.122344 118079 context_gpu.cu:321] GPU 1:2145 MB
I0129 03:25:13.122361 118079 context_gpu.cu:321] GPU 2:2267 MB
I0129 03:25:13.122378 118079 context_gpu.cu:321] GPU 3:1924 MB
I0129 03:25:13.122395 118079 context_gpu.cu:325] Total: 8532 MB
I0129 03:25:13.151623 118079 context_gpu.cu:321] GPU 0:2245 MB
I0129 03:25:13.151650 118079 context_gpu.cu:321] GPU 1:2159 MB
I0129 03:25:13.152823 118079 context_gpu.cu:321] GPU 2:2269 MB
I0129 03:25:13.153623 118079 context_gpu.cu:321] GPU 3:2020 MB
I0129 03:25:13.154454 118079 context_gpu.cu:325] Total: 8694 MB
I0129 03:25:13.186017 118079 context_gpu.cu:321] GPU 0:2260 MB
I0129 03:25:13.186053 118079 context_gpu.cu:321] GPU 1:2214 MB
I0129 03:25:13.186067 118079 context_gpu.cu:321] GPU 2:2279 MB
I0129 03:25:13.186077 118079 context_gpu.cu:321] GPU 3:2080 MB
I0129 03:25:13.186089 118079 context_gpu.cu:325] Total: 8835 MB
I0129 03:25:13.215306 118076 context_gpu.cu:321] GPU 0:2310 MB
I0129 03:25:13.215342 118076 context_gpu.cu:321] GPU 1:2269 MB
I0129 03:25:13.215351 118076 context_gpu.cu:321] GPU 2:2308 MB
I0129 03:25:13.215368 118076 context_gpu.cu:321] GPU 3:2081 MB
I0129 03:25:13.215384 118076 context_gpu.cu:325] Total: 8970 MB
I0129 03:25:13.307595 118084 context_gpu.cu:321] GPU 0:2310 MB
I0129 03:25:13.307623 118084 context_gpu.cu:321] GPU 1:2301 MB
I0129 03:25:13.307641 118084 context_gpu.cu:321] GPU 2:2391 MB
I0129 03:25:13.307652 118084 context_gpu.cu:321] GPU 3:2104 MB
I0129 03:25:13.307665 118084 context_gpu.cu:325] Total: 9108 MB
I0129 03:25:13.324935 118077 context_gpu.cu:321] GPU 0:2312 MB
I0129 03:25:13.324965 118077 context_gpu.cu:321] GPU 1:2313 MB
I0129 03:25:13.324982 118077 context_gpu.cu:321] GPU 2:2452 MB
I0129 03:25:13.324993 118077 context_gpu.cu:321] GPU 3:2171 MB
I0129 03:25:13.325011 118077 context_gpu.cu:325] Total: 9250 MB
I0129 03:25:13.343673 118080 context_gpu.cu:321] GPU 0:2336 MB
I0129 03:25:13.343698 118080 context_gpu.cu:321] GPU 1:2380 MB
I0129 03:25:13.343715 118080 context_gpu.cu:321] GPU 2:2468 MB
I0129 03:25:13.343731 118080 context_gpu.cu:321] GPU 3:2233 MB
I0129 03:25:13.343747 118080 context_gpu.cu:325] Total: 9417 MB
I0129 03:25:13.369802 118085 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,
I0129 03:25:13.381914 118076 context_gpu.cu:321] GPU 0:2361 MB
I0129 03:25:13.381942 118076 context_gpu.cu:321] GPU 1:2453 MB
I0129 03:25:13.381961 118076 context_gpu.cu:321] GPU 2:2524 MB
I0129 03:25:13.381978 118076 context_gpu.cu:321] GPU 3:2247 MB
I0129 03:25:13.381995 118076 context_gpu.cu:325] Total: 9587 MB
I0129 03:25:13.613253 118083 context_gpu.cu:321] GPU 0:2388 MB
I0129 03:25:13.613292 118083 context_gpu.cu:321] GPU 1:2525 MB
I0129 03:25:13.613301 118083 context_gpu.cu:321] GPU 2:2524 MB
I0129 03:25:13.613308 118083 context_gpu.cu:321] GPU 3:2310 MB
I0129 03:25:13.613315 118083 context_gpu.cu:325] Total: 9748 MB

The program hangs......

My environment:
OS: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

@lwher: that's unfortunate. The reason we don't use NCCL by default is that it is prone to causing deadlocks, which is what I think you are seeing.

After rebuilding caffe2 with NCCL, I rerun the program with this script:
python tools/train_net.py \
  --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_4gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR ./output USE_NCCL True

It throws this error:

Creating NCCLContext for key: 0:0,1,2,3,
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0.  Error at: /mnt/hzhida/project/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" input: "gpu_2/rpn_cls_logits_fpn2_w_grad" input: "gpu_3/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" output: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_2/rpn_cls_logits_fpn2_w_grad" output: "gpu_3/rpn_cls_logits_fpn2_w_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1517210588 (unix time) try "date -d @1517210588" if you are using GNU date ***
PC: @     0x7ff1e0383428 gsignal
*** SIGABRT (@0x3e800007a46) received by PID 31302 (TID 0x7fefb5ffb700) from PID 31302; stack trace: ***
I0129 07:23:08.187249 31591 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called recursively
    @     0x7ff1e0729390 (unknown)
I0129 07:23:08.188051 31592 context_gpu.cu:321] GPU 0: 2466 MB
I0129 07:23:08.188074 31592 context_gpu.cu:321] GPU 1: 2387 MB
I0129 07:23:08.188091 31592 context_gpu.cu:321] GPU 2: 2311 MB
I0129 07:23:08.188099 31592 context_gpu.cu:321] GPU 3: 2382 MB
I0129 07:23:08.188107 31592 context_gpu.cu:325] Total: 9548 MB
    @     0x7ff1e0383428 gsignal
    @     0x7ff1e038502a abort
    @     0x7ff1da16284d __gnu_cxx::__verbose_terminate_handler()
    @     0x7ff1da1606b6 (unknown)
    @     0x7ff1da160701 std::terminate()
    @     0x7ff1da18bd38 (unknown)
    @     0x7ff1e071f6ba start_thread
    @     0x7ff1e045541d clone
    @                0x0 (unknown)
Aborted (core dumped)

Runtime environment:
OS: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Another note on NCCL: Caffe2 uses NCCL by default.

Jumping in on this: since the illegal memory access comes from the Add op, you might want to check whether direct peer access is possible between the GPUs you are using. The current Add op relies on it, and if it is not available we might indeed want to fix the code. Basically, to do so, in python, run the following:

from caffe2.python import workspace
print(workspace.GetCudaPeerAccessPattern())

Could you paste its output for debugging? (In particular, if you are using CUDA_VISIBLE_DEVICES, make sure you invoke python with it as well.)
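For example, if training is launched with a restricted device set, the same restriction should be applied when running the check (the device list below is just an example):

CUDA_VISIBLE_DEVICES=1,3,5,7 python2 -c "from caffe2.python import workspace; print(workspace.GetCudaPeerAccessPattern())"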

@Yangqing output from your two debugging lines:

[[ True  True False False]
 [ True  True False False]
 [False False  True  True]
 [False False  True  True]]

Thanks for looking into this issue (and ... the caffe/caffe2 frameworks!)

@jwnsu thanks! Just to confirm, so the Add operator is adding tensors across gpus {0,1} and {2,3}, right? (I assume it is adding things from 4 GPUs.)

It's a 4-GPU config, with GPU ids specified as "0,1,2,4" (via CUDA_VISIBLE_DEVICES). If the GPU ids are configured as "0,1,2,3" (the lowest GPU ids), it works fine without error.

@Yangqing
My Linux server has 4 M60 GPUs.
Here is my workspace.GetCudaPeerAccessPattern() output:
[[ True False False False]
 [False  True False False]
 [False False  True False]
 [False False False  True]]

I can train the net with 1 GPU just fine, but when I use 2 or 4 GPUs, I hit the same problem as above, even with NCCL=True set.

Thanks guys. This confirms my assumption that the illegal memory access comes from the Add op not properly handling cross-device communication when peer access is not enabled. Will issue a fix.

Same problem with cross-device communication...
This machine can use 4 GPUs [0,1,2,3]:
image
This machine can use [0,1] and [2,3]:
image

By the way, I have trained a 3D Faster RCNN in the pytorch framework with 12 CPUs and 4 Titan X. Why doesn't Pytorch have this problem?

@Yangqing Since I cannot train Detectron with multiple GPUs, I would like to know how long it will take you to fix the cross-GPU communication issue? Thanks.

@Yangqing I ran into a problem similar to the above. My Linux workstation has 2 GTX-1080Ti. The error message is as follows:
[enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
and my workspace.GetCudaPeerAccessPattern() output is:
[[ True False]
 [False  True]]
Is it also a cross-GPU communication problem? If not, can anyone help me fix it?

Yes, it is the same problem. The gradients across GPUs cannot be added together because the GPUs cannot communicate with each other. If you want to work around it, maybe you can copy the gradients from the GPUs to the CPU, then add them up and take the average. Finally, copy the averaged gradients from the CPU back to the GPUs. @blateyang
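For what it's worth, here is a minimal sketch of that CPU-staged workaround. The helper name, the blob naming scheme and the averaging step are made up for illustration; this is not Detectron's actual code path (the later muji.py patch addresses it on the GPU side instead):

from caffe2.proto import caffe2_pb2
from caffe2.python import core

def allreduce_via_cpu(net, grad_name, gpu_ids):
    # net is a caffe2 core.Net; grad_name is a per-GPU gradient blob name.
    # Copy each GPU's gradient to the host.
    cpu_blobs = []
    for g in gpu_ids:
        with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, g)):
            cpu_blobs.append(net.CopyGPUToCPU(
                "gpu_{}/{}".format(g, grad_name),
                "cpu_{}/{}".format(g, grad_name)))
    # Sum and average on the CPU, where no peer access is needed.
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU)):
        avg = net.Sum(cpu_blobs, "cpu/{}_sum".format(grad_name))
        avg = net.Scale(avg, avg, scale=1.0 / len(gpu_ids))
    # Copy the averaged gradient back to every GPU.
    for g in gpu_ids:
        with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, g)):
            net.CopyCPUToGPU(avg, "gpu_{}/{}".format(g, grad_name))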

Thanks for your advice! @coolbrain But I don't understand why some people can successfully train models with two or more GPUs. Haven't they run into the same cross-GPU communication problem?

Here, training with 4 GPUs using the lowest GPU ids (0,1,2,3) or the highest (4,5,6,7) works without any error (8 GPUs probably works too, but I haven't tried it). It only has issues when mixing specific ids, e.g. "0,1,2,4" or "1,3,5,7".

The suspected caffe2 cross-GPU communication issue probably varies with the particular hardware setup (rbgirshick mentioned earlier that the Facebook M40 server works with mixed ids too).

Hit the same problem. Has this been fixed?

I met the same problem on a workstation with 4 GTX 1080 Ti GPUs. Multi-GPU also works well on other platforms such as caffe and tensorflow.
Here is my workspace.GetCudaPeerAccessPattern() output:
[[ True  True False False]
 [ True  True False False]
 [False False  True  True]
 [False False  True  True]]
Two-GPU configurations (using {0,1} or {2,3}) work well. Three or four GPUs run into the problem described above. However, my error is not on the Add op; I remember the type is Copy.

Has the problem been solved?

@rbgirshick Hi, I ran into the same problem as @lwher. The program seems to have nearly a 50% chance of hanging with NCCL on a machine with Ubuntu 14.04 and 4 GPUs. Is there any solution to avoid this behaviour of NCCL? Many thanks!

@Yangqing Hello, I ran into the same problem with the Copy operator.
When I don't add the USE_NCCL True flag, the error is as follows:

E0325 02:26:02.258566  8284 operator_schema.cc:73] Input index 0 and output idx 0 (gpu_0/res3_0_branch2a_w_grad) are set to be in-place but this is actually not supported by op Copy
Original python traceback for operator 2817 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 84, in _add_allreduce_graph
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 64, in Allreduce
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 204, in AllreduceFallback
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 210, in train_model
    setup_model_for_training(model, output_dir)
  File "tools/train_net.py", line 316, in setup_model_for_training
    workspace.CreateNet(model.net)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 166, in CreateNet
    StringifyProto(net), overwrite,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:125] schema->Verify(operator_def). Operator def did not pass schema checking: input: "gpu_0/res3_0_branch2a_w_grad" output: "gpu_0/res3_0_branch2a_w_grad" name: "" type: "Copy" device_option { device_type: 1 cuda_gpu_id: 0 }

If I add the USE_NCCL True flag, the error becomes:

Original python traceback for operator 2928 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 82, in _add_allreduce_graph
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 217, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 230, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0.  Error at: /home/shuqin/git/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" input: "gpu_2/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" output: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_2/rpn_cls_logits_fpn2_b_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }

My system is Ubuntu 14.04, with CUDA 8.0 and cuDNN 5.1. My machine has 8 GPUs, but I only tested the code on the last 4, so communication between the GPUs should not be a problem. I use NCCL 2.1.15 for CUDA 8.0.

Hope this problem gets solved soon. It's quite annoying.

This issue still exists, right?

I managed to start training by adding "USE_NCCL True" when running multi-GPU training. Although deadlocks may occur sometimes, you can try modifying some training parameters (such as the learning rate) to work around it.

The problem still exists.

@xieshuqin I hit the same problem 'status == ncclSuccess. 2 vs 0.' as you when using "USE_NCCL True". How did you fix this issue?

@pkuxwguan My problem has been fixed, but I forgot how I fixed it. Sorry for that. But I do remember the problem should be related to an incorrect installation of NCCL.

Hi all, I also suffered from this problem, and I finally solved it myself. https://github.com/pytorch/pytorch/pull/6896 solves this issue :)

Could anyone tell me whether I can run Mask R-CNN with only one GPU?

@daquexian I tried your PR and it works!!! Thanks a lot.

@daquexian This PR doesn't seem to work for me. I hit a deadlock when using a single GPU without NCCL, and also when using 2 GPUs with USE_NCCL True. After changing muji.py according to your PR and running with USE_NCCL True on 2 GPUs, I still hit the deadlock; training just pauses at a random iteration number.

Thanks for trying it :) You don't need to set USE_NCCL=True if you use my PR. NCCL and "muji" are two different GPU communication methods. My PR is a patch for muji, which previously required GPU peer access; it is not for nccl. Just set USE_NCCL=False and my PR will work.

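Concretely, with the patched muji.py in place the launch command stays the same, just without NCCL (the config path below is only an example taken from earlier in this thread):

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_4gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR /tmp/output USE_NCCL False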

Maybe I'm missing something, but if I set USE_NCCL=False and use your modified muji.py and muji_test.py from the PR, I get the original error:

I0502 14:35:57.192476 79712 context_gpu.cu:318] Total: 23025 MB
E0502 14:35:58.382604 79711 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
E0502 14:35:58.382622 79712 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 14:35:58.382670 79711 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 14:35:58.383510 79709 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res3_3_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m18_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_1/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383541 79713 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at conv_op_cudnn.cc:1290] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /home/markable-ai/pytorch/caffe2/operators/conv_op_cudnn.cc:1290: CUDNN_STATUS_EXECUTION_FAILED Error from operator: 
input: "gpu_3/conv_rpn_fpn4" input: "gpu_3/rpn_bbox_pred_fpn2_w" input: "gpu_3/rpn_bbox_pred_fpn4_grad" output: "_gpu_3/rpn_bbox_pred_fpn2_w_grad_autosplit_1" output: "_gpu_3/rpn_bbox_pred_fpn2_b_grad_autosplit_1" output: "gpu_3/__m13_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383591 79706 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn3" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn3_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_2" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_2" output: "_gpu_3/conv_rpn_fpn3_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 14:35:58.434631 79709 context_gpu.h:107] FCheck failed: error == cudaSuccess an illegal memory access was encountered0502 14:35:58.434648 79713 c*** Check failure stack trace: ***
E0502 14:35:58.383741 79700 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn2" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn2_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_3" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_3" output: "_gpu_3/conv_rpn_fpn2_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
Aborted (core dumped)

I'm using CUDA 9.1 and cuDNN 7.1 with 4 V100s.

@Feynman27 Could you tell me which branch of Allreduce in the updated muji.py gets entered (for example Allreduce4, Allreduce4Group2, Allreduce2 or something else)? You might want to add some print statements in these branches to find out. And what happens if you replace the implementation of Allreduce by just calling AllreduceFallback? It would also be great if you could provide your GPU access pattern, as in https://github.com/facebookresearch/Detectron/issues/32#issuecomment-361739340. Thanks!
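To get a first guess without touching the library, a rough standalone mirror of the selection logic can be run; the helper below is hypothetical and only approximates what muji.Allreduce is believed to do:

import numpy as np
from caffe2.python import workspace

def guess_allreduce_branch(pattern):
    # Hypothetical mirror of muji's branch selection, for debugging only.
    pattern = np.asarray(pattern)
    n = pattern.shape[0]
    if n == 4 and pattern.all():
        return "Allreduce4"          # full peer access among 4 GPUs
    if n == 4 and pattern[:2, :2].all() and pattern[2:, 2:].all():
        return "Allreduce4Group2"    # peer access within {0,1} and {2,3}
    if n == 2 and pattern.all():
        return "Allreduce2"
    return "AllreduceFallback"       # no usable peer-access pattern

print(guess_allreduce_branch(workspace.GetCudaPeerAccessPattern()))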

Allreduce4 is being called. The GPU access pattern is:

>>> from caffe2.python import workspace
>>> print(workspace.GetCudaPeerAccessPattern())
[[ True False False False]
 [False  True False False]
 [False False  True False]
 [False False False  True]]

I'll try calling AllreduceFallback.

Calling AllreduceFallback gives an error similar to the above:

I0502 17:08:51.294476 88651 context_gpu.cu:318] Total: 22524 MB
E0502 17:08:52.009866 88659 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 17:08:52.009990 88659 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 17:08:52.010440 88651 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_2/fpn_res3_3_sum" input: "gpu_2/conv_rpn_fpn2_w" input: "gpu_2/__m15_shared" output: "_gpu_2/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_2/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_2/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
E0502 17:08:52.010524 88663 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res2_2_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m12_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_3" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_3" output: "_gpu_1/fpn_res2_2_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
E0502 17:08:52.010577 88653 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_0/fpn_res4_22_sum" input: "gpu_0/conv_rpn_fpn2_w" input: "gpu_0/__m15_shared" output: "_gpu_0/conv_rpn_fpn2_w_grad_autosplit_1" output: "_gpu_0/conv_rpn_fpn2_b_grad_autosplit_1" output: "_gpu_0/fpn_res4_22_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 17:08:52.061749 88653 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped

@Feynman27 That's strange. According to your GPU access pattern, AllreduceFallback rather than Allreduce4 should be called. And when you call AllreduceFallback manually, the error message doesn't seem to come from AllreduceFallback. Did you change the muji.py in the right folder? For example, if the Python package of caffe2 is in /usr/lib/python/site-packages/caffe2, then changing muji.py in caffe2's source folder (like ~/caffe2/python) won't work.

@Feynman27 Did you rebuild caffe2?

@daquexian The caffe2 package is installed under pytorch/caffe2, not /usr/lib/python/site-packages/caffe2 or anywhere else. I set $PYTHONPATH to look in this directory. I also confirmed this by:

Python 2.7.14 |Anaconda, Inc.| (default, Mar 27 2018, 17:29:31) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import caffe2
>>> caffe2.__file__
'/home/markable-ai/pytorch/build/caffe2/__init__.pyc'
>>> from caffe2.python import muji
>>> muji.__file__
'/home/markable-ai/pytorch/build/caffe2/python/muji.pyc'
>>> 

I simply modified the file muji.py under pytorch/caffe2/python/muji.py.

@yuzcccc I didn't rebuild caffe2, but why would I have to rebuild it? I'm just modifying a Python file.

@Feynman27 I think you should modify the muji.py under /home/markable-ai/pytorch/build/caffe2/python/muji.py.

Yes, that was an oversight on my part. Good catch. I was modifying pytorch/caffe2/python/muji.py and should have modified pytorch/build/caffe2/python/muji.py.

@Feynman27 Glad to see it works :)
@Yangqing Could you review my PR https://github.com/pytorch/pytorch/pull/6896? It might help many Detectron users :)

@daquexian Unfortunately, I still seem to be hitting the deadlock.

@Feynman27 Hmm.. what is the value of USE_NCCL? It should be False.

Yes, USE_NCCL was set to False.

@Feynman27 Sorry, I don't know why it causes a deadlock. It's hard for me to reproduce.

Fair enough. For all I know, the deadlock I'm hitting may be unrelated to whether GPU peer access is enabled. Your PR definitely allows me to start training with USE_NCCL=False. I'm running on Azure machines, so it may have something to do with running on their VMs. I've started training on a local machine with 2 TitanX GPUs and the training seems to be progressing fine.

@daquexian Thanks! Your PR worked for me!

It seems this issue can be closed.

@gadcam thanks for helping to identify issues that can be closed!

For this one, I'd like to keep it open until the fix is merged into Caffe2.

@rbgirshick Unfortunately, no one has reviewed my PR :|

@rbgirshick Thanks! My PR https://github.com/pytorch/pytorch/pull/6896 has been merged. It seems this issue can be closed :)
