Caffe: BatchReindexLayer fails GPU gradient tests under CUDA v9.1

Created on 11 Jan 2018  ·  27 Comments  ·  Source: BVLC/caffe

Your system configuration

Operating system: CentOS 7.4.1708
Compiler: x86_64-conda_cos6-linux-gnu-g++, gcc version 7.2.0 (crosstool-NG)
Graphics card: nVIDIA GeForce GTX 1070
CUDA version (if applicable): 9.1
CUDNN version (if applicable): 7.0.5
BLAS: openblas 0.2.20
Python or MATLAB version (for pycaffe and matcaffe respectively):
Anaconda 3 5.0.1 64-bit Python 3.6.4

Steps to reproduce

As described in this issue about crosstool-NG compiled libraries' compatibility problems, I use Anaconda's gxx_linux-64 7.2.0 compiler to compile Caffe (commit e93b5e20) on CentOS 7 with this
Makefile.config, these Anaconda packages (including libopenblas, leveldb, lmdb, opencv, protobuf, glog, gflags, py-boost, libboost, ...), and the following commands:

PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 make runtest

However, make runtest failed at ./build/test/test_batch_reindex_layer.testbin with these error messages:

[----------] 2 tests from BatchReindexLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] BatchReindexLayerTest/3.TestForward
[       OK ] BatchReindexLayerTest/3.TestForward (3 ms)
[ RUN      ] BatchReindexLayerTest/3.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.68212591193169037, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.68212591193169037,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.01.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835761; objective+ = 3.4152213335738013; objective- = 3.4152213335738013
...
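
To iterate on just this case without rerunning the whole suite, the test binary should accept googletest's standard filter flag; the pattern below is just an illustration:

# run only the gradient tests from this binary (assumes the standard googletest --gtest_filter flag)
LD_LIBRARY_PATH=/usr/local/cuda/lib64 ./build/test/test_batch_reindex_layer.testbin --gtest_filter='*TestGradient*'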

After countless Caffe compilations and tests, I finally found a workaround for this problem: I add the line NVCCFLAGS += -G to the Makefile, changing it from

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
endif
...

to

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
        NVCCFLAGS += -G
endif
...

Then I compile Caffe again... and make runtest passes without failures!

I think the problem is unrelated to Python and the x86_64-conda_cos6-linux-gnu-g++ compiler, but related to nvcc (CUDA 9.1), so it might be reproducible with other g++ compilers. In fact, I have seen someone report the same problem on Ubuntu 16.04 + CUDA 9.1. I hope someone can fix this.
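
For anyone building with CMake instead of the Makefile, the same workaround can probably be applied by injecting the flag through FindCUDA's CUDA_NVCC_FLAGS variable; this is an untested sketch, and the variable name is an assumption based on Caffe's CMake scripts relying on FindCUDA:

# untested sketch of the CMake-based equivalent of NVCCFLAGS += -G
mkdir -p build && cd build
cmake -DCUDA_NVCC_FLAGS=-G ..
make -j8 && make runtest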

bug

Most helpful comment

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

All 27 comments

Confirmed on a standard Ubuntu 16.04 build both by myself (with GCC 5.4.0 and NVCC 9.1.85) and others: first in #6140, but also on caffe-users (thread1, thread2, thread3, thread 4).

Your workaround is to add a -G flag to NVCC even for the standard build, correct? This flag causes generation of debug information for GPU code and disables all optimizations [ref] - the latter effect seems more relevant.

Hi Noiredd,

Your workaround is to add a -G flag to NVCC even for the standard build, correct?

Yes.

Additionally, I found another runtest failure which is unrelated to NVCCFLAGS += -G but related to OpenCV 3.3.1: if I enable OpenCV (by commenting out USE_OPENCV := 0), make runtest fails at
./build/test/test_net.testbin with these error messages:

Cuda number of devices: 1
Current device id: 0
Current device name: GeForce GTX 1070
[==========] Running 124 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 26 tests from NetTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NetTest/0.TestHasBlob
[       OK ] NetTest/0.TestHasBlob (593 ms)
[ RUN      ] NetTest/0.TestGetBlob
[       OK ] NetTest/0.TestGetBlob (2 ms)
...
[ RUN      ] NetTest/0.TestSharedWeightsResume
[       OK ] NetTest/0.TestSharedWeightsResume (0 ms)
[ RUN      ] NetTest/0.TestParamPropagateDown
[       OK ] NetTest/0.TestParamPropagateDown (1 ms)
[ RUN      ] NetTest/0.TestFromTo
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float> (3 ms)
[ RUN      ] NetTest/0.TestReshape
...
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
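
Since these TestFromTo failures only show up with OpenCV 3.3.1 enabled, one quick check is which OpenCV libraries the test binary actually links against (helpful when a system OpenCV 2.4 and a conda OpenCV 3.x coexist):

# list the OpenCV shared libraries resolved for the failing test binary
ldd ./build/test/test_net.testbin | grep -i opencv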

Confirmed, the following tests fail on CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

@MrMYHuang's suggestion worked. You have to add NVCCFLAGS += -G to the Makefile and run

$ make clean & make all & make test & make runtest
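
Note that a single & puts each step in the background, so the four commands run concurrently rather than one after another; to chain them sequentially and stop at the first failure, the usual form is:

$ make clean && make all && make test && make runtest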

I submitted a bug report to NVIDIA. An NVIDIA staff member replied that the CUDA development team has identified this CUDA 9.1 issue and plans to fix it in the next release. For now, they suggest using CUDA 9.0.
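
If you do switch toolkits, one quick sanity check is to confirm which nvcc the build actually picks up, since several CUDA installs often coexist:

which nvcc && nvcc --version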

I also had this problem.

CUDA 9.1 + cuDNN 7.0.5 + caffe [ FAILED ] 1 test. mnist OK. my net failed
CUDA 9.0 + cuDNN 7.0.5 + caffe [ FAILED ] 2 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 7.0.5 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 6.0.21 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 5.0.5 + caffe-rc5 [ PASSED ] 2123 tests. mnist OK. my net OK.
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 [ PASSED ] 2123 tests. mnist OK. my net OK.

GTX 1080, Ubuntu 16.04.3, Driver Version: 387.34, i7 980X 3.3GHz, P6T-SE, RAM 6GB

CUDA 8.0 + cuDNN 7.0.5 passed "make runtest" and passed Caffe's mnist training.
But my net's training failed: its accuracy hit 100% after around 300 training iterations.
Training on the CPU was OK. I gave up on the latest Caffe. In the end,
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 was OK.

My net is for computer Go; it predicts the next move.
It has 12 conv layers, 128 channels, kernel_size 3, and no batch normalization.

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

I am also using CUDA 9.1 and cuDNN 7.0.5 and can confirm this failure. I actually came here to post another test failure I had, but when I was about to, I noticed that disabling multi-GPU fixed that failure and exposed this one. I will post that in a separate issue, though.

edit: actually, after unsetting the CUDA_VISIBLE_DEVICES variable, the other issue I am referring to oddly no longer occurs. I won't open an issue for it until I can get the log generated again; I might not have re-enabled multi-GPU support properly.

Unfortunately, even with the latest nvcc patch 2 released, the problem with
BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>,
still persists.

The issue still exists with a physical machine + single Pascal GPU + CentOS 7 + nvcc 9.1.85 + cuDNN 7.0.5.

Issue exists for me too:
Physical Machine
Ubuntu 16.04
Nvidia drivers: 390.48
CUDA: 9.1.85 + Patch 1,2,3
cuDNN: v7.1.2

Got:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

After adding "NVCCFLAGS += -G" as the OP suggested, there are no errors and all tests pass.

But what does adding this flag mean for us?
That is, are optimizations only disabled during make, or in the resulting binaries as well?
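
For what it's worth, -G is a compile-time flag, but its effect is baked into the built binaries: the device code is generated without optimizations, so it stays slower at runtime, not just during the build. One way to gauge the cost on your own setup is Caffe's built-in timing tool, run against builds made with and without the flag (the model path below is just the bundled AlexNet example; any prototxt whose inputs are defined in the file should do):

# rough GPU timing comparison between a -G build and a normal build
./build/tools/caffe time -model models/bvlc_alexnet/deploy.prototxt -iterations 50 -gpu 0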

I too had this error with @evilmtv's setup (except Ubuntu 14.04). I wanted to try following @Noiredd's suggestion and see whether this problem could be fixed by only changing the optimization level with the --optimize flag (rather than the -G flag).

Short answer: no. The -G flag is the needed workaround.

After changing Makefile.config so that NVCCFLAGS += --optimize 0 (or NVCCFLAGS += -O0) and removing the -O2 entry from COMMON_FLAGS in the Makefile (line 322) so as to avoid an error caused by repeating the flag, the same tests failed.

The same problem occurred when compiling under Gentoo Linux with
gcc 6.4.0
CUDA 9.1.85
glibc 2.26-r6
and Caffe compiled without Python support.

NVCCFLAGS += -G fixed it.

GPU: Nvidia GT 1030
Ubuntu 16.04, kernel 4.10.0-28-generic
Driver: 387.34
caffe: commit 864520713a4c5ffae7382ced5d34e4cadc608473
CUDNN: 7.0.5.15-1+cuda9.1
CuBLAS: 9.1.85.3-1
Cuda-NVCC: 9.1.85.2-1

4 tests failed:
[==========] 2199 tests from 285 test cases ran. (343419 ms total)
[ PASSED ] 2195 tests.
[ FAILED ] 4 tests, listed below:
[ FAILED ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[ FAILED ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
[ FAILED ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[ FAILED ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I tried adding the "-G" option to the Makefile, but it does not fix it for me; in any case, 3 tests failed.
I tried training ResNet-34 and ResNet-18 networks (on only 6 images, to make it faster) and then tried running them on the CPU. They do not work after training... But mnist works normally, and a simplified bottleneck ResNet-50 works too. I don't know whether this is related to these unit tests or not.

I rebuilt with CUDA 8.0 + cuDNN 6.0.21, with OpenCV disabled, and all tests passed.
(Before that I was using OpenCV 2.4 from the Ubuntu repo, not 3.3.1.)
Without OpenCV, of course, I don't have the ImageData layer, which uses imread and cv::Mat to load image files, and that is not good for me.

OK, I have solved all the test failures, and now training and running all my networks works fine, the same on CPU and GPU.
[ FAILED ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[ FAILED ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

These 2 failures were caused by the latest MKL version, 2018.2.199. When I replaced it with ATLAS, everything works fine and these 2 tests pass.
CUDA 8.0 + cuDNN 6.0.21 + ATLAS 3.10.2-9 works for me.
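
For reference, switching Caffe's BLAS backend is a one-line change in Makefile.config followed by a full rebuild; the sed below is just a sketch and assumes the line currently reads BLAS := mkl:

# point Caffe at ATLAS instead of MKL, then rebuild from scratch
sed -i 's/^BLAS := mkl/BLAS := atlas/' Makefile.config
make clean && make all && make test && make runtest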

Confirmed.
Debian Sid,
GCC 6 / CUDA 9.1 / NVIDIA driver 390.48

Has anyone tested CUDA 9.2?

@CDLuminate All tests passed with latest commit + CUDA 9.2 + gcc 7.3.1

@xkszltl Thanks. That means I can remove the temporary fix from Debian/Ubuntu's pre-built binary package once CUDA 9.2 is available. With -G enabled for nvcc, the performance drop looks significant ...

@CDLuminate
Don't....simply trust me....
Experience may vary by system and...luck...๑乛◡乛๑

BTW I'm on CentOS

Not working here with latest commit + libcudnn7 (7.1.4.18-1+cuda9.2) + cuda 9.2 + gcc (5.4)
=/

All tests passed with commit 864520713a4c5ffae7382ced5d34e4cadc608473 + CentOS 7.5.1804 + CUDA 9.2 + CUDNN 7.1 + gcc 4.8.5!

Not working for me. Details of the problem are in the following link:
https://github.com/BVLC/caffe/issues/6686

@meriem87 Your issue looks unrelated to this one.

$ make clean & make all & make test & make runtest

Are you root?

No, you aren't.
You are a normal user with GPU access.

Thanks. If I am a normal user with GPU access, I run into permission issues.
So I decided to run the commands one by one, and luckily everything is OK for me.
