Caffe: BatchReindexLayer fails GPU gradient tests under CUDA v9.1

Created on 11 Jan 2018  ·  27 Comments  ·  Source: BVLC/caffe

Your system configuration

Operating system: CentOS 7.4.1708
Compiler: x86_64-conda_cos6-linux-gnu-g++, gcc version 7.2.0 (crosstool-NG)
Graphics card: nVIDIA GeForce GTX 1070
CUDA version (if applicable): 9.1
CUDNN version (if applicable): 7.0.5
BLAS: openblas 0.2.20
Python or MATLAB version (for pycaffe and matcaffe respectively):
Anaconda 3 5.0.1 64-bit Python 3.6.4

Steps to reproduce

As described in this issue about crosstool-NG compiled libraries' compatibility problems, I use Anaconda's gxx_linux-64 7.2.0 compiler to compile Caffe (commit e93b5e20) on CentOS 7 with this
Makefile.config, these Anaconda packages (including libopenblas, leveldb, lmdb, opencv, protobuf, glog, gflags, py-boost, libboost, ...), and the following commands:

PATH=/cad/anaconda3/bin:/usr/bin make -j8
PATH=/cad/anaconda3/bin:/usr/bin make -j8 test
LD_LIBRARY_PATH=/usr/local/cuda/lib64 make runtest

However, make runtest failed at ./build/test/test_batch_reindex_layer.testbin with these error messages:

[----------] 2 tests from BatchReindexLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] BatchReindexLayerTest/3.TestForward
[       OK ] BatchReindexLayerTest/3.TestForward (3 ms)
[ RUN      ] BatchReindexLayerTest/3.TestGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.68212591193169037, which exceeds threshold_ * scale, where
computed_gradient evaluates to -0.68212591193169037,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.01.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835761; objective+ = 3.4152213335738013; objective- = 3.4152213335738013
...
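
To iterate on just this case without rerunning the whole suite, the test binary should accept googletest's standard filter flag; the pattern below is just an illustration:

# run only the gradient tests from this binary (assumes the standard googletest --gtest_filter flag)
LD_LIBRARY_PATH=/usr/local/cuda/lib64 ./build/test/test_batch_reindex_layer.testbin --gtest_filter='*TestGradient*'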

After countless Caffe compilations and tests, I finally found a workaround for this problem: I add the line NVCCFLAGS += -G to the Makefile, changing it from

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
endif
...

to

...
# Debugging
ifeq ($(DEBUG), 1)
        COMMON_FLAGS += -DDEBUG -g -O0
        NVCCFLAGS += -G
else
        COMMON_FLAGS += -DNDEBUG -O2
        NVCCFLAGS += -G
endif
...

Then I compile Caffe again... and make runtest passes without failures!

I think the problem is unrelated to Python and the x86_64-conda_cos6-linux-gnu-g++ compiler, but related to nvcc (CUDA 9.1), so it might be reproducible with other g++ compilers. In fact, I have seen someone report the same problem on Ubuntu 16.04 + CUDA 9.1. I hope someone can fix this.
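
For anyone building with CMake instead of the Makefile, the same workaround can probably be applied by injecting the flag through FindCUDA's CUDA_NVCC_FLAGS variable; this is an untested sketch, and the variable name is an assumption based on Caffe's CMake scripts relying on FindCUDA:

# untested sketch of the CMake-based equivalent of NVCCFLAGS += -G
mkdir -p build && cd build
cmake -DCUDA_NVCC_FLAGS=-G ..
make -j8 && make runtest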

bug

Most helpful comment

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

All 27 comments

Confirmed on a standard Ubuntu 16.04 build both by myself (with GCC 5.4.0 and NVCC 9.1.85) and others: first in #6140, but also on caffe-users (thread1, thread2, thread3, thread 4).

Your workaround is to add a -G flag to NVCC even for the standard build, correct? This flag causes generation of debug information for GPU code and disables all optimizations [ref] - the latter effect seems more relevant.

Hi Noiredd,

Your workaround is to add a -G flag to NVCC even for the standard build, correct?

Yes.

Additionally, I found another runtest failure which is unrelated to NVCCFLAGS += -G but related to OpenCV 3.3.1: if I enable OpenCV (by commenting out USE_OPENCV := 0), make runtest fails at
./build/test/test_net.testbin with these error messages:

Cuda number of devices: 1
Current device id: 0
Current device name: GeForce GTX 1070
[==========] Running 124 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 26 tests from NetTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] NetTest/0.TestHasBlob
[       OK ] NetTest/0.TestHasBlob (593 ms)
[ RUN      ] NetTest/0.TestGetBlob
[       OK ] NetTest/0.TestGetBlob (2 ms)
...
[ RUN      ] NetTest/0.TestSharedWeightsResume
[       OK ] NetTest/0.TestSharedWeightsResume (0 ms)
[ RUN      ] NetTest/0.TestParamPropagateDown
[       OK ] NetTest/0.TestParamPropagateDown (1 ms)
[ RUN      ] NetTest/0.TestFromTo
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
src/caffe/test/test_net.cpp:1446: Failure
Value of: *loss_ptr
  Actual: 6.95498
Expected: loss
Which is: 6.94028
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float> (3 ms)
[ RUN      ] NetTest/0.TestReshape
...
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[  FAILED  ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
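
Since these TestFromTo failures only show up with OpenCV 3.3.1 enabled, one quick check is which OpenCV libraries the test binary actually links against (helpful when a system OpenCV 2.4 and a conda OpenCV 3.x coexist):

# list the OpenCV shared libraries resolved for the failing test binary
ldd ./build/test/test_net.testbin | grep -i opencv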

Confirmed, the following tests fail on CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

@MrMYHuang's suggestion worked. You have to add NVCCFLAGS += -G to the Makefile and run

$ make clean & make all & make test & make runtest
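
Note that a single & puts each step in the background, so the four commands run concurrently rather than one after another; to chain them sequentially and stop at the first failure, the usual form is:

$ make clean && make all && make test && make runtest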

I submitted a bug report to NVIDIA. An NVIDIA staff member replied that the CUDA development team has identified this CUDA 9.1 issue and plans to fix it in the next release. For now, they suggest using CUDA 9.0.
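
If you do switch toolkits, one quick sanity check is to confirm which nvcc the build actually picks up, since several CUDA installs often coexist:

which nvcc && nvcc --version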

I also had this problem.

CUDA 9.1 + cuDNN 7.0.5 + caffe [ FAILED ] 1 test. mnist OK. my net failed
CUDA 9.0 + cuDNN 7.0.5 + caffe [ FAILED ] 2 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 7.0.5 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 6.0.21 + caffe [ PASSED ] 2123 tests. mnist OK. my net failed
CUDA 8.0 + cuDNN 5.0.5 + caffe-rc5 [ PASSED ] 2123 tests. mnist OK. my net OK.
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 [ PASSED ] 2123 tests. mnist OK. my net OK.

GTX 1080, Ubuntu 16.04.3, Driver Version: 387.34, i7 980X 3.3GHz, P6T-SE, RAM 6GB

CUDA 8.0 + cuDNN 7.0.5 passed "make runtest" and passed Caffe's mnist training.
But my net's training failed: its accuracy hit 100% after around 300 training iterations.
Training on the CPU was OK. I gave up on the latest Caffe. In the end,
CUDA 8.0 + cuDNN 6.0.21 + caffe-1.0 was OK.

My net is for computer Go; it predicts the next move.
It has 12 conv layers, 128 channels, kernel_size 3, and no batch normalization.

Confirmed, the following tests fail on Ubuntu 17.10, CUDA 9.1 and CUDNN 7.

[  FAILED  ] 2 tests, listed below:
[  FAILED  ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I was able to pass the tests by following @MrMYHuang's suggestion to add the NVCCFLAGS entry.

I am also using CUDA 9.1 and cuDNN 7.0.5 and can confirm this failure. I actually came here to post another test failure I had, but when I was about to, I noticed that disabling multi-GPU fixed that failure and exposed this one. I will post that in a separate issue, though.

edit: actually, after unsetting the CUDA_VISIBLE_DEVICES variable, the other issue I am referring to oddly no longer occurs. I won't open an issue for it until I can get the log generated again; I might not have re-enabled multi-GPU support properly.

Unfortunately, even with the latest nvcc patch 2 released, the problem with
BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>,
still persists.

The issue still exists with a physical machine + single Pascal GPU + CentOS 7 + nvcc 9.1.85 + cuDNN 7.0.5.

Issue exists for me too:
Physical Machine
Ubuntu 16.04
Nvidia drivers: 390.48
CUDA: 9.1.85 + Patch 1,2,3
cuDNN: v7.1.2

Got:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

After adding "NVCCFLAGS += -G" as the OP suggested, there are no errors and all tests pass.

But what does adding this flag mean for us?
That is, are optimizations only disabled during make, or in the resulting binaries as well?
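
For what it's worth, -G is a compile-time flag, but its effect is baked into the built binaries: the device code is generated without optimizations, so it stays slower at runtime, not just during the build. One way to gauge the cost on your own setup is Caffe's built-in timing tool, run against builds made with and without the flag (the model path below is just the bundled AlexNet example; any prototxt whose inputs are defined in the file should do):

# rough GPU timing comparison between a -G build and a normal build
./build/tools/caffe time -model models/bvlc_alexnet/deploy.prototxt -iterations 50 -gpu 0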

I too had this error with @evilmtv's setup (except Ubuntu 14.04). I wanted to try following @Noiredd's suggestion and see whether this problem could be fixed by only changing the optimization level with the --optimize flag (rather than the -G flag).

Short answer: no. The -G flag is the needed workaround.

After changing Makefile.config so that NVCCFLAGS += --optimize 0 (or NVCCFLAGS += -O0) and removing the -O2 entry from COMMON_FLAGS in the Makefile (line 322) so as to avoid an error caused by repeating the flag, the same tests failed.

The same problem occurred when compiling under Gentoo Linux with
gcc 6.4.0
CUDA 9.1.85
glibc 2.26-r6
and Caffe compiled without Python support.

NVCCFLAGS += -G fixed it.

GPU: Nvidia GT 1030
Ubuntu 16.04, kernel 4.10.0-28-generic
Driver: 387.34
caffe: commit 864520713a4c5ffae7382ced5d34e4cadc608473
CUDNN: 7.0.5.15-1+cuda9.1
CuBLAS: 9.1.85.3-1
Cuda-NVCC: 9.1.85.2-1

4 tests failed:
[==========] 2199 tests from 285 test cases ran. (343419 ms total)
[ PASSED ] 2195 tests.
[ FAILED ] 4 tests, listed below:
[ FAILED ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[ FAILED ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>
[ FAILED ] BatchReindexLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float>
[ FAILED ] BatchReindexLayerTest/3.TestGradient, where TypeParam = caffe::GPUDevice<double>

I tried adding the "-G" option to the Makefile, but it does not fix it for me; in any case, 3 tests failed.
I tried training ResNet-34 and ResNet-18 networks (on only 6 images, to make it faster) and then tried running them on the CPU. They do not work after training... But mnist works normally, and a simplified bottleneck ResNet-50 works too. I don't know whether this is related to these unit tests or not.

I rebuilt with CUDA 8.0 + cuDNN 6.0.21, with OpenCV disabled, and all tests passed.
(Before that I was using OpenCV 2.4 from the Ubuntu repo, not 3.3.1.)
Without OpenCV, of course, I don't have the ImageData layer, which uses imread and cv::Mat to load image files, and that is not good for me.

OK, I have solved all the test failures, and now training and running all my networks works fine, the same on CPU and GPU.
[ FAILED ] NetTest/0.TestFromTo, where TypeParam = caffe::CPUDevice<float>
[ FAILED ] NetTest/1.TestFromTo, where TypeParam = caffe::CPUDevice<double>

These 2 failures were caused by the latest MKL version, 2018.2.199. When I replaced it with ATLAS, everything works fine and these 2 tests pass.
CUDA 8.0 + cuDNN 6.0.21 + ATLAS 3.10.2-9 works for me.
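
For reference, switching Caffe's BLAS backend is a one-line change in Makefile.config followed by a full rebuild; the sed below is just a sketch and assumes the line currently reads BLAS := mkl:

# point Caffe at ATLAS instead of MKL, then rebuild from scratch
sed -i 's/^BLAS := mkl/BLAS := atlas/' Makefile.config
make clean && make all && make test && make runtest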

Confirmed.
Debian Sid,
GCC 6 / CUDA 9.1 / NVIDIA driver 390.48

Has anyone tested CUDA 9.2?

@CDLuminate All tests passed with latest commit + CUDA 9.2 + gcc 7.3.1

@xkszltl Thanks. That means I can remove the temporary fix from Debian/Ubuntu's pre-built binary package once CUDA 9.2 is available. With -G enabled for nvcc, the performance drop looks significant ...

@CDLuminate
Don't....simply trust me....
Experience may vary by system and...luck...๑乛◡乛๑

BTW I'm on CentOS

Not working here with latest commit + libcudnn7 (7.1.4.18-1+cuda9.2) + cuda 9.2 + gcc (5.4)
=/

All tests passed with commit 864520713a4c5ffae7382ced5d34e4cadc608473 + CentOS 7.5.1804 + CUDA 9.2 + CUDNN 7.1 + gcc 4.8.5!

Not working for me. Details of the problem are in the following link:
https://github.com/BVLC/caffe/issues/6686

@meriem87 Your issue looks unrelated to this one.

$ make clean & make all & make test & make runtest

Are you root?

No, you aren't.
You are a normal user with GPU access.

Thanks. If I am a normal user with GPU access, I run into permission issues.
So I decided to run the commands one by one, and luckily everything is OK for me.
