Ray: no CUDA-capable device is detected

Created on 7 Nov 2018  ·  25 Comments  ·  Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): docker Ubuntu 16.04 image
  • Ray installed from (source or binary): pip
  • Ray version: 0.5.3
  • Python version: Python 3.5.6 :: Anaconda, Inc.
  • Exact command to reproduce:

Describe the problem


I am trying to set up an RLlib PPO agent with husky_env from Gibson Env.
The script I ran can be found here.

I am getting the following Error when calling agent.train():

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=74 error=38 : no CUDA-capable device is detected

Gibson renders the environment upon environment creation, and the RLlib agent seems to invoke env_creator every time train() is called. I originally thought that was the issue, but I don't think that is the case.
I also tried using gpu_fraction, which didn't work. I'm not sure what is causing the problem.

nvidia-smi

root@e6b154065e88:~/mount/gibson/examples/train# nvidia-smi
Wed Nov  7 09:59:00 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:04:00.0  On |                  N/A |
| 22%   42C    P8    20W / 250W |   2385MiB / 12198MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

torch.cuda.device_count()

root@e6b154065e88:~# python -c "import torch
print(torch.cuda.device_count())
print(torch.cuda.current_device())"
1
0

nvcc --version

root@e6b154065e88:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

To Reproduce

Get Nvidia-Docker2

https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

```
# Ubuntu installation
sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd
```
Download Gibson's dataset

```
wget https://storage.googleapis.com/gibsonassets/dataset.tar.gz
tar -zxf dataset.tar.gz
```

Pull Gibson's image

```
docker pull xf1280/gibson:0.3.1
```

Run it in Docker

Replace `<dataset-absolute-path>` with the absolute path to the Gibson dataset you've unzipped on your local machine:

```
docker run --runtime=nvidia -ti --name gibson -v <dataset-absolute-path>:/root/mount/gibson/gibson/assets/dataset -p 5001:5001 xf1280/gibson:0.3.1
```

Add in the ray_husky.py script

Copy the [`ray_husky.py`](https://github.com/jhpenger/GibsonEnv/blob/master/examples/train/ray_husky.py) script to the `~/mount/gibson/examples/train/` directory in the Docker container.

Run: `python ray_husky.py`






Full Log

```

root@e6b154065e88:~/mount/gibson/examples/train# python test.py
Unexpected end of /proc/mounts line overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4IFU7EUC3V2BOPDL2NFLW6T7BY:/var/lib/docker/overlay2/l/3GWVT6ULAU6NJP6MLTBNN56WBQ:/var/lib/docker/overlay2/l/CLLJDJFTZ2FMCKCN6B3WMCSXKG:/var/lib/docker/overlay2/l/QCO5RAE5DXB7MGGYLTK3YULY2O:/var/lib/docker/overlay2/l/NFJ7MEC3G7XLHLZMZWKKHLIM5Y:/var/lib/docker/overlay2/l/3LGFVLYHAWSN7GNAOYGCWVQK3Y:/var/lib/docker/overlay2/l/Q2BQDGXUX3SFP3RQYQDXOPWPSD:/var/lib/docker/overlay2/l/O5I6APSGOJZV4RFU7EOXVT5BWD:/var/lib/docker/overlay2/l/E4DOAELV7FPI6' Unexpected end of /proc/mounts line7XTB5ASEF7ESL:/var/lib/docker/overlay2/l/4BPII7VWNXTHZDYHMZQQ47WVGK:/var/lib/docker/overlay2/l/5RZ3I4FBOEGIAACNUMNPNJIIMM:/var/lib/docker/overlay2/l/JUDMTQV6ZO3CYJ64OCHUEOIDS4:/var/lib/docker/overlay2/l/WXFZP4STEX7JZ5S5VQCQR2MTDB:/var/lib/docker/overlay2/l/MUODDE6AS2PD6QOD6BXFE5JWN4:/var/lib/docker/overlay2/l/NV2EHBVA5EICRKTEGR3F4NADEC:/var/lib/docker/overlay2/l/MZVP7SBXRC7X7IKJKYHYQK6YOK:/var/lib/docker/overlay2/l/SVE4WWKXOSQOO2O3QQDMHW5TVB:/var/lib/docker/overlay2/l/NDRFI4BJ3ZGXEYSVAABQB6Z2OQ:/var/lib/do'
Unexpected end of /proc/mounts line cker/overlay2/l/YTU432I3FDCY7GE4NT5VVR47GN:/var/lib/docker/overlay2/l/VCTBKUJHFQQQTCZRSPPZQKDIDZ:/var/lib/docker/overlay2/l/TR4DD4VR545GC7WIKUS5UDNRSM:/var/lib/docker/overlay2/l/BFRVMK6XAWSUK4JFRBYEOWQA4B:/var/lib/docker/overlay2/l/DLRGX3CDMNWDK66CSZZNXMTRTP:/var/lib/docker/overlay2/l/IPOZCPD7GVR3P3ECGOTQWPJ737:/var/lib/docker/overlay2/l/X6WEEMZQY3LGKMQELCNCCWVVHH:/var/lib/docker/overlay2/l/7APKFGZZGMNJ7BXSRL7A3WFVI6:/var/lib/docker/overlay2/l/PE6OSOUQSWBVJMTELFCNCFEG7X:/var/lib/docker/overlay2/l/FHHGDNFDT' Unexpected end of /proc/mounts lineA32ESWYKQJTKH77LR:/var/lib/docker/overlay2/l/VEP2IVXB7LSMARPAJOF2SGEWTA:/var/lib/docker/overlay2/l/EAPK6KKCRU7YHHL6QVKDLQKSAH:/var/lib/docker/overlay2/l/5SZECZZ64ECDDARDWCQ2QOH2PY:/var/lib/docker/overlay2/l/XAL23ADNRDHSDATFJJSD3HA5T2:/var/lib/docker/overlay2/l/V7MN4H5N26LKKYRY4JGORHE4PI:/var/lib/docker/overlay2/l/3E3ILIVYCBQ52OYJLKCSZXAYPD:/var/lib/docker/overlay2/l/B4GW3N34A6DMEUWEO24TKYCJIW:/var/lib/docker/overlay2/l/XM3K5GW7VB5HRODVU7CTK5HUGD:/var/lib/docker/overlay2/l/7QHY2DH3GUNNMTOYULZIOK6F6O:/var/li'
pybullet build time: Sep 27 2018 00:17:23
pygame 1.9.4
Hello from the pygame community. https://www.pygame.org/contribute.html
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:46828 to respond...
Waiting for redis server at 127.0.0.1:15517 to respond...
Warning: Reducing object store memory because /dev/shm has only 67104768 bytes available. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 0.00 GB memory.
Starting local scheduler with the following resources: {'CPU': 32, 'GPU': 1}.
Failed to start the UI, you may need to run 'pip install jupyter'.
Created LogSyncer for /root/ray_results/PPO_test_2018-11-07_09-49-37kxrhxuku -> None
/root/mount/gibson/examples/train/../configs/husky_navigate_rgb_train.yaml
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
Processing the data:
Total 1 scenes 0 train 1 test
Indexing
0%| | 0/1 [00:00 Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=GeForce GTX TITAN X/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 410.73
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
finish loading shaders
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00, 1.99it/s]
9%|###############7 | 18/190 [00:01<02:14, 1.28it/s]terminate called after throwing an instance of 'zmq::error_t'
what(): Address already in use
100%|#####################################################################################################################################################################| 190/190 [00:12<00:00, 16.75it/s]
/root/mount/gibson/gibson/core/render/pcrender.py:204: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
self.imgv = Variable(torch.zeros(1, 3 , self.showsz, self.showsz), volatile = True).cuda()
/root/mount/gibson/gibson/core/render/pcrender.py:205: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
self.maskv = Variable(torch.zeros(1,2, self.showsz, self.showsz), volatile = True).cuda()
Episode: steps:0 score:0
Episode count: 0
/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Episode: steps:0 score:0
Episode count: 1
LocalMultiGPUOptimizer devices ['/gpu:0']
Unexpected end of /proc/mounts line overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4IFU7EUC3V2BOPDL2NFLW6T7BY:/var/lib/docker/overlay2/l/3GWVT6ULAU6NJP6MLTBNN56WBQ:/var/lib/docker/overlay2/l/CLLJDJFTZ2FMCKCN6B3WMCSXKG:/var/lib/docker/overlay2/l/QCO5RAE5DXB7MGGYLTK3YULY2O:/var/lib/docker/overlay2/l/NFJ7MEC3G7XLHLZMZWKKHLIM5Y:/var/lib/docker/overlay2/l/3LGFVLYHAWSN7GNAOYGCWVQK3Y:/var/lib/docker/overlay2/l/Q2BQDGXUX3SFP3RQYQDXOPWPSD:/var/lib/docker/overlay2/l/O5I6APSGOJZV4RFU7EOXVT5BWD:/var/lib/docker/overlay2/l/E4DOAELV7FPI6' Unexpected end of /proc/mounts line7XTB5ASEF7ESL:/var/lib/docker/overlay2/l/4BPII7VWNXTHZDYHMZQQ47WVGK:/var/lib/docker/overlay2/l/5RZ3I4FBOEGIAACNUMNPNJIIMM:/var/lib/docker/overlay2/l/JUDMTQV6ZO3CYJ64OCHUEOIDS4:/var/lib/docker/overlay2/l/WXFZP4STEX7JZ5S5VQCQR2MTDB:/var/lib/docker/overlay2/l/MUODDE6AS2PD6QOD6BXFE5JWN4:/var/lib/docker/overlay2/l/NV2EHBVA5EICRKTEGR3F4NADEC:/var/lib/docker/overlay2/l/MZVP7SBXRC7X7IKJKYHYQK6YOK:/var/lib/docker/overlay2/l/SVE4WWKXOSQOO2O3QQDMHW5TVB:/var/lib/docker/overlay2/l/NDRFI4BJ3ZGXEYSVAABQB6Z2OQ:/var/lib/do'
Unexpected end of /proc/mounts line cker/overlay2/l/YTU432I3FDCY7GE4NT5VVR47GN:/var/lib/docker/overlay2/l/VCTBKUJHFQQQTCZRSPPZQKDIDZ:/var/lib/docker/overlay2/l/TR4DD4VR545GC7WIKUS5UDNRSM:/var/lib/docker/overlay2/l/BFRVMK6XAWSUK4JFRBYEOWQA4B:/var/lib/docker/overlay2/l/DLRGX3CDMNWDK66CSZZNXMTRTP:/var/lib/docker/overlay2/l/IPOZCPD7GVR3P3ECGOTQWPJ737:/var/lib/docker/overlay2/l/X6WEEMZQY3LGKMQELCNCCWVVHH:/var/lib/docker/overlay2/l/7APKFGZZGMNJ7BXSRL7A3WFVI6:/var/lib/docker/overlay2/l/PE6OSOUQSWBVJMTELFCNCFEG7X:/var/lib/docker/overlay2/l/FHHGDNFDT' Unexpected end of /proc/mounts lineA32ESWYKQJTKH77LR:/var/lib/docker/overlay2/l/VEP2IVXB7LSMARPAJOF2SGEWTA:/var/lib/docker/overlay2/l/EAPK6KKCRU7YHHL6QVKDLQKSAH:/var/lib/docker/overlay2/l/5SZECZZ64ECDDARDWCQ2QOH2PY:/var/lib/docker/overlay2/l/XAL23ADNRDHSDATFJJSD3HA5T2:/var/lib/docker/overlay2/l/V7MN4H5N26LKKYRY4JGORHE4PI:/var/lib/docker/overlay2/l/3E3ILIVYCBQ52OYJLKCSZXAYPD:/var/lib/docker/overlay2/l/B4GW3N34A6DMEUWEO24TKYCJIW:/var/lib/docker/overlay2/l/XM3K5GW7VB5HRODVU7CTK5HUGD:/var/lib/docker/overlay2/l/7QHY2DH3GUNNMTOYULZIOK6F6O:/var/li'
pybullet build time: Sep 27 2018 00:17:23
pygame 1.9.4
Hello from the pygame community. https://www.pygame.org/contribute.html
/root/mount/gibson/examples/train/../configs/husky_navigate_rgb_train.yaml
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
Processing the data:
Total 1 scenes 0 train 1 test
Indexing
0%| | 0/1 [00:00 Loaded EGL 1.5 after reload.
GL_VENDOR=NVIDIA Corporation
GL_RENDERER=GeForce GTX TITAN X/PCIe/SSE2
GL_VERSION=4.6.0 NVIDIA 410.73
GL_SHADING_LANGUAGE_VERSION=4.60 NVIDIA
finish loading shaders
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00, 1.74it/s]
11%|#################4 | 20/190 [00:02<00:47, 3.56it/s]terminate called after throwing an instance of 'zmq::error_t'
what(): Address already in use
100%|#####################################################################################################################################################################| 190/190 [00:12<00:00, 16.88it/s]
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=74 error=38 : no CUDA-capable device is detected
Remote function __init__ failed with:

Traceback (most recent call last):
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 945, in _process_task
*arguments)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 178, in __init__
self.env = env_creator(env_context)
File "w.py", line 36, in
register_env(env_name, lambda _ : getGibsonEnv())
File "w.py", line 29, in getGibsonEnv
config=config_file)
File "/root/mount/gibson/gibson/envs/husky_env.py", line 40, in __init__
self.robot_introduce(Husky(self.config, env=self))
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 349, in robot_introduce
self.setup_rendering_camera()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 376, in setup_rendering_camera
self.setup_camera_pc()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 636, in setup_camera_pc
env = self)
File "/root/mount/gibson/gibson/core/render/pcrender.py", line 172, in __init__
comp = torch.nn.DataParallel(comp).cuda()
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
param.data = fn(param.data)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:74

Remote function set_global_vars failed with:

Traceback (most recent call last):
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 923, in _process_task
self.reraise_actor_init_error()
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 267, in reraise_actor_init_error
raise self.actor_init_error
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 945, in _process_task
*arguments)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 178, in __init__
self.env = env_creator(env_context)
File "w.py", line 36, in
register_env(env_name, lambda _ : getGibsonEnv())
File "w.py", line 29, in getGibsonEnv
config=config_file)
File "/root/mount/gibson/gibson/envs/husky_env.py", line 40, in __init__
self.robot_introduce(Husky(self.config, env=self))
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 349, in robot_introduce
self.setup_rendering_camera()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 376, in setup_rendering_camera
self.setup_camera_pc()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 636, in setup_camera_pc
env = self)
File "/root/mount/gibson/gibson/core/render/pcrender.py", line 172, in __init__
comp = torch.nn.DataParallel(comp).cuda()
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
param.data = fn(param.data)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:74

killing
File "w.py", line 68, in
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/agents/agent.py", line 233, in train
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/utils/filter_manager.py", line 25, in synchronize
Remote function get_filters failed with:

Traceback (most recent call last):
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 923, in _process_task
self.reraise_actor_init_error()
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 267, in reraise_actor_init_error
raise self.actor_init_error
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 923, in _process_task
self.reraise_actor_init_error()
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 267, in reraise_actor_init_error
raise self.actor_init_error
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 945, in _process_task
*arguments)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 178, in __init__
self.env = env_creator(env_context)
File "w.py", line 36, in
register_env(env_name, lambda _ : getGibsonEnv())
File "w.py", line 29, in getGibsonEnv
config=config_file)
File "/root/mount/gibson/gibson/envs/husky_env.py", line 40, in __init__
self.robot_introduce(Husky(self.config, env=self))
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 349, in robot_introduce
self.setup_rendering_camera()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 376, in setup_rendering_camera
self.setup_camera_pc()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 636, in setup_camera_pc
env = self)
File "/root/mount/gibson/gibson/core/render/pcrender.py", line 172, in __init__
comp = torch.nn.DataParallel(comp).cuda()
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
param.data = fn(param.data)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:74
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 2514, in get

RayGetError: Could not get objectid ObjectID(4a7d420ef7de86cb813dcb59e2ebc4ece375f9d7). It was created by remote function get_filters which failed with:

Remote function get_filters failed with:

Traceback (most recent call last):
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 923, in _process_task
self.reraise_actor_init_error()
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 267, in reraise_actor_init_error
raise self.actor_init_error
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 923, in _process_task
self.reraise_actor_init_error()
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 267, in reraise_actor_init_error
raise self.actor_init_error
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/worker.py", line 945, in _process_task
*arguments)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/actor.py", line 261, in actor_method_executor
method_returns = method(actor, *args)
File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 178, in __init__
self.env = env_creator(env_context)
File "w.py", line 36, in
register_env(env_name, lambda _ : getGibsonEnv())
File "w.py", line 29, in getGibsonEnv
config=config_file)
File "/root/mount/gibson/gibson/envs/husky_env.py", line 40, in __init__
self.robot_introduce(Husky(self.config, env=self))
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 349, in robot_introduce
self.setup_rendering_camera()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 376, in setup_rendering_camera
self.setup_camera_pc()
File "/root/mount/gibson/gibson/envs/env_modalities.py", line 636, in setup_camera_pc
env = self)
File "/root/mount/gibson/gibson/core/render/pcrender.py", line 172, in __init__
comp = torch.nn.DataParallel(comp).cuda()
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
return self._apply(lambda t: t.cuda(device))
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
module._apply(fn)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
param.data = fn(param.data)
File "/miniconda/envs/py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:74

I1107 09:50:22.214844 9899 local_scheduler.cc:178] Killed worker pid 13341 which hadn't started yet.

```

Labels: question, rllib

Most helpful comment

Yes, so the issue was that CUDA_VISIBLE_DEVICES was being unset from the environment (somehow). Putting os.environ('CUDA_VISIBLE_DEVICES') = '0' fixed the issue.
Thanks everyone!

All 25 comments

Hey @jhpenger, this is because by default we use CPUs only for policy evaluation. Is it necessary to allocate GPUs for the Gibson env to run?

That said, you can allocate GPUs for workers too by setting this conf: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ppo/ppo.py#L53
This should work with a fraction too.

Alternatively, you can set num_workers: 0; then the env will run on the driver only and share the GPUs allocated via the num_gpus conf.
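For reference, here is a minimal sketch of how these settings might be passed to a PPO agent. The exact config keys and whether fractional GPU values are accepted depend on the Ray/RLlib version, and "CartPole-v0" is only a stand-in for the registered Gibson env name:

```
import ray
from ray.rllib.agents import ppo

ray.init(num_gpus=1)

config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = 0   # env lives on the driver; no remote evaluators
config["num_gpus"] = 1      # GPUs allocated to the driver/learner
# If you do run remote workers and each env copy needs a GPU, give each one
# a share instead (fractional values depend on the Ray version):
# config["num_workers"] = 1
# config["num_gpus_per_worker"] = 0.5

# "CartPole-v0" is a placeholder for the env name you registered.
agent = ppo.PPOAgent(config=config, env="CartPole-v0")
agent.train()
```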

@ericl num_workers: 0 worked.
How do I set policy evaluation to use only CPUs?

"Is it necessary to allocate GPUs for the Gibson env to run?"

Gibson needs a GPU to render the environment on creation, and I believe it needs one to run as well.

After changing to num_workers: 0, I now get the following:

killing <subprocess.Popen object at 0x7fe4dd2fc9e8>
   File "ray_husky.py", line 68, in <module>
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/agents/agent.py", line 235, in train
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/tune/trainable.py", line 143, in train
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/agents/ppo/ppo.py", line 123, in _train
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 104, in step
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/policy_evaluator.py", line 303, in sample
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/sampler.py", line 58, in get_data
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/evaluation/sampler.py", line 279, in _env_runner
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/utils/filter.py", line 217, in __call__
   File "/miniconda/envs/py35/lib/python3.5/site-packages/ray/rllib/utils/filter.py", line 78, in push
 AssertionError: x.shape = (), self.shape = (128, 128, 4)

How do I set policy evaluation to use only CPUs?

It sounds like in your case this won't work, since policy evaluation will create a copy of your environment. So you need to allocate GPUs via num_gpus_per_worker: N where N could be 1 or a fraction.

 AssertionError: x.shape = (), self.shape = (128, 128, 4)

This means that your env is returning a scalar observation when it expected a shape of (128, 128, 4). Maybe check your env step/reset() return values, and also that env.observation_space.contains(obs) is true for the obs you return?
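A quick way to sanity-check this outside of RLlib is to test the env's reset() and step() outputs directly against its observation space. This is only a sketch assuming a standard Gym env; "CartPole-v0" is a placeholder for the Gibson env:

```
import gym

# Placeholder env; substitute the Gibson env constructor you register.
env = gym.make("CartPole-v0")

obs = env.reset()
assert env.observation_space.contains(obs), \
    "reset() returned an observation outside observation_space: %r" % (obs,)

obs, reward, done, info = env.step(env.action_space.sample())
assert env.observation_space.contains(obs), \
    "step() returned an observation outside observation_space: %r" % (obs,)
```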

Thanks a lot, that helped.
I got it working now (Gibson returns a dictionary instead of an array from its step() and reset(), since it returns multiple observations per step: depth field, RGB, etc.).
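For anyone hitting the same mismatch, one way to do that conversion is a small observation wrapper. This is only a sketch under the assumption that the env exposes a gym Dict observation space with an "rgb" key; the class name and key are hypothetical:

```
import gym
import numpy as np

class RGBOnlyWrapper(gym.ObservationWrapper):
    """Keep only the 'rgb' entry of a dict observation so the agent
    sees a plain array instead of a dictionary."""

    def __init__(self, env):
        super(RGBOnlyWrapper, self).__init__(env)
        # Assumes the wrapped env has a Dict observation_space with an
        # "rgb" key; adjust the key to the modality you train on.
        self.observation_space = env.observation_space.spaces["rgb"]

    def observation(self, obs):
        return np.asarray(obs["rgb"])
```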

@ericl
I'd like some clarification on num_workers. Does num_workers: n mean n additional workers rather than n workers in total? It seems to spin up n + 1 workers.
I want to train with exactly 1 worker right now, since I'm testing on a machine with only 1 GPU and each Gibson Env requires a GPU. But num_workers = 1 creates worker0 and worker1, causing that CUDA error. I'm just a bit confused about what num_workers: 0 means, since it trains fine. Is it equal to my idea of 1 worker?

I will work on this more in a few days; I might have more questions.

n workers total. You're probably seeing the additional CPU used for the driver, which is a separate process.
If you want to train with 1 env, you should use num_workers: 0. Some algorithms like A3C require a positive number of workers, but others like PPO are fine with 0. In that case RLlib runs in a single process only.

Btw, could you share more details on how you were able to get AssertionError: x.shape = ()? I'd like to add a better warning for that.

Edit: Actually, I think this should be fixed in master, since we now support DictSpace.

@ericl Sorry for the late response. I fixed it by manually changing the Gibson Environment's output from a dict to an array.

Edit: Actually, I think this should be fixed in master, since we now support DictSpace.

It would be great if the current Ray supports DictSpace. How recently was this added? It wasn't available in the version of Ray I was running.
I have not tested it with the newest Ray version yet, since I am getting some errors, which I have not looked into, even after rolling back the redis version to fix the compatibility issue caused by PR3333.

@ericl I think I know what the problem was before. In the older Ray version, gpu_fraction only set the GPU resource for the driver. And I think alg = ppo.PPOAgent(config=config, env=env_name) spins up all the environments in parallel; so the driver was using a fractional GPU while the workers were each using an entire GPU. That's why I was getting the CUDA "no device" error.

  • I was able to run multiple Gibson Envs with remote functions when I specified appropriate fractional GPU resources.

I'm trying to use xray right now, which has num_gpus and num_gpus_per_worker.
Although the documentation says gpu_fraction is deprecated and I can now set fractional GPU resources under num_gpus, I am getting an error saying that num_gpus must be an integer.
How do I fix this?

killing <subprocess.Popen object at 0x7f349127aa58>
   File "test.py", line 68, in <module>
   File "/root/ray/python/ray/rllib/agents/agent.py", line 297, in __init__
   File "/root/ray/python/ray/tune/trainable.py", line 87, in __init__
   File "/root/ray/python/ray/rllib/agents/agent.py", line 344, in _setup
   File "/root/ray/python/ray/rllib/agents/ppo/ppo.py", line 85, in _init
   File "/root/ray/python/ray/rllib/optimizers/policy_optimizer.py", line 54, in __init__
   File "/root/ray/python/ray/rllib/optimizers/multi_gpu_optimizer.py", line 47, in _init
 TypeError: 'float' object cannot be interpreted as an integer

#3394

@ericl btw, I finally got around to testing whether Ray accepts a dictionary of observations; it doesn't.
The updated error message is definitely better though; it outputs the entire dictionary that Ray is not accepting.
I'm using your frac_ppo branch version of ray.

Can you post your script? Try following the examples in this test:
https://github.com/ray-project/ray/blob/master/python/ray/rllib/test/test_nested_spaces.py


Note in particular that you have to implement _build_layers_v2.

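For context, here is a rough sketch of what such a custom model could look like with the ModelV1 _build_layers_v2 hook mentioned above. The class name, layer sizes, and observation keys are hypothetical, and the exact API depends on the Ray version; see the linked test_nested_spaces.py for the authoritative examples:

```
import tensorflow as tf
from ray.rllib.models import Model, ModelCatalog

class MyDictModel(Model):
    """Hypothetical custom model that flattens a Dict observation."""

    def _build_layers_v2(self, input_dict, num_outputs, options):
        # For Dict observation spaces, input_dict["obs"] is a dict of
        # tensors keyed by the space's keys (e.g. "rgb", "depth").
        obs = input_dict["obs"]
        flat = tf.concat(
            [tf.layers.flatten(obs[k]) for k in sorted(obs)], axis=1)
        hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
        output = tf.layers.dense(hidden, num_outputs, activation=None)
        return output, hidden

ModelCatalog.register_custom_model("my_dict_model", MyDictModel)
# Then pass {"model": {"custom_model": "my_dict_model"}} in the agent config.
```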

I don't know if the issue has been completely solved, but since it is marked as open, I will write here. The command rllib train --run PG --env CartPole-v0 --config='{"use_pytorch":true,"num_gpus":1,"num_workers":0}', when executed, returns a no CUDA-capable device is detected error. Here is the detailed error:

2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002: Traceback (most recent call last):
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/tune/trial_runner.py", line 443, in _process_trial
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     result = self.trial_executor.fetch_result(trial)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/tune/ray_trial_executor.py", line 315, in fetch_result
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     result = ray.get(trial_future[0])
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/worker.py", line 2192, in get
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     raise value
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002: ray.exceptions.RayTaskError: ray_worker (pid=211, host=container-e559-1557431881457-79101-01-000002)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/trainer.py", line 293, in __init__
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     Trainable.__init__(self, config, logger_creator)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/tune/trainable.py", line 88, in __init__
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     self._setup(copy.deepcopy(self.config))
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/trainer.py", line 393, in _setup
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     self._init(self.config, self.env_creator)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/pg/pg.py", line 45, in _init
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     env_creator, policy_cls)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/trainer.py", line 591, in make_local_evaluator
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     extra_config or {}))
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/trainer.py", line 810, in _make_evaluator
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     _fake_sampler=config.get("_fake_sampler", False))
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/evaluation/policy_evaluator.py", line 324, in __init__
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     policy_dict, policy_config)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/evaluation/policy_evaluator.py", line 728, in _build_policy_map
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     policy_map[name] = cls(obs_space, act_space, merged_conf)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/agents/pg/torch_pg_policy_graph.py", line 69, in __init__
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     action_distribution_cls=dist_class)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/workspace/ray/python/ray/rllib/evaluation/torch_policy_graph.py", line 61, in __init__
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     self._model = model.to(self.device)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 379, in to
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     return self._apply(convert)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     module._apply(fn)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     module._apply(fn)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     module._apply(fn)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   [Previous line repeated 1 more time]
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 191, in _apply
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     param.data = fn(param.data)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 377, in convert
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002:     return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002: RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1532579805626/work/aten/src/THC/THCGeneral.cpp:74
2019-07-03T22:14:39.000Z /container_e559_1557431881457_79101_01_000002: 

If it changes anything, everything is running in a Docker container. The TensorFlow version (use_pytorch: false), as well as the PyTorch code on CPU, works well.
Does anyone have any idea about what might be happening? Thanks in advance.

Is nvidia docker enabled?

@richardliaw I believe so; I will double check a little later. The strange thing is that the TensorFlow code trains fine. When I run print("Found %d CUDA devices" % torch.cuda.device_count()) in the code, it prints Found 2 CUDA devices if I give it 2 GPUs. On the other hand, if I print torch.cuda.is_available() within TorchPolicyGraph.__init__, it returns False.

Makes sense. Want to submit a patch?

On Thu, Jul 4, 2019, 7:17 AM Bogdan Mazoure notifications@github.com wrote:

Update: I managed to solve the issue by overriding TorchPolicyGraph.__init__ and changing bool(os.environ.get("CUDA_VISIBLE_DEVICES", None)) to torch.cuda.is_available().



After investigating further, changing the self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') does make the code run, but it still uses the CPU without errors. Any idea what could cause torch.cuda.is_available() to return True after ray.init, but False inside Trainer.__init__? I assume it has something to do with the GPU not being visible from inside Ray, which makes the model get mapped onto the CPU.

Ray will automatically set CUDA_VISIBLE_DEVICES inside the actor processes based on the GPU configuration. For example, rllib train --run PG --env CartPole-v0 --config='{"use_pytorch": true, "num_gpus": 1, "num_workers": 0}' will allocate 1 GPU device (so CUDA_VISIBLE_DEVICES will probably be set to something like "0", and torch.cuda.is_available() will return True).

I just tried running that command with ray==0.7.1 and latest and see non-zero GPU utilization; is that different from what you're trying?

Note that, if num_workers > 0, then the GPUs assigned to workers are controlled by "num_gpus_per_worker". Usually you don't want to assign GPUs to workers, since inference is efficient enough with CPUs. So the GPUs specified by num_gpus are only used for the learner. num_workers == 0 is a special case where both inference and learning are done in the same process.
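A small sketch of how to verify this behavior from inside a Ray task, assuming a machine with at least one GPU and a Ray version that exposes ray.get_gpu_ids():

```
import os
import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def check_gpu_visibility():
    # Ray sets CUDA_VISIBLE_DEVICES for this process based on the GPUs
    # it assigned to the task/actor.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(check_gpu_visibility.remote()))
# Expected something like ([0], '0') on a single-GPU machine.
```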

Yes, so the issue was that CUDA_VISIBLE_DEVICES was being unset from the environment (somehow). Putting os.environ('CUDA_VISIBLE_DEVICES') = '0' fixed the issue.
Thanks everyone!
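Note that, as reported further down in the thread, the call-style assignment quoted above raises SyntaxError: can't assign to function call; the working form uses item assignment on the os.environ mapping. A minimal sketch, assuming the first GPU ("0") is the one you want visible:

```
import os

# os.environ is a mapping, so use item assignment; the call syntax
# os.environ('CUDA_VISIBLE_DEVICES') = '0' is a syntax error.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # "0" assumes the first GPU

import torch  # import after setting the variable so CUDA sees the device
print(torch.cuda.is_available())
```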

Same here, and setting CUDA_VISIBLE_DEVICES is not working. If I run the training script without Ray, it works fine.

Just ran into the same problem training a CNN with tune.run:

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): PopOS (Ubuntu 18.04)
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.3
  • Python version: Python 3.7.3 :: Anaconda, Inc.

Everything was working, then I added validate_save_restore before calling run:

```
validate_save_restore(MyAgent, use_object_store=True, config={
    "args": config,
    "lr": 0.01,
    "momentum": 0.9,
    "weight_decay": 0.001,
    "step_size": 31,
    "gamma": 0.001
})
```

Which later causes torch.cuda.is_available() to return False in the tune.run workers. After removing it everything works again. Maybe that is what is causing it?

Closing this issue because it seems like this is working. Please reopen if not.

@richardliaw I am seeing a similar issue with Ray Serve on a p3.16xlarge EC2 instance. It looks like nccl, nvidia-smi, torch.cuda.device_count(), etc. are working. I am using @simon-mo's script here: https://gist.github.com/simon-mo/b5be0b95d6b79f27780d569073f5588a

I tried https://github.com/ray-project/ray/issues/3265#issuecomment-510215566 but it gave me SyntaxError: can't assign to function call

EDIT: solved by setting serve.create_backend(…, backend_config=serve.BackendConfig(…, num_gpus=1, num_replicas=num_total_gpus))

Yes, so the issue was that CUDA_VISIBLE_DEVICES was being unset from the environment (somehow). Putting os.environ('CUDA_VISIBLE_DEVICES') = '0' fixed the issue.
Thanks everyone!

thank you very much!
