Compose: Support for NVIDIA GPUs under Docker Compose

Created on 9 May 2019  ·  160 Comments  ·  Source: docker/compose

Under Docker 19.03.0 Beta 2, support for NVIDIA GPUs has been introduced in the form of the new CLI option --gpus. https://github.com/docker/cli/pull/1714 talks about this enablement.

Now one can simply pass the --gpus option to run a GPU-accelerated Docker-based application.

$ docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
f476d66f5408: Pull complete 
8882c27f669e: Pull complete 
d9af21273955: Pull complete 
f5029279ec12: Pull complete 
Digest: sha256:d26d529daa4d8567167181d9d569f2a85da3c5ecaf539cace2c6223355d69981
Status: Downloaded newer image for ubuntu:latest
Tue May  7 15:52:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
:~$ 

As of today, Compose doesn't support this. This is a feature request for enabling Compose to support NVIDIA GPUs.

kind/enhancement  status/0-triage

Most helpful comment

It is an urgent need. Thank you for your effort!

All 160 comments

This is of increased importance now that the legacy 'nvidia runtime' appears broken with Docker 19.03.0 and nvidia-container-toolkit-1.0.0-2: https://github.com/NVIDIA/nvidia-docker/issues/1017

$ cat docker-compose.yml 
version: '2.3'

services:
 nvidia-smi-test:
  runtime: nvidia
  image: nvidia/cuda:9.2-runtime-centos7

$ docker-compose run nvidia-smi-test
Cannot create container for service nvidia-smi-test: Unknown runtime specified nvidia

This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

This does not: docker run --runtime=nvidia nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

Any work happening on this?

I got the new Docker CE 19.03.0 on a new Ubuntu 18.04 LTS machine, have the current and matching NVIDIA Container Toolkit (née nvidia-docker2) version, but cannot use it because docker-compose.yml 3.7 doesn't support the --gpus flag.

Is there a workaround for this?

This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

This does not: docker run --runtime=nvidia nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

You need to have

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

in your /etc/docker/daemon.json for --runtime=nvidia to continue working. More info here.
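If you'd rather not hand-edit the file, here is a minimal Python sketch that merges the nvidia runtime entry into an existing daemon.json without clobbering other keys. The default paths are assumptions; run it with enough privileges and restart the docker daemon afterwards.

```python
# Sketch: merge the "nvidia" runtime entry into daemon.json, preserving
# any other keys already present. Paths below are assumptions.
import json
import os


def add_nvidia_runtime(path="/etc/docker/daemon.json",
                       runtime_path="/usr/bin/nvidia-container-runtime"):
    """Insert the nvidia runtime into daemon.json; return the merged config."""
    config = {}
    if os.path.exists(path):
        with open(path) as f:
            config = json.load(f)
    # setdefault keeps any runtimes the file already declares
    config.setdefault("runtimes", {})["nvidia"] = {
        "path": runtime_path,
        "runtimeArgs": [],
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

Remember to restart the daemon (e.g. `systemctl restart docker`) for the change to take effect.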

ping @KlaasH @ulyssessouza @Goryudyuma @chris-crone . Any update on this?

It is an urgent need. Thank you for your effort!

Is it intended to have the user manually populate /etc/docker/daemon.json after migrating to docker >= 19.03 and removing nvidia-docker2 in order to use nvidia-container-toolkit instead?

It seems that this breaks a lot of installations. Especially, since --gpus is not available in compose.

No, this is a workaround until Compose supports the gpus flag.

install nvidia-docker-runtime:
https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup
add to /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

docker-compose:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

There is no such thing as /usr/bin/nvidia-container-runtime anymore. The issue is still critical.

This will help run an NVIDIA environment with docker-compose until docker-compose itself is fixed.

install nvidia-docker-runtime:
https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup
add to /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

docker-compose:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

This is not working for me, still getting the Unsupported config option for services.myservice: 'runtime' when trying to run docker-compose up

any ideas?

This is not working for me, still getting the Unsupported config option for services.myservice: 'runtime' when trying to run docker-compose up

any ideas?

After modifying /etc/docker/daemon.json, restart the docker service:
systemctl restart docker
use Compose format 2.3 and add runtime: nvidia to your GPU service. Docker Compose must be version 1.19.0 or higher.
docker-compose file:
version: '2.3'

services:
  nvsmi:
    image: ubuntu:16.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: nvidia-smi

@cheperuiz, you can set nvidia as the default runtime in daemon.json and you will not be dependent on docker-compose. But all your docker containers will use the nvidia runtime; I have had no issues so far.
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Ah! thank you @Kwull , i missed that default-runtime part... Everything working now :)

@uderik, runtime is no longer present in the current 3.7 compose file format schema, nor in the pending 3.8 version that should eventually align with Docker 19.03: https://github.com/docker/compose/blob/5e587d574a94e011b029c2fb491fb0f4bdeef71c/compose/config/config_schema_v3.8.json

@johncolby runtime has never been a 3.x flag. It's only present in the 2.x track, (2.3 and 2.4).

Yeah, I know, and even though my docker-compose.yml file includes version: '2.3' (which has worked in the past), it seems to be ignored by the latest versions...
For future projects, what would be the correct way to enable/disable access to the GPU? just making it default + env variables? or will there be support for the --gpus flag?

@johncolby what is the replacement for runtime in 3.X?

@Daniel451 I've just been following along peripherally, but it looks like it will be under the generic_resources key, something like:

services:
  my_app:
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'gpu'
                value: 2

(from https://github.com/docker/cli/blob/9a39a1/cli/compose/loader/full-example.yml#L71-L74)
Design document here: https://github.com/docker/swarmkit/blob/master/design/generic_resources.md

Here is the compose issue regarding compose 3.8 schema support, which is already merged in: https://github.com/docker/compose/issues/6530

On the daemon side the gpu capability can get registered by including it in the daemon.json or dockerd CLI (like the previous hard-coded runtime workaround), something like

/usr/bin/dockerd --node-generic-resource gpu=2

which then gets registered by hooking into the NVIDIA docker utility:
https://github.com/moby/moby/blob/09d0f9/daemon/nvidia_linux.go

It looks like the machinery is basically in place, probably just needs to get documented...
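For reference, the same registration should also be possible in daemon.json instead of on the dockerd command line. The key name below comes from the dockerd reference docs, but treat it as an assumption and double-check against your daemon version:

```json
{
    "node-generic-resources": ["gpu=2"]
}
```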

Any update?

Also waiting on updates, using bash with docker run --gpus until the official fix...

Waiting for updates as well.

Also waiting for updates :)

Ok... I don't understand why this is still open. These 3 additional lines make it work with schema version 3.7. Glad to know docker is responsive to trivial community issues. So clone this repo, add these three lines, run python3 setup.py build && python3 setup.py install, and make sure your docker-compose.yml is version 3.7.

[ruckc@omnilap compose]$ git diff
diff --git a/compose/config/config_schema_v3.7.json b/compose/config/config_schema_v3.7.json
index cd7882f5..d25d404c 100644
--- a/compose/config/config_schema_v3.7.json
+++ b/compose/config/config_schema_v3.7.json
@@ -151,6 +151,7 @@

         "external_links": {"type": "array", "items": {"type": "string"}, "uniqueItems": true},
         "extra_hosts": {"$ref": "#/definitions/list_or_dict"},
+        "gpus": {"type": ["number", "string"]},
         "healthcheck": {"$ref": "#/definitions/healthcheck"},
         "hostname": {"type": "string"},
         "image": {"type": "string"},
diff --git a/compose/service.py b/compose/service.py
index 55d2e9cd..71188b67 100644
--- a/compose/service.py
+++ b/compose/service.py
@@ -89,6 +89,7 @@ HOST_CONFIG_KEYS = [
     'dns_opt',
     'env_file',
     'extra_hosts',
+    'gpus',
     'group_add',
     'init',
     'ipc',
@@ -996,6 +997,7 @@ class Service(object):
             dns_opt=options.get('dns_opt'),
             dns_search=options.get('dns_search'),
             restart_policy=options.get('restart'),
+            gpus=options.get('gpus'),
             runtime=options.get('runtime'),
             cap_add=options.get('cap_add'),
             cap_drop=options.get('cap_drop'),

I just added an internal issue to track that.
Remember that PRs are welcome :smiley:

Ok... I don't understand why this is still open. These 3 additional lines make it work with schema version 3.7. Glad to know docker is responsive to trivial community issues. So clone this repo, add these three lines, run python3 setup.py build && python3 setup.py install, and make sure your docker-compose.yml is version 3.7.

[ruckc@omnilap compose]$ git diff
diff --git a/compose/config/config_schema_v3.7.json b/compose/config/config_schema_v3.7.json
index cd7882f5..d25d404c 100644
--- a/compose/config/config_schema_v3.7.json
+++ b/compose/config/config_schema_v3.7.json
@@ -151,6 +151,7 @@

         "external_links": {"type": "array", "items": {"type": "string"}, "uniqueItems": true},
         "extra_hosts": {"$ref": "#/definitions/list_or_dict"},
+        "gpus": {"type": ["number", "string"]},
         "healthcheck": {"$ref": "#/definitions/healthcheck"},
         "hostname": {"type": "string"},
         "image": {"type": "string"},
diff --git a/compose/service.py b/compose/service.py
index 55d2e9cd..71188b67 100644
--- a/compose/service.py
+++ b/compose/service.py
@@ -89,6 +89,7 @@ HOST_CONFIG_KEYS = [
     'dns_opt',
     'env_file',
     'extra_hosts',
+    'gpus',
     'group_add',
     'init',
     'ipc',
@@ -996,6 +997,7 @@ class Service(object):
             dns_opt=options.get('dns_opt'),
             dns_search=options.get('dns_search'),
             restart_policy=options.get('restart'),
+            gpus=options.get('gpus'),
             runtime=options.get('runtime'),
             cap_add=options.get('cap_add'),
             cap_drop=options.get('cap_drop'),

I tried your solution but I get a lot of errors about that flag:

ERROR: for <SERVICE_NAME>  __init__() got an unexpected keyword argument 'gpus'
Traceback (most recent call last):
  File "/usr/local/bin/docker-compose", line 11, in <module>
    load_entry_point('docker-compose==1.25.0.dev0', 'console_scripts', 'docker-compose')()
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/cli/main.py", line 71, in main
    command()
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/cli/main.py", line 127, in perform_command
    handler(command, command_options)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/cli/main.py", line 1106, in up
    to_attach = up(False)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/cli/main.py", line 1102, in up
    cli=native_builder,
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/project.py", line 569, in up
    get_deps,
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/parallel.py", line 112, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/parallel.py", line 210, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/project.py", line 555, in do
    renew_anonymous_volumes=renew_anonymous_volumes,
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 546, in execute_convergence_plan
    scale, detached, start
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 468, in _execute_convergence_create
    "Creating"
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/parallel.py", line 112, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/parallel.py", line 210, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 466, in <lambda>
    lambda service_name: create_and_start(self, service_name.number),
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 454, in create_and_start
    container = service.create_container(number=n, quiet=True)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 337, in create_container
    previous_container=previous_container,
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 913, in _get_container_create_options
    one_off=one_off)
  File "/usr/local/lib/python3.6/dist-packages/docker_compose-1.25.0.dev0-py3.6.egg/compose/service.py", line 1045, in _get_container_host_config
    cpu_rt_runtime=options.get('cpu_rt_runtime'),
  File "/usr/local/lib/python3.6/dist-packages/docker-4.0.2-py3.6.egg/docker/api/container.py", line 590, in create_host_config
    return HostConfig(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'gpus'

Do I need a specific python docker package ?

@DarioTurchi Yeah, I met the exact issue. Seems the type of HostConfig needs to be updated also.

I don't believe the change described by @ruckc is sufficient, because docker-py will also need a change. And it looks like the necessary docker-py change is still being worked on. See here:
https://github.com/docker/docker-py/pull/2419

Here is the branch with the changes:
https://github.com/sigurdkb/docker-py/tree/gpus_parameter

So if you wish to patch this in now you'll have to build docker-compose against a modified docker-py from https://github.com/sigurdkb/docker-py/tree/gpus_parameter

I don't get what is going on here:

1) I have in /etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

but the runtime key cannot be used anymore in v3.x, as per https://github.com/docker/compose/issues/6239

I have tried also:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

So I cannot start my containers with gpu support on docker-compose anymore:

bertserving_1    | I:VENTILATOR:[__i:_ge:222]:get devices
bertserving_1    | W:VENTILATOR:[__i:_ge:246]:no GPU available, fall back to CPU

Before those changes it worked, so what can I do now?

+1 it will be very useful to have such feature in docker-compose!

Any eta?

+1 would be useful feature for docker-compose

This feature would be an awesome addition to docker-compose

Right now my solution for this is to use version 2.3 of the docker-compose file format, which supports runtime, and to manually install nvidia-container-runtime (since it is no longer installed with nvidia-docker).
Also, I'm setting the runtime config in /etc/docker/daemon.json (not as the default, just as an available runtime).
With this I can use a compose file as such:

version: '2.3'
services:
  test:
    image: nvidia/cuda:9.0-base
    runtime: nvidia

Right now my solution for this is to use version 2.3 of the docker-compose file format, which supports runtime, and to manually install nvidia-container-runtime (since it is no longer installed with nvidia-docker).
Also, I'm setting the runtime config in /etc/docker/daemon.json (not as the default, just as an available runtime).
With this I can use a compose file as such:

version: '2.3'
services:
  test:
    image: nvidia/cuda:9.0-base
    runtime: nvidia

@arruda Would you mind sharing your daemon.json please?

Right now my solution for this is to use version 2.3 of the docker-compose file format, which supports runtime, and to manually install nvidia-container-runtime (since it is no longer installed with nvidia-docker).
Also, I'm setting the runtime config in /etc/docker/daemon.json (not as the default, just as an available runtime).
With this I can use a compose file as such:

version: '2.3'
services:
  test:
    image: nvidia/cuda:9.0-base
    runtime: nvidia

@arruda Would you mind sharing your daemon.json please?

Yeah, no problem, here it is:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Hi

I have an application which requires NVIDIA drivers. I have built a docker image based on (FROM)
nvidia/cudagl:10.1-runtime-ubuntu18.04

Using the approach recommended above - does it mean my image does not need to be derived from nvidia/cudagl:10.1-runtime-ubuntu18.04 ? I.e. I could simply derive from (FROM) python:3.7.3-stretch
and add runtime: nvidia to the service in docker-compose ?

Thanks

@rfsch No, that's a different thing. runtime: nvidia in docker-compose refers to the Docker runtime. This makes the GPU available to the container. But you still need some way to use them once they're made available. runtime in nvidia/cudagl:10.1-runtime-ubuntu18.04 refers to the CUDA runtime components. This lets you use the GPUs (made available in a container by Docker) using CUDA.

In this image:

[Docker architecture diagram]

runtime: nvidia replaces the runc/containerd part. nvidia/cudagl:10.1-runtime-ubuntu18.04 is completely outside the picture.
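To make the split concrete, the two pieces pair up like this; a minimal sketch (the service name is just an example):

```yaml
# docker-compose.yml -- Compose file format 2.3 (runtime: is a 2.x-only key)
version: '2.3'
services:
  app:
    # CUDA user-space libraries come from the base image...
    image: nvidia/cudagl:10.1-runtime-ubuntu18.04
    # ...while GPU device access is wired in by Docker's nvidia runtime.
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```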

we need this feature

@Daniel451 I've just been following along peripherally, but it looks like it will be under the generic_resources key, something like:

services:
  my_app:
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'gpu'
                value: 2

(from https://github.com/docker/cli/blob/9a39a1/cli/compose/loader/full-example.yml#L71-L74)
Design document here: https://github.com/docker/swarmkit/blob/master/design/generic_resources.md

Here is the compose issue regarding compose 3.8 schema support, which is already merged in: #6530

On the daemon side the gpu capability can get registered by including it in the daemon.json or dockerd CLI (like the previous hard-coded runtime workaround), something like

/usr/bin/dockerd --node-generic-resource gpu=2

which then gets registered by hooking into the NVIDIA docker utility:
https://github.com/moby/moby/blob/09d0f9/daemon/nvidia_linux.go

It looks like the machinery is basically in place, probably just needs to get documented...

Hey, @johncolby, I tried this, but failed:

ERROR: The Compose file './docker-compose.yml' is invalid because:
services.nvidia-smi-test.deploy.resources.reservations value Additional properties are not allowed ('generic_resources' was unexpected)

any suggestions?

Thanks
David

Installing nvidia-container-runtime 3.1.4.1 from https://github.com/NVIDIA/nvidia-container-runtime and putting

runtime: nvidia

works fine here with docker-compose 1.23.1 and 1.24.1 as installed from https://docs.docker.com/compose/install/ using this dodgy looking command:

sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

and e.g. the nvidia/cudagl:10.1-base image from Docker Hub. I've tried CUDA and OpenGL rendering and it's all near-native performance.

Internally tracked as COMPOSE-82
Please note that such a change need also to be implemented in docker stack (https://github.com/docker/cli/blob/master/cli/compose/types/types.go#L156) for consistency

Installing nvidia-container-runtime 3.1.4.1 from https://github.com/NVIDIA/nvidia-container-runtime and putting

runtime: nvidia

works fine here with docker-compose 1.23.1 and 1.24.1 as installed from https://docs.docker.com/compose/install/ using this dodgy looking command:

sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

and e.g. the nvidia/cudagl:10.1-base image from Docker Hub. I've tried CUDA and OpenGL rendering and it's all near-native performance.

can you share your docker-compose.yml ?

hey, @jdr-face,

here is my test following your suggestion, after installing nvidia-container-runtime on the host machine.

version: '3.0'

services:
  nvidia-smi-test:
    runtime: nvidia
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix 
    environment:
     - NVIDIA_VISIBLE_DEVICES=0 
     - DISPLAY
    image: vkcube

it still give the error:

       Unsupported config option for services.nvidia-smi-test: 'runtime'

@david-gwa as noted by andyneff earlier:

runtime has never been a 3.x flag. It's only present in the 2.x track, (2.3 and 2.4).

@david-gwa

can you share your docker-compose.yml ?

version: '2.3'

services:
    container:
        image: "nvidia/cudagl:10.1-base"

        runtime: "nvidia" 

        security_opt:
            - seccomp:unconfined
        privileged: true

        volumes:
            - $HOME/.Xauthority:/root/.Xauthority:rw
            - /tmp/.X11-unix:/tmp/.X11-unix:rw

        environment:
          - NVIDIA_VISIBLE_DEVICES=all

Depending on your needs some of those options may be unnecessary. As @muru predicted, the trick is to specify an old version. At least for my use case this isn't a problem, but I only offer this config as a workaround, really it should be made possible using the latest version.

Thanks guys, @jdr-face, @muru, compose v2 does work.
I misunderstood and thought your solution was for v3 compose.

For the record, traditionally speaking: compose v2 is not older than compose v3. They are different use cases. v3 is geared towards swarm while v2 is not. v1 is old.

Is there any discussion about the support of Docker-compose for Docker's native GPU support?

Supporting the runtime option is not the long-term solution for GPU support. NVIDIA describes the future of nvidia-docker2 at https://github.com/NVIDIA/nvidia-docker as follows.

Note that with the release of Docker 19.03, usage of nvidia-docker2 packages are deprecated since NVIDIA GPUs are now natively supported as devices in the Docker runtime.

Currently, GPU support can be realized by changing the runtime, but it is highly possible that this will not work in the future.

To be frank, this may not be the best practice, but somehow we made it work.

The tricky part is that we have to stick with docker-compose v3.x since we use Docker Swarm, while we still want the NVIDIA runtime to provide GPU/CUDA support in the containers.

To avoid explicitly specifying the NVIDIA runtime inside the docker-compose file, we set NVIDIA as the default runtime in /etc/docker/daemon.json, which looks like:

{
    "default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

With this, all containers running on the GPU machines enable the NVIDIA runtime by default.

Hope this can help someone facing a similar blocker.

To be frank, this may not be the best practice, but somehow we made it work.

The tricky part is that we have to stick with docker-compose v3.x since we use Docker Swarm, while we still want the NVIDIA runtime to provide GPU/CUDA support in the containers.

To avoid explicitly specifying the NVIDIA runtime inside the docker-compose file, we set NVIDIA as the default runtime in /etc/docker/daemon.json, which looks like:

{
    "default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

With this, all containers running on the GPU machines enable the NVIDIA runtime by default.

Hope this can help someone facing a similar blocker.

This is indeed what we do as well. It works for now, but it feels a little hacky to me. Hoping for full compose-v3 support soon. :)

Is it intended to have the user manually populate /etc/docker/daemon.json after migrating to docker >= 19.03 and removing nvidia-docker2 in order to use nvidia-container-toolkit instead?

It seems that this breaks a lot of installations. Especially, since --gpus is not available in compose.

--gpus is not available in compose, so I cannot use PyCharm with Docker to run tensorflow-gpu.

Any updates on this issue? Is there a chance that the --gpus will be supported in docker-compose soon?

For those of you looking for a workaround, this is what we ended up doing:

And then run COMPOSE_API_VERSION=auto docker-compose run gpu with the following file:

version: '3.7'

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"

Under Docker 19.03.0 Beta 2, support for NVIDIA GPUs has been introduced in the form of the new CLI option --gpus. docker/cli#1714 talks about this enablement.

Now one can simply pass the --gpus option to run a GPU-accelerated Docker-based application.

$ docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
f476d66f5408: Pull complete 
8882c27f669e: Pull complete 
d9af21273955: Pull complete 
f5029279ec12: Pull complete 
Digest: sha256:d26d529daa4d8567167181d9d569f2a85da3c5ecaf539cace2c6223355d69981
Status: Downloaded newer image for ubuntu:latest
Tue May  7 15:52:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
:~$ 

As of today, Compose doesn't support this. This is a feature request for enabling Compose to support NVIDIA GPUs.

I have solved this problem; you can try it as follows. My CSDN blog post: https://blog.csdn.net/u010420283/article/details/104055046

~$ sudo apt-get install nvidia-container-runtime
~$ sudo vim /etc/docker/daemon.json

then , in this daemon.json file, add this content:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

~$ sudo systemctl daemon-reload
~$ sudo systemctl restart docker
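With nvidia registered as the default runtime this way, the compose file itself no longer needs a runtime key, so even the 3.x format works. A minimal sketch (the image tag is just an example):

```yaml
version: '3.7'
services:
  gpu-test:
    image: nvidia/cuda:10.0-base
    command: nvidia-smi
```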

For the ansible users who want to set up the workaround described before, there is a role to install nvidia-container-runtime and configure /etc/docker/daemon.json to use runtime: nvidia:

https://github.com/NVIDIA/ansible-role-nvidia-docker

(for some reason it runs only on Ubuntu and RHEL, but it's quite easy to modify. I run it on Debian)

Then in your docker-compose.yml:

version: "2.4"
services:
  test:
    image: "nvidia/cuda:10.2-runtime-ubuntu18.04"
    command: "nvidia-smi"

Any update on an official 3.x version with gpu support? We need it on swarm :)

Is there any plan to add this feature?

This feature depends on docker-py implementing the device_requests parameters, which is what --gpus translates to. There have been multiple pull requests to add this feature (https://github.com/docker/docker-py/pull/2419, https://github.com/docker/docker-py/pull/2465, https://github.com/docker/docker-py/pull/2471) but there are no reactions from any maintainer. #7124 uses https://github.com/docker/docker-py/pull/2471 to provide it in Compose, but still no reply from anyone.

As I mentioned in #7124 I'm more than happy to make the PR more compliant but since it's gotten very little attention I don't want to waste my time in something that's not going to be merged ...
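For the curious, the Engine API field that --gpus (and the proposed device_requests option) maps to is HostConfig.DeviceRequests (Engine API v1.40+). A plain-dict sketch of that payload, independent of docker-py; field names follow the Engine API, but treat the exact defaults as assumptions and check the API reference for your version:

```python
# Sketch of the HostConfig.DeviceRequests payload (Docker Engine API >= 1.40)
# that `--gpus all` roughly expands to, built as a plain dict.

def gpu_device_request(count=-1, capabilities=None):
    """One DeviceRequests entry; count=-1 requests all available GPUs."""
    return {
        "Driver": "",                 # left empty so the daemon picks a driver
        "Count": count,               # mutually exclusive with DeviceIDs
        "DeviceIDs": [],
        "Capabilities": capabilities or [["gpu"]],
        "Options": {},
    }


# Roughly what `docker run --gpus all ...` sends in the container-create call:
payload = {"HostConfig": {"DeviceRequests": [gpu_device_request()]}}
```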

Please add this feature, will be awesome!

Please, add this feature! I was more than happy with the old nvidia-docker2, which allowed me to change the runtime in daemon.json. It would be extremely nice to have this back.

Need it, please. Really need it :/

I'd like to pile on as well... we need this feature!

I need to run both CPU and GPU containers on the same machine, so the default runtime hack doesn't work for me. Do we have any idea when this will work in compose? Given that we don't have the runtime flag in compose, this represents a serious functionality regression, does it not? I'm having to write scripts in order to make this work - yuck!

I need to run both CPU and GPU containers on the same machine, so the default runtime hack doesn't work for me. Do we have any idea when this will work in compose? Given that we don't have the runtime flag in compose, this represents a serious functionality regression, does it not? I'm having to write scripts in order to make this work - yuck!

You can do it via the docker CLI (docker run --gpus ...); I use this kind of trick (adding a proxy to be able to communicate with other containers running on other nodes in the swarm). We are all waiting for the ability to run it on swarm, because it doesn't work via the docker service command (as far as I know) nor via compose.

@dottgonzo . Well, yes ;-). I am aware of this and hence the reference to scripts. But this is a pretty awful and non-portable way of doing it so I'd like to do it in a more dynamic way. As I said, I think that this represents a regression, not a feature ask.

COMPOSE_API_VERSION=auto docker-compose run gpu

@ggregoire where do we run: COMPOSE_API_VERSION=auto docker-compose run gpu ?

@joehoeller from your shell, just as you would do for any other command.

Right now we are deciding for every project if we need 3.x features or if we can use docker-compose 2.x where the GPU option is still supported. Features like running multistage targets from a Dockerfile can sadly not be used if GPU is necessary. Please add this back in!

I'd like to recommend something like an "additional options" field for docker-compose where we can just add flags like --gpus=all to the docker start/run command, that are not yet/anymore supported in docker-compose but are in the latest docker version. This way, compose users won't have to wait for docker-compose to catch up if they need a new not yet supported docker feature.

It is still necessary to run this on Docker Swarm for production environments. Will this be useful for Docker Swarm?

@sebastianfelipe It's very useful if you want to deploy to your swarm using compose.
Compare:
docker service create --generic-resource "gpu=1" --replicas 10 \
    --name sparkWorker <image_name> \
    "service ssh start && /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077"

to something like this

docker stack deploy --compose-file docker-compose.yml stackdemo
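For reference, the stack-deploy side of that comparison can carry the GPU reservation in the compose file itself via generic_resources under deploy.resources.reservations (compose file v3). A sketch, reusing the placeholders from the service create command above:

```yaml
version: "3.8"
services:
  sparkWorker:
    image: <image_name>   # placeholder from the command above
    command: bash -c "service ssh start && /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077"
    deploy:
      replicas: 10
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "gpu"
                value: 1
```

This only works when the swarm nodes advertise a matching node-generic-resource in their daemon configuration.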

Sorry, so is it already working with Docker Swarm using the docker-compose yaml file? Just to be sure :O. Thanks!

only for docker compose 2.x

The entire point of this issue is to request nvidia-docker gpu support for docker-compose 3+

It's been almost a year since the original request!! Why the delay?? Can we move this forward ??

ping @KlaasH @ulyssessouza @Goryudyuma @chris-crone . Any update on this?

For those of you looking for a workaround, this is what we ended up doing:

And then run COMPOSE_API_VERSION=auto docker-compose run gpu with the following file:

version: '3.7'

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"

For those of you who are as impatient as I am, here's an easy pip install version of the above workaround:

pip install git+https://github.com/docker/docker-py.git@refs/pull/2471/merge
pip install git+https://github.com/docker/compose.git@refs/pull/7124/merge
pip install python-dotenv

Huge kudos to @yoanisgil !
Still anxiously waiting for an official patch. With all the PRs in place, it doesn't seem difficult by any standard.

ping @KlaasH @ulyssessouza @Goryudyuma @chris-crone . Any update on this?

No, I don't know why I was mentioned.
Could you tell me what you'd like me to do?

I hope there is an update on this.

Yeah, it's been more than a year now... why haven't they merged it into docker-py?

I'm not sure that the proposed implementations are the right ones for the Compose format. The good news is that we've opened up the Compose format specification with the intention of adding things like this. You can find the spec at https://github.com/compose-spec.

What I'd suggest we do is add an issue on the spec and then discuss it at one of the upcoming Compose community meetings (link to invite at the bottom of this page).

This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi
This does not: docker run --runtime=nvidia nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

You need to have

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

in your /etc/docker/daemon.json for --runtime=nvidia to continue working. More info here.
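Since a malformed daemon.json stops dockerd from starting at all, it's worth linting the fragment before copying it in. A quick check (the stanza below is the one from this comment, embedded as a string purely for validation):

```python
import json

# The runtimes stanza from above, embedded as a string so it can be linted
# before it goes into /etc/docker/daemon.json.
DAEMON_JSON = """
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

conf = json.loads(DAEMON_JSON)  # raises ValueError on a syntax slip
assert conf["runtimes"]["nvidia"]["path"] == "/usr/bin/nvidia-container-runtime"
print("daemon.json fragment is valid JSON")
```

If your real daemon.json already has other keys, merge this stanza into it rather than replacing the file.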

Dockerd doesn't start with this daemon.json

Christ, this is going to take years :@

This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi
@deniswal : Yes, we know this, but we are asking about compose functionality.

@chris-crone: I'm confused: this represents a regression from former behavior, so why does it need a new feature specification? Isn't it reasonable to run containers, some of which use the GPU and some of which use the CPU, on the same physical box?

Thanks for the consideration.

@vk1z AFAIK Docker Compose has never had GPU support, so this is not a regression. The part that needs design is how to declare a service's need for a GPU (or other device) in the Compose format; specifically, changes like this. After that, it should just be plumbing through to the backend.

Hi guys, I've tried some of the solutions proposed here and nothing worked for me; for example, @miriaford's did not work in my case. Also, is there some way to use the GPU to run my existing docker containers?
I have an i7 with 16 GB of RAM, but the build for some projects takes too long to complete. My goal is to also use GPU power to speed up the process. Is that possible? Thanks!

@chris-crone : Again, I will be willing to be corrected, but wasn't that because the runtime: parameter disappeared from compose after the 2.4 config? That is why I felt it was a regression. But no matter now, since we all should be on 3.x anyway.

I'd be glad to file an issue. We do that against the spec in the spec repo, correct?

but wasn't that because the runtime: parameter disappeared from compose after 2.4 config? That is why I felt that it was a regression.

Yes, exactly. I have a couple of projects where we rely on using runtime: nvidia in our docker-compose files, and this issue blocks us from upgrading to 3.x because we haven't found a way to use GPUs there.

Hi, please, please, please fix this.
This should be marked mission critical priority -20

Again, I will be willing to be corrected, but wasn't that because the runtime: parameter disappeared from compose after the 2.4 config? That is why I felt it was a regression. But no matter now, since we all should be on 3.x anyway.

I wasn't here when the change was made so I'm not 100 % sure why it was dropped. I know that you do not need the NVIDIA runtime to use GPUs any more and that we are evolving the Compose v3 spec in the open here with the intention of making a single version of the spec. This may mean moving some v2 functionality into v3.

In terms of the runtime field, I don't think this is how it should be added to the Compose spec as it is very specific to running on a single node. Ideally we'd want something that'd allow you to specify that your workload has a device need (e.g.: GPU, TPU, whatever comes next) and then let the orchestrator assign the workload to a node that provides that capability.

This discussion should be had on the specification though as it's not Python Docker Compose specific.

@chris-crone: I mostly concur with your statement. Adding short term hacks is probably the incorrect way to do this since we have a proliferation of edge devices each with their own runtimes. For example, as you point out, TPU (Google), VPU(Intel) and ARM GPU on the Pi. So we do need a more complete story.

I'll file an issue against the specification today and update this thread once I have done so. However, I do think that the orchestrator should be independent: if I want to use Kube, for example, I should be able to do so. I'm assuming that will be in scope.

I do however, disagree with the using GPUs statement, since that doesn't work with compose - which is what this is all about. But I think we all understand what problem we would like solved.

@chris-crone : Please see the docker-compose spec issue filed. I'll follow updates against that issue from now on.

Can we simply add an option (something like extra_docker_run_args) to pass arguments directly to the underlying docker run? This will not only solve the current problem, but also be future-proof: what if docker adds support for whatever "XPU", "YPU", or any other new features that might come in the future?

If we need a long back-and-forth discussion every time docker adds a new feature, it will be extremely inefficient and cause inevitable delay (and unnecessary confusion) between docker-compose and docker updates. Supporting argument delegation can provide temporary relief for this recurrent issue for all future features.
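To make the proposal concrete, the suggested field might look like this in a compose file (purely hypothetical syntax; nothing here is implemented, and extra_docker_run_args is a made-up name):

```yaml
version: "3.7"
services:
  gpu:
    image: nvidia/cuda:10.2-runtime
    command: nvidia-smi
    # Hypothetical: a raw string handed through to the container runtime.
    extra_docker_run_args: "--gpus all"
```

As the replies below point out, compose talks to the daemon over the HTTP API rather than shelling out to docker run, which is the main obstacle to this kind of pass-through.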

@miriaford I'm not sure that passing an uninterpreted blob supports the compose notion of being declarative. The old runtime tag at least indicated that it was something to do with the runtime. Given the direction in which docker is trending (docker-apps), it seems to me that doing this would make declarative deployment harder since an orchestrator would have to parse arbitrary blobs.

But I agree that compose and docker should be synchronized and zapping working features that people depend on (even though it was a major release) isn't quite kosher.

@vk1z I agree - there should be a much better sync mechanism between compose and docker. However, I don't expect such mechanism to be designed any time soon. Meanwhile we also need a temporary way to do our own synchronization without hacking deep into the source code.

If the argument delegation proposal isn't an option, what do we suggest we do? I agree it isn't a pretty solution, but it's at least _much_ better than this workaround, isn't it? https://github.com/docker/compose/issues/6691#issuecomment-616984053

@miriaford docker-compose does not call the docker executable with arguments; it actually uses docker_py, which talks to the HTTP API of the docker daemon. So there is no "underlying docker run" command. The docker CLI is not an API; the socket connection is the API point of contact. This is why it's not always that easy.

To oversimplify: in the process of running a container there are two main calls, one that creates the container and one that starts it. Each ingests different pieces of information, and knowing which is which takes API knowledge, which most of us don't have the way we know the docker CLI. I do not think being able to add extra args to docker_py calls is going to be as useful as you think, except in select use cases.

To make things even more difficult, sometimes the docker_py library is behind the API and doesn't have everything you need right away, so you have to wait for it to be updated. All that being said, extra_docker_run_args isn't a simple solution.
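To illustrate the create/start split: here's a rough sketch of the JSON body docker_py would POST to the daemon's /containers/create endpoint for a "--gpus all" style request. Field names follow the Docker Engine API v1.40 (HostConfig.DeviceRequests); the image and command are just illustrative.

```python
import json

create_body = {
    "Image": "nvidia/cuda:10.2-runtime",
    "Cmd": ["nvidia-smi"],
    "HostConfig": {
        "DeviceRequests": [
            {
                "Driver": "",               # empty lets the daemon choose the driver
                "Count": -1,                # -1 means "all available GPUs"
                "DeviceIDs": [],
                "Capabilities": [["gpu"]],  # an OR of ANDs: a list of lists
                "Options": {},
            }
        ]
    },
}

# The second call, POST /containers/{id}/start, carries no device options at
# all; everything device-related must already be in the create body above.
print(json.dumps(create_body["HostConfig"]["DeviceRequests"][0], indent=2))
```

This is why "just pass the CLI flag through" doesn't map cleanly: the flag has to be translated into this structure at create time.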

@andyneff Thanks for your explanation. Indeed, I'm not too familiar with the inner workings of Docker. If I understand correctly, there are 4 APIs that need to be manually synced for any new feature updates:

  1. Docker socket API
  2. docker_py that provides python frontend to the socket API
  3. Docker CLI (our familiar entry point to docker toolchain)
  4. Docker-compose interface that calls docker socket API

This begs the question: why is there no automatic (or at least semi-automatic) syncing mechanism? Manually propagating new feature updates across 4 APIs seems doomed to be error-prone, delay-prone, and confusing ...

P.S. I'm not saying that it's a _simple_ task to have automatic syncing, but I really think there should be one to make life easier in the future.

I'm kinda getting into pedantics now... But as I would describe it as...

  • The docker socket is THE official API for docker. It is often a file socket, but can also be TCP (or any other, I imagine using socat)
  • The docker CLI uses that API to give us users an awesome tool

    • Docker writes the API and CLI, so they are always synced at release time. (I think that's safe to say; the CLI is a first-class citizen of the docker ecosystem.)

  • The docker_py library takes that API and wraps it in an awesome library that other python projects can use. Without it you would be making all these HTTP calls yourself and pulling your hair out.

    • However docker_py was started as a third party library, and thus it has traditionally trailed the docker API, and has things added later or as needed (limited resources).

  • compose uses a version of docker_py and then adds all these awesome features, again as needed (based on issues just like this one)

    • However, compose can't do much until docker_py supports a feature (which I'm not saying is holding up this issue; I don't know, I'm just talking in general)

So yes, it goes:

  • "compose yaml+compose args" -> "docker_py" -> "docker_api"
  • And the CLI isn't any part of this, (and believe me, that's the right way to do things)

I can't speak for docker_py or compose, but I imagine they have limited man-hours contributed to them, so it's harder to keep up with ALL the crazy, insane features docker is CONSTANTLY adding. Docker itself is a Go library, and my understanding is that Python support is not (currently) a first-class citizen. Still, it is nice that both projects are under the docker umbrella, at least from a GitHub organization standpoint.


So that all being said... I too am waiting for an equivalent --gpus support, and have to use the old runtime: nvidia method instead, which will at least give me "a" path to move forward in docker-compose 2.x.

@andyneff FYI there is Docker CLI support in the latest docker-compose. It allows using buildkit for instance. https://www.docker.com/blog/faster-builds-in-compose-thanks-to-buildkit-support/

@andyneff this is a very helpful overview! Thanks again

@lig awesome! Thanks for the correction! I was actually thinking "How will buildkit fit into all this" as I was writing that up

What I am a bit surprised by is that docker-compose is a pretty intrinsic part of the new docker-app framework, and I'd imagine that they'd want to sync up docker-compose and docker for at least that reason. I wonder what the blocker really is: not enough python bandwidth? Seems a bit unbelievable.

So how does Docker Swarm fit into the structure that @andyneff just described? Swarm uses the compose file format version 3 (defined by the "compose" project?) but is developed as part of docker?

Apologies if that's off-topic for this particular issue. I've rather lost track of which issue is which but I started following this because I'd like to be able to tell a service running on a swarm that it needs to use a particular runtime. We can only do that with v2 of the compose-file spec which means we can't do it with Swarm which requires v3. In other words, I'm not really interested in what the docker-compose CLI does but only in the spec defined for docker-compose.yml files that are consumed by docker swarm.

Oh swarm, the one that got away... (from me). Unfortunately that is #6239, which got closed by a bot. :( Someone tried in #6240 but was told that...

@miriaford, it looks like there is a PR for syncing them! #6642?! (Is this just for v3???)


So because of the nature of swarm, there are certain things you do and don't do on swarm nodes. The Docker API doesn't always allow the same options on a swarm run as on a normal run. I don't know offhand if runtime is one of those things, but that is often why you can't do things in v3 (the swarm-compatible version) that you can in v2 (the non-swarm-compatible version).

No one reading this knows what you guys are talking about.
We are all trying to deploy jellyfin with hardware acceleration.
Until you guys fix this back to the way it's supposed to be, 3.x is no good when it comes to services.
Don't use it.

You need to put 2.4 for the service.
Then you can use hardware acceleration for jellyfin, ez

So come on guys, what's the ETA on this? 1 year? 2 years?

@KlaasH @ulyssessouza @Goryudyuma @chris-crone Hi, I'm working on this issue. I found that the support was missing in "docker-py" and have worked on that part. Now, to get it working, I need to pass the configs via the docker-compose.yml file. Can you help me with the schema? i.e., in order to add it, should I add a new schema, or is there an existing place where the configs could be passed?

@fakabbir I would assume it is ok to just use COMPOSE_DOCKER_CLI_BUILD for this. Adding the ability to provide an arbitrary list of docker run arguments could even help avoid similar issues in the future.

@lig how do you deal when only one service requires access to a GPU?

@lig AFAICS compose uses docker-py instead of the docker run cli. So adding an arbitrary docker run arguments wouldn't work unless docker-py supports it as well.

ref: https://github.com/docker/compose/issues/6691#issuecomment-585199425

This single thing hugely reduces the usefulness of docker-compose for many people. That it hasn't seen much attention or desire to fix it, especially when it worked in older docker-compose, is quite astonishing.
Wouldn't one way forward be to allow arbitrary docker run arguments to be given in a docker-compose file? Then --gpus all, for instance, could be passed to docker.

I understand there can be philosophical or technical reasons why one might want to do it in a particular way. But not getting hands on and doing it in ANY way staggers the mind.

@lig how do you deal when only one service requires access to a GPU?

Well, the environment variable NVIDIA_VISIBLE_DEVICES will allow you to control that, no?
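For example, in a v2.4 file (a sketch; image names are illustrative), only the service declaring runtime: nvidia gets the NVIDIA runtime, and NVIDIA_VISIBLE_DEVICES narrows which GPUs it sees, while the CPU-only service runs under the default runc runtime:

```yaml
version: "2.4"
services:
  gpu-worker:
    image: nvidia/cuda:10.2-runtime      # illustrative image
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0         # expose only GPU 0 to this service
    command: nvidia-smi
  cpu-worker:
    image: ubuntu:20.04
    command: echo "runs under the default runtime"
```

This assumes the nvidia runtime is registered in daemon.json but NOT set as default-runtime, so CPU and GPU containers can coexist on the same host.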

This single thing hugely reduces the usefulness of docker-compose for many people. That it hasn't seen much attention or desire to fix it, especially when it worked in older docker-compose, is quite astonishing.
Wouldn't one way forward be to allow arbitrary docker run arguments to be given in a docker-compose file? Then --gpus all, for instance, could be passed to docker.

I don't think allowing docker run args to be passed through is the way to go. compose does not really call docker itself but instead uses docker-py.

I understand there can be philosophical or technical reasons why one might want to do it in a particular way. But not getting hands on and doing it in ANY way staggers the mind.

A PR is open about it: https://github.com/docker/compose/pull/7124. Please feel free to "get your hands on it".

I believe that, as per the change in the docker compose spec, we should soon be back to the earlier compose 2.4 compatibility, and the nvidia runtime will work. It obviously won't work for TPUs or other accelerators, which is very unfortunate, but for those who want to run (expensive) nvidia GPUs, it will work.

So just waiting on a green PR in docker-py to be merged https://github.com/docker/docker-py/pull/2471

YEAH! The PR over at docker-py has been approved! https://github.com/docker/docker-py/pull/2471
What the next step here?

What's up here ? It would be cool to be able to support nvidia runtime in docker-compose

Now that docker/docker-py#2471 has been merged we can install docker-py from master. But since docker-compose has changed since @yoanisgil 's cool [PR](https://github.com/docker/compose/pull/7124) (Kudos!), it is unlikely to get merged. So at this point, docker-compose can be installed from that PR to save the day.

For those who ended up here without seeing the previous comments:

pip install git+https://github.com/docker/docker-py.git
pip install git+https://github.com/yoanisgil/compose.git@device-requests

Then use the following template in your compose file. (source: comment):

And then run COMPOSE_API_VERSION=auto docker-compose run gpu with the following file:

version: '3.7'

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"

I confirm that this worked on my local machine. Don't know it works with Swarm.

Can't have a particular commit of docker-compose in production. Does #7124 need to be rebased or is there another PR thats going to incorporate the new docker-py?

Hi there @bkakilli,

Thanks for the help! I just tried your suggestion, but I get an error running my docker-compose

ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services.analysis: 'device_requests'

_analysis being my container's name_

I changed my docker-compose.yml from:

version: '2.3'

services:
    analysis:
        container_name: analysis
        image: analysis:${TAG}
        runtime: nvidia
        restart: always
        ports:
            - "8000:80"

to:

version: '3.7'

services:
    analysis:
        container_name: analysis
        image: analysis:${TAG}
        device_requests:
          - capabilities:
            - "gpu"
        restart: always
        ports:
            - "8000:80"

Is there anything else apart from both pip install git+ to correctly set this up? Or perhaps I edited the configuration file badly?

@frgfm make sure you're installing compose and docker-py from the correct links. You may have used docker-compose's own repo instead of yoanisgil's fork (and branch). Check that you're using the following link:

pip install git+https://github.com/yoanisgil/compose.git@device-requests

You may try adding the --upgrade param to pip install. Otherwise I would suspect the virtual environment settings. Maybe you have another docker-compose installation which is being used by default? E.g. you may have installed it system-wide with the "Linux" instructions here: https://docs.docker.com/compose/install/. I suggest taking a look at "Alternative Install Options" and installing via pip in a virtual environment (but use the pip install command above; don't install the default docker-compose from PyPI).

Hi!
Thanks for all the info. I was trying to run your approach @bkakilli and docker-compose build worked but when running docker-compose up I got the error:
docker.errors.InvalidVersion: device_requests param is not supported in API versions < 1.40

My docker_compose.yml looks like this:

version: '3.7'

networks:
  isolation-network:
    driver: bridge

services:
  li_t5_service:
    build: .
    ports:
      - "${GRAPH_QL_API_PORT}:5001"
    device_requests:
      - capabilities:
        - "gpu"
    environment:
      - SSH_PRIVATE_KEY=${SSH_PRIVATE_KEY}
      - PYTHONUNBUFFERED=${PYTHONUNBUFFERED}
    networks: 
      - isolation-network

Thanks in advance!

@ugmSorcero Set the environment variable COMPOSE_API_VERSION=1.40 then re-run your commands

@ugmSorcero did you manage to fix that error? @EpicWink @bkakilli I'm running the version installed via the pip commands above, but I still get the "device_requests param is not supported in API versions < 1.40" error, even if I export that variable set to 1.40.

For the given compose file

version: "3.7"
services:
  spam:
    image: nvidia/cuda:10.1-cudnn7-runtime
    command: nvidia-smi
    device_requests:
      - capabilities:
          - gpu

Using the version of docker-compose installed as above, in Bash on Linux, the following command succeeds:

COMPOSE_API_VERSION=1.40 docker-compose up

The following command fails:

docker-compose up

This has error output:

ERROR: for tmp_spam_1  device_requests param is not supported in API versions < 1.40
...
docker.errors.InvalidVersion: device_requests param is not supported in API versions < 1.40

@EpicWink thank you very much. I didn't realize docker-compose up had to be executed that way; I took it as a two-step process where I first exported COMPOSE_API_VERSION separately. Running them together seems to work :)
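For what it's worth, the difference between the two invocations comes down to shell semantics: a VAR=value prefix puts the variable only into that one command's environment, while export makes it stick for the whole session. A quick demonstration (using echo as a stand-in for the compose command, and assuming the variable wasn't already exported):

```shell
# Inline prefix: the variable exists only for that single command.
COMPOSE_API_VERSION=1.40 sh -c 'echo "inline sees: $COMPOSE_API_VERSION"'
echo "afterwards: ${COMPOSE_API_VERSION:-unset}"

# Export: the variable applies to every subsequent command in the session.
export COMPOSE_API_VERSION=1.40
echo "after export: $COMPOSE_API_VERSION"
```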

I have another issue, though. If I run COMPOSE_API_VERSION=1.40 docker-compose run nvidiatest then nvidia-smi is not found in the path, while if I run directly from the image there is no issue.

Here's how I'm reproducing it.

docker-compose local file contains:

nvidiatest:
    image: nvidia/cuda:10.0-base
    device_requests:
      - capabilities:
        - gpu
    command: nvidia-smi

If I run my current setup (both api version auto and 1.40) I get the following error:

COMPOSE_API_VERSION=auto docker-compose -f docker-compose.yml -f docker-compose.local.yml run nvidiatest
Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown

Is it possible that it has to do with using override files? If I just run the cuda base image with Docker there's no problem with getting output from nvidia-smi:

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Mon Aug 24 11:40:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:29:00.0  On |                  N/A |
|  0%   46C    P8    19W / 175W |    427MiB /  7974MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I installed docker-compose following the instructions above from git after uninstalling the version installed from the official docs. Here's the info of the version installed:

pip3 show --verbose docker-compose
Name: docker-compose
Version: 1.26.0.dev0
Summary: Multi-container orchestration for Docker
Home-page: https://www.docker.com/
Author: Docker, Inc.
Author-email: None
License: Apache License 2.0
Location: /home/jurugu/.local/lib/python3.8/site-packages
Requires: docopt, docker, requests, PyYAML, texttable, websocket-client, six, dockerpty, jsonschema, cached-property
Required-by:
Metadata-Version: 2.1
Installer: pip
Classifiers:
  Development Status :: 5 - Production/Stable
  Environment :: Console
  Intended Audience :: Developers
  License :: OSI Approved :: Apache Software License
  Programming Language :: Python :: 2
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3
  Programming Language :: Python :: 3.4
  Programming Language :: Python :: 3.6
  Programming Language :: Python :: 3.7
Entry-points:
  [console_scripts]
  docker-compose = compose.cli.main:main

Am I missing anything? Thanks for the help!

@jjrugui this is becoming off-topic, and I'm not able to replicate your issue. Sorry for not being able to help

@EpicWink not a problem, and sorry for deviating from the topic :). If I figure out my particular issue I'll post it here if it's relevant.

Is someone working on another PR or are we debugging the device-requests branch in order to get ready for a PR?

While the PR is stuck, I ported changes from #7124 to the latest version from the master branch to match dependencies, etc. - https://github.com/beehiveai/compose You can install with pip install git+https://github.com/beehiveai/compose.git and change the version in docker-compose.yml to 3.8:

version: "3.8"
services:
  gpu-test:
    image: nvidia/cuda:10.2-runtime
    command: nvidia-smi
    device_requests:
      - capabilities:
          - gpu

In this setting, everything works as expected.

As discussed yesterday at the compose-spec governance meeting, we will start working on a proposal to adopt something comparable to #7124, which could be close to the generic_resources already available in the deploy section.

@ndeloof That is great! If it is possible, please post the link to the proposal here. I think many people would be happy to contribute to this since GPU support is critical for deep learning deployments.

@ndeloof historically, how long does it take the steering committee to make a decision, 6 months, a year?

+1

+1

@visheratin Any chance you can improve your fix so that it works when using multiple compose yml files? I have a base docker-compose.yml that uses a non-nvidia container, which I want to override with an nvidia container when there is a GPU. However, it seems that with your fix, if I specify multiple compose yml files with "-f", the "device_requests" field drops out of the config.

@proximous What do you mean by "drops out of the config"? Do all compose files have version 3.8? Can you share the example so it would be easier to reproduce?

Having a problem with the code in compose/service.py when trying to use the --scale option with docker-compose up. Is this not supported?

Traceback (most recent call last):
  File "/usr/local/bin/docker-compose", line 11, in <module>
    load_entry_point('docker-compose==1.27.0.dev0', 'console_scripts', 'docker-compose')()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 67, in main
    command()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 123, in perform_command
    handler(command, command_options)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 1067, in up
    to_attach = up(False)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 1063, in up
    cli=native_builder,
  File "/usr/local/lib/python3.6/site-packages/compose/project.py", line 648, in up
    get_deps,
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 108, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 206, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/site-packages/compose/project.py", line 634, in do
    override_options=override_options,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 579, in execute_convergence_plan
    renew_anonymous_volumes,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 509, in _execute_convergence_recreate
    scale - len(containers), detached, start
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 479, in _execute_convergence_create
    "Creating"
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 108, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 206, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 477, in <lambda>
    lambda service_name: create_and_start(self, service_name.number),
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 456, in create_and_start
    container = service.create_container(number=n, quiet=True)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 333, in create_container
    previous_container=previous_container,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 936, in _get_container_create_options
    one_off=one_off)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 1014, in _get_container_host_config
    element.split(',') for element in device_request['capabilities']]
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 1014, in <listcomp>
    element.split(',') for element in device_request['capabilities']]
AttributeError: 'list' object has no attribute 'split'

After further debugging, I found that when using --scale, for some reason one instance has device_request['capabilities'] as ['gpu'], but for all other containers to be started, device_request['capabilities'] instead looks like [['gpu']].

I made a temporary fix locally to get around this issue just to get my containers up and running starting at line 1010 in compose/service.py:

```
for device_request in device_requests:
    if 'capabilities' not in device_request:
        continue
    if type(device_request['capabilities'][0]) == list:
        device_request['capabilities'] = [
            element.split(',') for element in device_request['capabilities'][0]]
    else:
        device_request['capabilities'] = [
            element.split(',') for element in device_request['capabilities']]
```
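The same workaround can be restated as a standalone helper (hypothetical name; the comma split matches the element.split(',') shown in the traceback's service.py line):

```python
def normalize_capabilities(device_request):
    """Normalize the 'capabilities' entry of a device request.

    Under --scale, one container carries ['gpu'] while the scaled copies
    carry [['gpu']], so unwrap one level of nesting before performing the
    comma-split on each capability string.
    """
    caps = device_request.get("capabilities", [])
    if caps and isinstance(caps[0], list):
        caps = caps[0]  # already nested once: unwrap
    return [element.split(",") for element in caps]


# Both shapes normalize to the same nested-list structure.
assert normalize_capabilities({"capabilities": ["gpu"]}) == [["gpu"]]
assert normalize_capabilities({"capabilities": [["gpu"]]}) == [["gpu"]]
```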

@proximous What do you mean by "drops out of the config"? Do all compose files have version 3.8? Can you share the example so it would be easier to reproduce?

@visheratin see this example, am I wrong to expect a different result?

docker-compose.nogpu.yml:

version: '3.8'

services:
  df:
    build: miniconda-image.Dockerfile

docker-compose.gpu.yml:

version: '3.8'

services:
  df:
    build: nvidia-image.Dockerfile
    device_requests:
      - capabilities:
          - gpu

use only the nogpu.yml:

$ docker-compose -f docker-compose.nogpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/miniconda-image.Dockerfile
version: '3'

use only the gpu.yml:

$ docker-compose -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
    device_requests:
    - capabilities:
      - gpu
version: '3'

chain config ymls starting with a non-gpu yml (note: the device_requests section is missing):

$ docker-compose -f docker-compose.nogpu.yml -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
version: '3'

expected output:

```
$ docker-compose -f docker-compose.nogpu.yml -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
    device_requests:
      - capabilities:
          - gpu
version: '3'
```

(Obviously I'm trying to do something more elaborate; this is just a simplified case to highlight the unexpected behavior.)

@jlaule @proximous In order to keep this thread on topic, please create issues in the forked repo; I will look into them when I have time.

For those who need something while waiting: I just set up K3s (the edge version of Kubernetes) with GPU support in 30 minutes, using Docker as the container runtime (i.e. pass the --docker option to the install script). Follow https://github.com/NVIDIA/k8s-device-plugin to get the NVIDIA device plugin working.
Hope that helps!

@EpicWink not a problem, and sorry for deviating from the topic :). If I figure out my particular issue I'll post it here if it's relevant.

Did you ever resolve this?

There is no such thing as "/usr/bin/nvidia-container-runtime" anymore. The issue is still critical.

Install nvidia-docker2 as instructed here.

I've been tackling this lately and thought I'd share my approach.
My problem was that I needed docker stack deploy, and it didn't want to listen. I had docker-compose working with the Docker API version hack, but it didn't feel right, and stack deploy wouldn't work regardless.

So, without setting any runtime or device requests in my docker-compose file, I added this to my daemon.json:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=0"
  ]
}
```

You can also use GPU-{first part of the GPU GUID}, but this was easier. I didn't have to install anything via pip or anything like that except the NVIDIA Container Toolkit. It deploys and works like a charm.

Thanks a lot @haviduck, I just tried it on my own machine (Ubuntu 20.04, Docker CE 19.03.8) and it worked like a charm.
For others: don't forget to restart your Docker daemon.

@pommedeterresautee ah, I'm so glad it worked for others! I should have mentioned the reload.

Gotta say, after 3 weeks of non-stop Dockering, I'm pretty baffled at how nothing seems to work.

@haviduck: Thank you! Finally a simple solution that just works. I had spent so much time trying to add devices etc. that I gave up. Then this came along; I tried it, and after a couple of minutes I had hardware transcoding working in Plex.

I will add my 2 cents to this matter... I read lots of posts, and in the end the solution was pretty simple.

It worked for me with (maybe a slightly lower version would have worked too - no idea):
docker-compose version 1.27.4, build 40524192

  1. On the Docker host machine, install the nvidia-container-toolkit and nvidia-container-runtime packages.
  2. On the Docker host machine, run nvidia-smi and note the CUDA version that appears on the right.
  3. On the Docker host machine, run (replace the CUDA version with the one you have installed):
     docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
     You should get the same output as when you ran nvidia-smi on the host machine.
  4. In the file /etc/docker/daemon.json you should see:
     "runtimes": {"nvidia": { "path": "/usr/bin/nvidia-container-runtime","runtimeArgs": [] } }
  5. In your docker-compose YML you should add:
     runtime: nvidia

That's it!
Deploy using the YML and you will have GPU support in your docker-compose.
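Putting step 5 together, a minimal compose file might look like the sketch below (the service name `gpu-test` and the `nvidia-smi` command are illustrative assumptions, not from the comment above):

```yaml
version: '2.4'

services:
  gpu-test:                       # hypothetical service name
    image: nvidia/cuda:10.1-base  # pick the tag matching your host's CUDA driver version
    runtime: nvidia               # requires the "nvidia" runtime entry in /etc/docker/daemon.json
    command: nvidia-smi
```

Note that with docker-compose 1.27.0+ the v2/v3 schemas are merged, so `runtime:` is accepted regardless of the `version:` value.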

> 5. in your docker-compose YML you should add:
>    **runtime: nvidia**

Oh boy, this whole thread is about version 3, which does not have runtime.

Releases 1.27.0+ have merged the v2/v3 file formats, so one can use runtime anywhere now. Additionally, the specification for accelerators has landed too (https://github.com/compose-spec/compose-spec/pull/100), although it is not implemented in docker-compose yet.


FYI: the CUDA version you see in the output of nvidia-smi refers to the NVIDIA CUDA driver version, i.e. your NVIDIA driver (they call it CUDA too, which is confusing). The version number in the Docker image, e.g. nvidia/cuda:10.1-base, refers to the CUDA toolkit version (again, confusingly, the same version numbering system, but different beasts).

The driver is backwards compatible with the toolkit, so you can run any nvidia/cuda:<version>-base image you wish, as long as <version> is smaller than or equal to the driver version.

More info here: https://stackoverflow.com/questions/53422407/different-cuda-versions-shown-by-nvcc-and-nvidia-smi
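That compatibility rule can be sketched as a tiny check (a hypothetical helper doing a plain numeric comparison; real driver/toolkit compatibility tables are more nuanced):

```python
def cuda_compatible(driver_cuda: str, image_cuda: str) -> bool:
    """Return True when an image built against CUDA `image_cuda` should run
    on a host whose nvidia-smi reports CUDA `driver_cuda` (driver >= toolkit)."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(image_cuda) <= parse(driver_cuda)

print(cuda_compatible("11.2", "10.1"))  # True: older toolkit runs on a newer driver
print(cuda_compatible("10.1", "11.0"))  # False: the image needs a newer driver
```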

I don't see Docker Compose 1.27.4 binaries available for ARM-based systems.

```
$ pip install docker-compose
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: docker-compose in /usr/local/lib/python3.6/dist-packages (1.26.2)
```

Hi @collabnix!

What's the output of `pip --version`?

Compose 1.27 dropped support for Python 2 so it's possible you don't see the 1.27.x releases if your system has Python 2.

@collabnix that's because you already have Compose installed; try `pip install --upgrade docker-compose`

Now I can upgrade to 1.27.4. The pip upgrade did the trick. Thanks @kshcherban & @chris-crone.
It's crazy to see that most of the projects I worked on in the past, which use Python 2.7, really need an upgrade.

Upgrading docker-compose to 1.27.4 solved the problem.
(Maybe an upgrade to 1.19.3 would also have solved it.)

```
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
```yaml
version: '3.5'

services:
  engine:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    image: 'detectme'
    ipc: 'host'
    tty: true
    stdin_open: true
```
```
$ sudo docker-compose up -d --build
ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services.engine: 'runtime'
$ docker-compose --version
docker-compose version 1.17.1, build unknown
```
```
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.27.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
$ sudo rm /usr/bin/docker-compose
$ sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

$ sudo docker-compose up -d --build
```

It now runs well with docker-compose. On the host machine, check GPU usage with nvidia-smi.

@bttung-2020
@PyCod
If I use the devel image (not runtime) nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 from Docker Hub, will it work? How should I edit the docker-compose.yaml file?
