TensorFlow: Bug when using multiple GPUs, related to tf.Variable being pinned to the CPU

Created on May 9, 2016 · 3 comments · Source: tensorflow/tensorflow

Environment info

Operating system: Ubuntu 14.04

Installed versions of CUDA and cuDNN: 7.5 and 4.0.7
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

If installed from source, provide the commit hash: 4a4f2461533847dde239851ecebe5056088a828c

Steps to reproduce

Run the following code:

import tensorflow as tf

def main():
    a = tf.Variable(1)
    init_a = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_a)

    with tf.device("/gpu:0"):
        b = tf.constant(2)
        init_b = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_b)

    with tf.device("/cpu:0"):
        c = tf.Variable(2)
        init_c = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_c)

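    # This step fails (see the traceback below): the int32 Variable op is
    # explicitly placed on /gpu:0, where no kernel is available for it.
    with tf.device("/gpu:0"):
        d = tf.Variable(2)
        init_d = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_d)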

if __name__ == '__main__':
    main()

Logs or other output that would be helpful

(If logs are large, please upload them as an attachment.)

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.266
pciBusID 0000:05:00.0
Total memory: 12.00GiB
Free memory: 11.02GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties: 
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2785
pciBusID 0000:09:00.0
Total memory: 4.00GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y N 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1:   N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
Traceback (most recent call last):
  File "test_multi_gpu.py", line 30, in <module>
    main()
  File "test_multi_gpu.py", line 26, in main
    sess.run(init_d)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 332, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 572, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 652, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 672, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'Variable_2': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available
     [[Node: Variable_2 = Variable[container="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:0"]()]]
Caused by op u'Variable_2', defined at:
  File "test_multi_gpu.py", line 30, in <module>
    main()
  File "test_multi_gpu.py", line 23, in main
    d = tf.Variable(2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 211, in __init__
    dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 292, in _init_from_args
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 139, in variable_op
    container=container, shared_name=shared_name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 351, in _variable
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2177, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
    self._traceback = _extract_stack()

I also noticed that the documentation on using GPUs does not mention tf.Variable; its examples only involve tf.constant and tf.matmul.

OK, I found this in [Convolutional Neural Networks](https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/index.html).
Quote:

All variables are pinned to the CPU and accessed via tf.get_variable() in order to share them in a multi-GPU version. See how-to on Sharing Variables.

Since tf.Variables are pinned to the CPU by TensorFlow, I would like to ask whether this error can be fixed. Do I have to check very carefully that every tf.Variable declaration stays outside of any with tf.device('/gpu:xx') scope, or should I handle it with a nested with tf.device(None)?

Source

myme5261314

👍2
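The nested tf.device(None) idea from the question can be sketched as follows. This is only an illustration, not the reporter's code: it assumes a TensorFlow build that ships the tf.compat.v1 graph-mode API (on the 2016 build from the report one would call tf.device, tf.constant and tf.Variable directly), and the helper name build_graph is hypothetical. It only builds the graph and inspects the requested devices, so it does not need a GPU.

```python
import tensorflow as tf

# Assumption: tf.compat.v1 is available; the original report predates it.
tf1 = tf.compat.v1


def build_graph():
    """Build a small graph where a variable escapes the enclosing GPU scope."""
    graph = tf1.Graph()
    with graph.as_default():
        with tf1.device("/gpu:0"):
            # The constant keeps the explicit GPU request.
            b = tf1.constant(2)
            # tf.device(None) drops the enclosing device request, so the
            # variable is created without an explicit device and the placer
            # is free to choose one when a session runs.
            with tf1.device(None):
                d = tf1.Variable(2)
    return graph, b, d


graph, b, d = build_graph()
print("constant device request:", repr(b.device))
print("variable device request:", repr(d.device))
```

The drawback raised in the question still applies: every variable declaration inside a GPU scope has to be wrapped by hand.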

Most helpful comment

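The high-level issue should be addressed by @vrv's ongoing work to improve device placement. (Making tf.Variable ignore tf.device() would not work, especially in the distributed setting, where many users rely on this to configure their parameter servers.) In the short term, try using soft placement in your session constructor: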

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # ...

mrry on May 9, 2016

👍23
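To show how the soft-placement suggestion above fits the failing step of the reproduction, here is a minimal sketch. It assumes the tf.compat.v1 graph-mode API rather than the 2016 build from the report, and the function name run_with_soft_placement is hypothetical. With allow_soft_placement=True, an op whose explicit device request cannot be satisfied is allowed to run on another available device instead of raising InvalidArgumentError.

```python
import tensorflow as tf

# Assumption: tf.compat.v1 is available; the original report predates it.
tf1 = tf.compat.v1


def run_with_soft_placement():
    """Initialize and read an int32 variable that was requested on /gpu:0."""
    graph = tf1.Graph()
    with graph.as_default():
        with tf1.device("/gpu:0"):
            d = tf1.Variable(2)  # same kind of variable as in the report
        init = tf1.global_variables_initializer()

    # Soft placement lets the runtime fall back to a supported device when
    # the requested one is missing or has no kernel for the op.
    config = tf1.ConfigProto(allow_soft_placement=True)
    with tf1.Session(graph=graph, config=config) as sess:
        sess.run(init)
        return sess.run(d)


value = run_with_soft_placement()
print("variable value:", value)
```

Setting log_device_placement=True on the same ConfigProto is one way to check where each op actually ended up.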

All 3 comments

So there are ops that are not valid under tf.device(), such as tf.nn.local_response_normalization().
See the code below.

    # Fragment from inside main(); it also needs `import numpy as np`.
    with tf.device("/gpu:0"):
        d = tf.placeholder("float", shape=[100, 100, 100, 10])
        with tf.device(None):
            # Created without an explicit device request.
            lrn1 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
        # Created with the explicit /gpu:0 request.
        lrn2 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
        init_d = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_d)
        r = np.random.randn(100, 100, 100, 10)
        sess.run(lrn1, feed_dict={d: r})  # Runs OK
        sess.run(lrn2, feed_dict={d: r})  # Error

The output is as follows:

Traceback (most recent call last):
  File "test_multi_gpu.py", line 44, in <module>
    main()
  File "test_multi_gpu.py", line 40, in main
    sess.run(lrn2, feed_dict={d: r})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 332, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 572, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 652, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 672, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'LRN_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available
     [[Node: LRN_1 = LRN[alpha=0.0001, beta=0.75, bias=1, depth_radius=5, _device="/device:GPU:0"](Placeholder)]]
Caused by op u'LRN_1', defined at:
  File "test_multi_gpu.py", line 44, in <module>
    main()
  File "test_multi_gpu.py", line 34, in main
    lrn2 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 737, in lrn
    bias=bias, alpha=alpha, beta=beta, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2177, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
    self._traceback = _extract_stack()

I think the cause of this error is clear enough: there is a tf.Variable inside tf.nn.local_response_normalization. From the outer code there is no way to keep the computation node on the specified GPU while excluding all of the inner variables.

For now, I think TensorFlow should do one of the two things below:

1. Make tf.Variable unaffected by tf.device(). (This may be preferable.)
2. List the ops that need tf.device(None), to help users get their code right.

Am I right?

myme5261314 on May 9, 2016

👍2
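A user-side approximation of the first proposal can be sketched with a device function, since tf.device() in graph mode also accepts a callable that maps each op to a device string. This is a sketch under stated assumptions and not something proposed in the thread: it uses the tf.compat.v1 API, the names VARIABLE_OP_TYPES and pin_variables_to_cpu are hypothetical, and the list of variable op types is a guess that may not cover every TensorFlow version. It only decides the request for the variable op itself; related ops such as initializers are not handled.

```python
import tensorflow as tf

# Assumption: tf.compat.v1 is available; the original report predates it.
tf1 = tf.compat.v1

# Hypothetical list of op types that create variables.
VARIABLE_OP_TYPES = ("Variable", "VariableV2", "VarHandleOp")


def pin_variables_to_cpu(compute_device):
    """Return a device function: variable ops go to the CPU, the rest to compute_device."""
    def device_fn(op):
        if op.type in VARIABLE_OP_TYPES:
            return "/device:CPU:0"
        return compute_device
    return device_fn


graph = tf1.Graph()
with graph.as_default():
    with tf1.device(pin_variables_to_cpu("/device:GPU:0")):
        v = tf1.Variable(2)   # variable op: requested on the CPU
        x = tf1.constant(3)   # any other op: requested on the GPU

print("variable device request:", v.device)
print("constant device request:", x.device)
```

This does not contradict the objection in the most helpful comment: it is an opt-in helper in user code, so users who place variables on parameter servers with tf.device() are unaffected.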

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # ...

mrry on May 9, 2016

👍23

Thanks for your suggestion. Using allow_soft_placement=True seems to solve the problem. As mentioned in #2292, it would be good to improve the corresponding documentation so that users are aware of this.

myme5261314 on May 11, 2016

👍9
