Sistem Operasi: Ubuntu 14.04
Versi CUDA dan cuDNN yang diinstal: 7.5 dan 4.0.7
(harap lampirkan output dari ls -l /path/to/cuda/lib/libcud*
):
Jika diinstal dari sumber, berikan hash komit: 4a4f2461533847dde239851ecebe5056088a828c
Jalankan kode berikut
import tensorflow as tf
def main():
a = tf.Variable(1)
init_a = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_a)
with tf.device("/gpu:0"):
b = tf.constant(2)
init_b = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_b)
with tf.device("/cpu:0"):
c = tf.Variable(2)
init_c = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_c)
with tf.device("/gpu:0"):
d = tf.Variable(2)
init_d = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_d)
if __name__ == '__main__':
main()
(Jika log besar, harap unggah sebagai lampiran).
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.266
pciBusID 0000:05:00.0
Total memory: 12.00GiB
Free memory: 11.02GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 1 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.2785
pciBusID 0000:09:00.0
Total memory: 4.00GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 0 to device ordinal 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:59] cannot enable peer access from device ordinal 1 to device ordinal 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y N
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 1: N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:05:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:756] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 980, pci bus id: 0000:09:00.0)
Traceback (most recent call last):
File "test_multi_gpu.py", line 30, in <module>
main()
File "test_multi_gpu.py", line 26, in main
sess.run(init_d)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 332, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 572, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 652, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 672, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'Variable_2': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available
[[Node: Variable_2 = Variable[container="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:0"]()]]
Caused by op u'Variable_2', defined at:
File "test_multi_gpu.py", line 30, in <module>
main()
File "test_multi_gpu.py", line 23, in main
d = tf.Variable(2)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 211, in __init__
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 292, in _init_from_args
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 139, in variable_op
container=container, shared_name=shared_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 351, in _variable
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2177, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
self._traceback = _extract_stack()
Saya juga memperhatikan bahwa dokumentasi untuk Menggunakan GPU tidak menyebutkan tentang tf.Variable, hanya melibatkan tf.constant dan tf.matmul.
Oke, saya menemukan dokumentasi dari [Convolutional Neural Networks] (https://www.tensorflow.org/versions/r0.8/tutorials/deep_cnn/index.html),
kutipan:
All variables are pinned to the CPU and accessed via tf.get_variable() in order to share them in a multi-GPU version. See how-to on Sharing Variables.
Saya ingin menanyakan bahwa karena tf.Variables disematkan ke CPU oleh tensorflow, dapatkah kami memperbaiki kesalahan ini? Apakah kita perlu melihat dengan sangat hati-hati untuk mengecualikan deklarasi tf.Variable di luar lingkup with tf.device('/gpu:xx')
, atau menggunakan netsted with tf.device(None)
untuk menanganinya?
Jadi, ada beberapa operasi yang tidak valid untuk tf.device(), seperti tf.nn.local_response_normalization(),
Lihat kode di bawah ini:
with tf.device("/gpu:0"):
d = tf.placeholder("float", shape=[100, 100, 100, 10])
with tf.device(None):
lrn1 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
lrn2 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
init_d = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init_d)
r = np.random.randn(100, 100, 100, 10)
sess.run(lrn1, feed_dict={d: r}) #Run ok
sess.run(lrn2, feed_dict={d: r}) # Error
Outputnya di bawah ini:
Traceback (most recent call last):
File "test_multi_gpu.py", line 44, in <module>
main()
File "test_multi_gpu.py", line 40, in main
sess.run(lrn2, feed_dict={d: r})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 332, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 572, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 652, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 672, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'LRN_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available
[[Node: LRN_1 = LRN[alpha=0.0001, beta=0.75, bias=1, depth_radius=5, _device="/device:GPU:0"](Placeholder)]]
Caused by op u'LRN_1', defined at:
File "test_multi_gpu.py", line 44, in <module>
main()
File "test_multi_gpu.py", line 34, in main
lrn2 = tf.nn.local_response_normalization(d, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 737, in lrn
bias=bias, alpha=alpha, beta=beta, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2177, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1161, in __init__
self._traceback = _extract_stack()
Alasan kesalahan ini mungkin cukup jelas menurut saya. Ada beberapa tf.Variable internal dalam tf.nn.local_response_normalization
yang tidak dapat kami gunakan di luar kode untuk tetap menjadi simpul komputasi ke gpu yang ditentukan sambil mengecualikan semua variabel internal.
Untuk saat ini, saya pikir tensorflow harus melakukan salah satu dari dua hal di bawah ini:
tf.device(None)
untuk membantu pengguna menyelesaikan kode mereka, bukan?Masalah tingkat tinggi harus diperbaiki dengan pekerjaan berkelanjutan @vrv untuk meningkatkan penempatan perangkat. (Membuat tf.Variable
mengabaikan tf.device()
tidak akan berhasil, karena banyak pengguna kami, terutama dalam pengaturan terdistribusi, menggunakan ini untuk mengonfigurasi server parameter.) Dalam jangka pendek, coba gunakan penempatan lunak di sesi Anda konstruktor:
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
# ...
Terima kasih atas saran Anda, sepertinya menggunakan allow_soft_placement=True
akan memperbaiki masalah. Seperti yang dinyatakan dalam #2292 , lebih baik untuk meningkatkan dokumen yang sesuai agar pengguna mengetahui hal ini.
Komentar yang paling membantu
Masalah tingkat tinggi harus diperbaiki dengan pekerjaan berkelanjutan @vrv untuk meningkatkan penempatan perangkat. (Membuat
tf.Variable
mengabaikantf.device()
tidak akan berhasil, karena banyak pengguna kami, terutama dalam pengaturan terdistribusi, menggunakan ini untuk mengonfigurasi server parameter.) Dalam jangka pendek, coba gunakan penempatan lunak di sesi Anda konstruktor: