Pytorch: Possible deadlock in dataloader

Created on 25 Apr 2017  ·  189 comments  ·  Source: pytorch/pytorch

The bug is described at pytorch/examples#148. I wonder whether this is a bug in PyTorch itself, since the example code looks clean to me. I also wonder whether it is related to #1120.

How much free memory do you have when the loader stops?

@apaszke If I check top, the remaining memory (with cached memory counted as used) is usually 2GB. If cached memory is not counted as used, there is always plenty, 30GB or more.

Also, I don't understand why it always stops at the beginning of validation and nowhere else.

Possibly because validation uses a separate loader, which pushes shared memory usage over the limit.

@ngimel

I just ran the program again, and it got stuck.

Output of top:

~~~
top - 17:51:18 up 2 days, 21:05,  2 users,  load average: 0.49, 3.00, 5.41
Tasks: 357 total,   2 running, 355 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.1 sy,  0.7 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65863816 total, 60115084 used,  5748732 free,  1372688 buffers
KiB Swap:  5917692 total,      620 used,  5917072 free. 51154784 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3067 aalreja   20   0  143332 101816  21300 R  46.1  0.2   1631:44 Xvnc
16613 aalreja   30  10   32836   4880   3912 S  16.9  0.0   1:06.92 fiberlamp
 3221 aalreja   20   0 8882348 1.017g 110120 S 1.579 1.6 MATLAB
 1285 root      20   0 1404848  48252  25580 S   0.3  0.1   6:00.12 dockerd
16597 ymengz+   20   0   25084   3252   2572 R 0.3 0.5 6 0:0
    1 root      20   0   33616   4008   2624 S   0.0  0.0   0:01.43 init
~~~

Output of free:

~~~
yimengzh_everyday@yimengzh:~$ free
             total       used       free     shared    buffers     cached
Mem:      65863816   60122060    5741756    9954628    1372688   51154916
-/+ buffers/cache:    7594456   58269360
Swap:      5917692        620    5917072
~~~

Output of nvidia-smi:

~~~
yimengzh_everyday@yimengzh:~$ nvidia-smi
Tue Apr 25 17:52:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 30%   42C    P8    14W / 250W |   3986MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:81:00.0     Off |                  Off |
|  0%   46C    P0    57W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16509    C   python                                        3970MiB |
+-----------------------------------------------------------------------------+
~~~

I don't think it's a memory issue.

There are separate limits for shared memory. Can you try ipcs -lm, or cat /proc/sys/kernel/shmall and cat /proc/sys/kernel/shmmax? Also, does it deadlock if you use fewer workers (e.g. test with the extreme case of 1 worker)?

@apaszke

~~~
yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmall
18446744073692774399
yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmmax
18446744073692774399
~~~

How do these look to you?

As for fewer workers, I believe it would happen less often (I can try now). But in practice I think I need that many workers.

You have a maximum of 4096 shared memory segments allowed; that may be the issue. You can increase the value by writing to /proc/sys/kernel/shmmni (maybe try 8192). You may need superuser privileges.

@apaszke Well, these are the default values on both Ubuntu and CentOS 6... Is that really an issue?

@apaszke When I run the training program, ipcs -a shows no shared memory in use at all. Is that expected?

@apaszke

~~~
yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1
~~~

I didn't try one worker. First, it would be slow; second, if the problem really is deadlocking, it would definitely disappear.

@zym1010 Default settings are not necessarily chosen with workloads like this in mind, so yes, they might be the issue. ipcs is for System V shared memory, which we aren't using, but I wanted to make sure the same limits don't apply to POSIX shared memory.

It wouldn't definitely disappear: if the problem really exists, it is likely a deadlock between a worker and the main process, and a single worker may be enough to trigger it. What is the value of torch.__version__? Are you running in Docker?

@apaszke Thanks. I understand your analysis much better now.

All the other results shown so far were obtained on an Ubuntu 14.04 machine with 64GB RAM, dual Xeon, and a Titan Black (there is also a K40, but I didn't use it).

The command that produces the problem is CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 22 --batch-size 256 /mnt/temp_drive_3/cv_datasets/ILSVRC2015/Data/CLS-LOC. I didn't modify the code at all.

I installed pytorch through pip on Python 3.5. The pytorch version is 0.1.11_5. I'm not running in Docker.

BTW, I also tried using 1 worker, but on another machine (128GB RAM, dual Xeon, 4 Pascal Titan X, CentOS 6). I ran it with CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 1 --lr 0.01 --workers 1 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC, and the error log is as follows.

Epoch: [0][5003/5005]   Time 2.463 (2.955)      Data 2.414 (2.903)      Loss 5.9677 (6.6311)    Prec@1 3.516 (0.545)    Prec@5 8.594 (2.262)
Epoch: [0][5004/5005]   Time 1.977 (2.955)      Data 1.303 (2.903)      Loss 5.9529 (6.6310)    Prec@1 1.399 (0.545)    Prec@5 7.692 (2.262)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()

top shows the following when it is stuck with 1 worker.

~~~
top - 08:34:33 up 15 days, 20:03,  0 users,  load average: 0.37, 0.39, 0.36
Tasks: 894 total,   1 running, 892 sleeping,   0 stopped,   1 zombie
Cpu(s):  7.2%us,  2.8%sy,  0.0%ni, 89.7%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132196824k total, 131461528k used,   735296k free,   347448k buffers
Swap:  2047996k total,    22656k used,  2025340k free, 125226796k cached
~~~

Another thing I found: if I modify the training code so that it doesn't go through all the batches, say it trains on only 50 batches,

if i >= 50:
    break

then the deadlock seems to disappear.

Further testing suggests that the freeze happens much more often if I run the program right after rebooting the computer. Once the machine has some data in its cache, the freeze seems to occur less frequently.

I tried, but I can't reproduce this bug in any way.

I ran into a similar issue: the data loader stops when one epoch finishes and a new epoch is about to start.

Setting num_workers = 0 works, but the program slows down.

@apaszke Have you tried rebooting the computer first and then running the program?

One thing I'd like to point out is that I installed pytorch using pip, because I have an OpenBLAS-linked numpy installed and the MKL from @soumith's anaconda cloud wouldn't play well with it.

So essentially pytorch is using MKL and numpy is using OpenBLAS. This may not be ideal, but I think it should have nothing to do with the issue here.

I looked into it, but I could never reproduce it. MKL/OpenBLAS should be unrelated to this problem. It is probably a problem with the system configuration.

@apaszke Thanks. I just tried the python from the official anaconda repo with MKL-based pytorch. Still the same problem.

I tried running the code in Docker. Still stuck.

We have the same problem when running the pytorch/examples imagenet training example (resnet18, 4 workers) inside nvidia-docker, using 1 GPU out of 4. I'll try to gather a gdb backtrace if I manage to get to the process.

At least OpenBLAS is known to have a deadlock issue in matrix multiplication that occurs relatively rarely (https://github.com/xianyi/OpenBLAS/issues/937). This bug was present at least in the OpenBLAS packaged with numpy 1.12.0.

@jsainio I also tried a pure MKL-based PyTorch (numpy linked with MKL as well) and got the same problem.

Also, the problem is solved (at least for me) if I turn off pin_memory for the data loader.

It looks as if two of the workers die.

During normal operation:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 33.2  4.7 91492324 3098288 ?    Ssl  10:51   1:10 python -m runne
user+       58 76.8  2.3 91079060 1547512 ?    Rl   10:54   1:03 python -m runne
user+       59 76.0  2.2 91006896 1484536 ?    Rl   10:54   1:02 python -m runne
user+       60 76.4  2.3 91099448 1559992 ?    Rl   10:54   1:02 python -m runne
user+       61 79.4  2.2 91008344 1465292 ?    Rl   10:54   1:05 python -m runne

After locking up:

root@b06f896d5c1d:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1 24.8  4.4 91509728 2919744 ?    Ssl  14:25  13:01 python -m runne
user+       58 51.7  0.0      0     0 ?        Z    14:27  26:20 [python] <defun
user+       59 52.1  0.0      0     0 ?        Z    14:27  26:34 [python] <defun
user+       60 52.0  2.4 91147008 1604628 ?    Sl   14:27  26:31 python -m runne
user+       61 52.0  2.3 91128424 1532088 ?    Sl   14:27  26:29 python -m runne

For one of the workers that is still alive, the beginning of the gdb stack trace looks like this:

root@b06f896d5c1d:~/mnt# gdb --pid 60
GNU gdb (GDB) 8.0
Attaching to process 60
[New LWP 65]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) bt
#0  0x00007f36f52af827 in do_futex_wait.constprop ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f36f52af8d4 in __new_sem_wait_slow.constprop.0 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f36f52af97a in sem_wait@@GLIBC_2.2.5 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f36f157efb1 in semlock_acquire (self=0x7f3656296458,
    args=<optimized out>, kwds=<optimized out>)
    at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_multiprocessing/semaphore.c:307
#4  0x00007f36f5579621 in PyCFunction_Call (func=
    <built-in method __enter__ of _multiprocessing.SemLock object at remote 0x7f3656296458>, args=(), kwds=<optimized out>) at Objects/methodobject.c:98
#5  0x00007f36f5600bd5 in call_function (oparg=<optimized out>,
    pp_stack=0x7f36c7ffbdb8) at Python/ceval.c:4705
#6  PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#7  0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0,
    closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#8  0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#9  0x00007f36f5557542 in function_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/funcobject.c:627
#10 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36561c7d08>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#11 0x00007f36f554077c in method_call (
    func=<function at remote 0x7f36561c7d08>,
    arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
    at Objects/classobject.c:330
#12 0x00007f36f5524236 in PyObject_Call (
    func=<method at remote 0x7f36556f9248>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#13 0x00007f36f55277d9 in PyObject_CallFunctionObjArgs (
    callable=<method at remote 0x7f36556f9248>) at Objects/abstract.c:2445
#14 0x00007f36f55fc3a9 in PyEval_EvalFrameEx (f=<optimized out>,
    throwflag=<optimized out>) at Python/ceval.c:3107
#15 0x00007f36f5601166 in fast_function (nk=<optimized out>, na=1,
    n=<optimized out>, pp_stack=0x7f36c7ffc418,
    func=<function at remote 0x7f36561c78c8>) at Python/ceval.c:4803
#16 call_function (oparg=<optimized out>, pp_stack=0x7f36c7ffc418)
    at Python/ceval.c:4730
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3236
#18 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=4, kws=0x7f36f5b85060, kwcount=0, defs=0x0, defcount=0,
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
    globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
    argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007f36f5557661 in function_call (
    func=<function at remote 0x7f36e14170d0>,
    arg=(<ImageFolder(class_to_idx={'n04153751': 783, 'n02051845': 144, 'n03461385': 582, 'n04350905': 834, 'n02105056': 224, 'n02112137': 260, 'n03938244': 721, 'n01739381': 59, 'n01797886': 82, 'n04286575': 818, 'n02113978': 268, 'n03998194': 741, 'n15075141': 999, 'n03594945': 609, 'n04099969': 765, 'n02002724': 128, 'n03131574': 520, 'n07697537': 934, 'n04380533': 846, 'n02114712': 271, 'n01631663': 27, 'n04259630': 808, 'n04326547': 825, 'n02480855': 366, 'n02099429': 206, 'n03590841': 607, 'n02497673': 383, 'n09332890': 975, 'n02643566': 396, 'n03658185': 623, 'n04090263': 764, 'n03404251': 568, 'n03627232': 616, 'n01534433': 13, 'n04476259': 868, 'n03495258': 594, 'n04579145': 901, 'n04266014': 812, 'n01665541': 34, 'n09472597': 980, 'n02095570': 189, 'n02089867': 166, 'n02009229': 131, 'n02094433': 187, 'n04154565': 784, 'n02107312': 237, 'n04372370': 844, 'n02489166': 376, 'n03482405': 588, 'n04040759': 753, 'n01774750': 76, 'n01614925': 22, 'n01855032': 98, 'n03903868': 708, 'n02422699': 352, 'n01560419': 1...(truncated), kw={}) at Objects/funcobject.c:627
#21 0x00007f36f5524236 in PyObject_Call (
    func=<function at remote 0x7f36e14170d0>, arg=<optimized out>,
    kw=<optimized out>) at Objects/abstract.c:2165
#22 0x00007f36f55fe234 in ext_do_call (nk=1444355432, na=0,
    flags=<optimized out>, pp_stack=0x7f36c7ffc768,
    func=<function at remote 0x7f36e14170d0>) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
    at Python/ceval.c:3275
--snip--

I had a similar error log, with the main process stuck on self.data_queue.get().
For me the problem was that I used opencv as the image loader, and the cv2.imread function hung indefinitely, without any error, on one particular imagenet image ("n01630670/n01630670_1010.jpeg").

If you say it works for you with num_workers = 0, then this isn't your problem. But I thought it might help some people with a similar error trace.
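A file that makes a native decoder hang can be hard to pin down, because a call stuck inside C code cannot be interrupted from Python. The following is a minimal diagnostic sketch, not something taken from the thread: finishes_within and slow_decode are hypothetical names, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import time


def slow_decode(seconds):
    """Stand-in for an image loader; sleeping imitates a call that hangs."""
    time.sleep(seconds)


def finishes_within(fn, args=(), timeout=10.0):
    """Run fn(*args) in a child process; True if it ends within `timeout` seconds.

    The work runs in a separate process so that a call stuck in native code
    can still be terminated from the parent.
    """
    ctx = mp.get_context("fork")  # assumption: POSIX, "fork" is available
    proc = ctx.Process(target=fn, args=args)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # still running after the timeout: treat it as a hang
        proc.join()
        return False
    return True
```

To scan a dataset, one could call finishes_within(cv2.imread, (path,), timeout=30) for each path and print the paths for which it returns False. Starting one process per file is slow, so this is only meant as a one-off check.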

I'm currently running a test with num_workers = 0, and it hasn't hung yet. I'm running the example code from https://github.com/pytorch/examples/blob/master/imagenet/main.py. pytorch/vision ImageFolder appears to use PIL or pytorch/accimage internally to load the images, so OpenCV is not involved.

With num_workers = 4, I can occasionally get the first epoch to train and validate completely, and then it locks up in the middle of the second epoch. So a problem in the dataset or the loading function is unlikely.

It looks like a race condition in ImageLoader that a particular hardware/software combination triggers relatively rarely.

@zym1010 Thanks for the pointer. I'll try setting pin_memory = False for the DataLoader too.

Interesting. On my setup, with pin_memory = False and num_workers = 4, the imagenet example hangs almost immediately and three of the workers end up as zombie processes.

root@034c4212d022:~/mnt# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
user+        1  6.7  2.8 92167056 1876612 ?    Ssl  13:50   0:36 python -m runner
user+       38  1.9  0.0      0     0 ?        Z    13:51   0:08 [python] <defunct>
user+       39  4.3  2.3 91069804 1550736 ?    Sl   13:51   0:19 python -m runner
user+       40  2.0  0.0      0     0 ?        Z    13:51   0:09 [python] <defunct>
user+       41  4.1  0.0      0     0 ?        Z    13:51   0:18 [python] <defunct>

In my setup, the dataset sits on a network disk that is read over NFS. With pin_memory = False and num_workers = 4, I can get the system to fail fairly quickly.

=> creating model 'resnet18'
- training epoch 0
Epoch: [0][0/5005]  Time 10.713 (10.713)    Data 4.619 (4.619)  Loss 6.9555 (6.9555)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Traceback (most recent call last):
--snip--
    imagenet_pytorch.main.main([data_dir, "--transient_dir", context.transient_dir])
  File "/home/user/mnt/imagenet_pytorch/main.py", line 140, in main
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/user/mnt/imagenet_pytorch/main.py", line 168, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/user/anaconda/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/home/user/anaconda/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/user/anaconda/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

@zym1010 Do you also happen to be using a network disk or a traditional spinning disk, which might have higher latency and so on?

@jsainio

I'm using a local SSD on a compute node of the cluster. The code is on an NFS drive, but the data is on the local SSD for maximum loading speed. I have never tried loading data from NFS drives.

@zym1010 Thanks for the info. I'm running this on a compute node of a cluster too.

Actually, I'm running the num_workers = 0 experiment on the same node at the same time as I try the num_workers = 4 variations. The first experiment may be generating enough load for a possible race condition to show up faster in the latter.

@apaszke When you tried to reproduce this earlier, did you happen to run two instances side by side, or with some other significant load on the system?

@jsainio Thanks for investigating this! That's weird: the workers should only exit together, and only once the main process has finished reading the data. Can you look into why they exit prematurely? Maybe check the kernel log (dmesg)?

No, I haven't tried that, but IIRC it seemed to show up even when that wasn't the case.

@apaszke OK, good to know that the workers should not have exited.

I've tried, but I don't know a good way to check why they exit. dmesg doesn't show anything relevant. (I'm running in an Ubuntu 16.04-derived Docker image, using Anaconda packages.)

One way would be to add a number of prints inside the worker loop. I have no idea why they exit silently. It's probably not an exception, because that would have been printed to stderr, so they either break out of the loop or get killed by the OS (perhaps by a signal?).
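Whether a child process was killed by a signal can be read from its exit status. The sketch below is an illustration under assumptions, not code from the thread: describe_exit is a hypothetical helper, and a child interpreter that sends itself SIGKILL stands in for a worker. On POSIX, subprocess return codes use a negative value -N for death by signal N, and multiprocessing.Process.exitcode is documented to follow the same convention.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a child's return code into a short human-readable reason."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX a negative code -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


# Stand-in for a worker: a child interpreter that sends itself SIGKILL.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

Applied to data loader workers, the same check would be made on each worker's exitcode once it is no longer alive.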

@jsainio Just to make sure: are you running docker with --ipc=host (you don't mention it)? Can you check the size of your shared memory segment (df -h | grep shm)?

@ngimel I'm using --shm-size=1024m. df -h | grep shm reports accordingly:

root@db92462e8c19:~/mnt# df -h | grep shm
shm                                                          1.0G  883M  142M  87% /dev/shm

That usage seems rather high. This is in a docker container with two zombie workers.
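The same numbers can be read from inside a training script, for example to log shared memory usage at the start of each epoch. A minimal sketch, assuming /dev/shm is the mount to watch; shm_usage is a hypothetical helper name.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (used_bytes, total_bytes, used_fraction) for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    return usage.used, usage.total, fraction
```

The fraction may differ slightly from the Use% column of df, which is computed relative to the space available to unprivileged users.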

Can you try increasing the shm size? I just checked, and on the server where I tried to reproduce the problem it was 16GB. Either change the docker flag or run

mount -o remount,size=8G /dev/shm

I just tried decreasing the size to 512MB, but I got a clear error instead of a deadlock. Still can't reproduce it 😕

With docker we tend to get deadlocks when shm is insufficient, rather than clear error messages; I don't know why. But increasing shm usually cures it (and I did get deadlocks with 1G).

OK, it seems that an error is raised with 10 workers, but when I use 4 workers I get a deadlock at 58% of /dev/shm usage! I finally reproduced it.

It's great that you can reproduce a form of this problem. I posted a script that triggers a hang in #1579, and you replied that it didn't hang on your system. I had actually only tested it on my MacBook. I just tried it on Linux, and it didn't hang. So if you only tried on Linux, it might be worth trying on a Mac as well.

OK, after investigating the problem, it seems to be a weird issue. Even when I limit /dev/shm to 128MB, Linux happily lets us create 147MB files there and mmap them fully into memory, but it sends a deadly SIGBUS to the worker once it actually tries to access the pages... I can't think of any mechanism that would let us check the validity of the pages, other than iterating over them and touching each one with a SIGBUS handler registered...

For now, a workaround is to expand /dev/shm with the mount command shown above. Try 16GB (if you have enough RAM, of course).

It's hard to find any mention of this, but here is one.
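One way to make a shortage of space visible earlier is to ask the filesystem to reserve the blocks when the file is created. This is only an illustrative sketch and not what the thread or PyTorch does: reserve_file is a hypothetical helper, and it assumes a Linux system whose tmpfs supports fallocate, in which case a full /dev/shm should surface as an OSError (ENOSPC) at allocation time rather than as a SIGBUS when the pages are first touched.

```python
import os
import tempfile


def reserve_file(path, size):
    """Create `path` with `size` bytes actually allocated, or raise OSError."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Ask the filesystem to back the whole range with real blocks now.
        os.posix_fallocate(fd, 0, size)
    except OSError:
        os.close(fd)
        os.unlink(path)  # do not leave a half-reserved file behind
        raise
    return fd


# Demonstration in a temporary directory instead of /dev/shm.
with tempfile.TemporaryDirectory() as tmp:
    fd = reserve_file(os.path.join(tmp, "segment"), 4096)
    size = os.fstat(fd).st_size
    os.close(fd)
```

The file descriptor returned by reserve_file could then be passed to mmap as usual.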

Thanks for your time on this issue; it has been driving me nuts for a long time! If I understand correctly, I need to expand /dev/shm to 16G instead of 8G. That makes sense, but when I run df -h I can see that all my RAM is already allocated as follows (I have 16G):

tmpfs              7,8G    393M  7,4G   5% /dev/shm
tmpfs              5,0M    4,0K  5,0M   1% /run/lock
tmpfs              7,8G       0  7,8G   0% /sys/fs/cgroup
tmpfs              1,6G     60K  1,6G   1% /run/user/1001

This is the output of df -h during a deadlock. As far as I understand, with a 16G SWAP partition I can mount tmpfs up to 32G, so expanding /dev/shm shouldn't be a problem, right?

More importantly, I am puzzled by the cgroup partition and its purpose, since it takes nearly half of my RAM. Apparently it is designed to manage multi-processor tasks efficiently, but I'm really not familiar with what it does and why we need it. Would it change anything to allocate all of the physical RAM to shm (because we set its size to 16G) and put it in SWAP (although I believe both would be partially in RAM and in SWAP at the same time)?

@apaszke Thanks! Great that you found the underlying cause. I was occasionally getting both various "ConnectionReset" errors and deadlocks with docker --shm-size=1024m, depending on what other load there was on the machine. I'm testing now with --shm-size=16384m and 4 workers.

@jsainio ConnectionReset might have been caused by the same thing. The processes started exchanging some data, but once shm ran out of space a SIGBUS was sent to the worker and killed it.

@ClementPinard As far as I understand, you can make it as large as you want, except that it will likely freeze your machine once you run out of RAM (because even the kernel can't free this memory). You probably don't need to worry about /sys/fs/cgroup. tmpfs partitions allocate memory lazily, so as long as the usage stays at 0B it doesn't cost you anything (including limits). I don't think using swap is a good idea, since it will make data loading much slower, so you could try increasing the shm size to, say, 12GB and limiting the number of workers (as I said, don't use all your RAM for shm!). Here is the kernel documentation.

I don't know why the deadlock happens even when /dev/shm usage is very small (it happens at 20kB on my machine). Perhaps the kernel is overly optimistic, yet doesn't wait until you fill everything up, and kills the process as soon as it starts using anything from this region.

I have now tested with 12G and half the workers I had, and it failed.
It worked like a charm in the Lua Torch version (same speed, same number of workers), which makes me wonder whether the problem is only related to /dev/shm or is closer to Python multiprocessing...

The odd thing about it (as you mentioned) is that /dev/shm is never close to full. I should keep track of /dev/shm; maybe there is a usage peak while the data loaders are being switched.

@ClementPinard Even with more shared memory, and without Docker, it can still fail.

If torch version == Lua Torch, it still might be related to /dev/shm. Lua Torch can use threads (there is no GIL), so it never needs to go through shared memory (they all share a single address space).

I had the same issue, where the data loader crashes at the start of a new training or validation epoch after complaining that it could not allocate memory. The solutions above did not work for me: (i) my /dev/shm is 32GB and never had more than 2.5GB in use, and (ii) setting pin_memory=False did not help.

Perhaps this has something to do with garbage collection? My code looks roughly like the following. I need an infinite iterator, hence the try / except around the next() below :-)

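def train():
    # train_loader (a DataLoader) and max_batches are defined elsewhere.
    train_iter = iter(train_loader)
    for i in range(max_batches):
        try:
            x, y = next(train_iter)
        except StopIteration:
            # The pass is over: start a new one and fetch from it, so this
            # iteration does not silently reuse the previous batch.
            train_iter = iter(train_loader)
            x, y = next(train_iter)
        ...
    del train_iter  # explicitly drop the last iterator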

train_loader is a DataLoader object. Without the explicit del train_iter line at the end of the function, the process always crashes after 2-3 epochs (/dev/shm still shows 2.5GB). Hope this helps!

I am using 4 workers (version 0.1.12_2 with CUDA 8.0 on Ubuntu 16.04).

I also ran into the deadlock, especially when work_number is large. Is there any possible solution for this problem? My /dev/shm size is 32GB, with cuda 7.5, pytorch 0.1.12 and python 2.7.13. Below is the related information after the process died. It seems to be memory related. @apaszke

default
image

@zhengyunqq Try pin_memory=False if you had set it to True. Otherwise, I'm not aware of any solution.

I have also hit the deadlock when num_workers is large.

For me the problem was that if a worker thread dies for whatever reason, index_queue.put hangs forever. One reason for worker threads to die is the unpickler failing during initialization. In that case, until this Python bugfix landed in master in May 2017, the worker thread would die and cause an endless hang. In my case, the hang happened in the batch prefetch priming stage.

Replacing the SimpleQueue used in DataLoaderIter with a Queue would allow a timeout with a graceful exception message.

UPD: I was mistaken, this bugfix patches Queue, not SimpleQueue. It is still true that SimpleQueue will lock up if no worker threads are online. An easy way to check this is to replace this line with self.workers = [].

I have the same problem and I can't change shm (no permission). Maybe it would be better to use Queue or something else?
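
To make the timeout idea above concrete, here is a minimal sketch. It assumes a plain `multiprocessing.Queue` standing in for the loader's internal queue; the helper name `get_with_timeout` is made up for illustration and is not part of PyTorch.

```python
import multiprocessing as mp
import queue


def get_with_timeout(q, timeout_s):
    """Fetch one item, raising a readable error instead of blocking forever."""
    try:
        # multiprocessing.Queue.get accepts a timeout; SimpleQueue.get does not.
        return q.get(timeout=timeout_s)
    except queue.Empty:
        raise RuntimeError(
            "no data received within %.1f s; a worker may have died" % timeout_s
        )


if __name__ == "__main__":
    q = mp.Queue()
    q.put("batch-0")
    print(get_with_timeout(q, 5.0))   # a queued item comes back normally
    try:
        get_with_timeout(q, 0.2)      # nothing left and no producer
    except RuntimeError as err:
        print("caught:", err)
```

The point is only that a bounded wait turns a silent hang into an error the user can act on; picking a sensible timeout for real data loading is a separate question.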

I have a similar problem.
This code freezes and never prints anything. If I set num_workers=0 it works, though.

dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)
model.cuda()
for i, batch in enumerate(dataloader):
    print(i)

If I put model.cuda() after the loop, everything runs fine.

dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)

for i, batch in enumerate(dataloader):
    print(i)
model.cuda()

Does anyone have a solution for this problem?

I've run into similar issues too, while training on ImageNet. It hangs consistently at the first iteration of evaluation on a certain server with a certain architecture (but not on other servers with the same architecture, or on the same server with a different architecture), and it is always the first iteration during evaluation on the validation set. When I was using Torch, we found that nccl can cause deadlocks like this. Is there a way to turn it off?

I'm facing the same issue. It randomly gets stuck at the start of the first epoch. None of the workarounds mentioned above work for me. When I press Ctrl-C, it prints the following:

Traceback (most recent call last):
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 44, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/queues.py", line 354, in put
    self._writer.send_bytes(obj)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
KeyboardInterrupt
Traceback (most recent call last):
  File "scripts/train_model.py", line 640, in <module>
    main(args)
  File "scripts/train_model.py", line 193, in main
    train_loop(args, train_loader, val_loader)
  File "scripts/train_model.py", line 341, in train_loop
    ee_optimizer.step()
  File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 74, in step
    p.data.addcdiv_(-step_size, exp_avg, denom)
KeyboardInterrupt

I had a similar issue, a deadlock with a single worker inside docker, and I can confirm that in my case it was the shared memory issue. By default docker only seems to allocate 64MB of shared memory, but I needed 440MB for 1 worker, which probably caused the behavior described by @apaszke.

I am struggling with the same problem, but I'm in a different environment from most others in this thread, so maybe my input can help locate the underlying cause. My pytorch is installed on Windows 10 using the excellent conda package built by peterjc123.

I am running some cnn on the cifar10 dataset. For the data loaders, num_workers is set to 1. Having num_workers > 0 is known to cause BrokenPipeError and is advised against in #494, but what I am experiencing is not BrokenPipeError but some memory allocation error. The error always occurred at around epoch 50, right after the validation of the last epoch and just before the start of training for the next epoch. 90% of the time it is exactly epoch 50, other times it is off by 1 or 2 epochs. Other than that, everything is pretty consistent. Setting num_workers=0 eliminates this problem.

@paulguerrero is right. I solved this problem by increasing the shared memory from 64M to 2G. This may be useful to docker users.
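
For anyone wanting to check this from inside the training process, a small sketch follows. It assumes a Linux system where shared memory is mounted at `/dev/shm`; the helper name `shm_usage` is made up for illustration.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in MiB for the shared-memory mount, or None."""
    if not os.path.isdir(path):
        return None  # e.g. Windows or macOS, where this mount does not exist
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.used // mib, usage.free // mib


if __name__ == "__main__":
    info = shm_usage()
    if info is None:
        print("/dev/shm not found on this system")
    else:
        print("shm total=%d MiB used=%d MiB free=%d MiB" % info)
```

Going by the comments above, a container started with Docker's defaults would be expected to report a total of about 64 MiB here, which is far below the 440MB one worker needed in that report.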

@berzjackson That's a known bug in the conda package. It is fixed in the latest CI builds.

We have ~600 people who started a new course using Pytorch on Monday. A lot of folks on our forum are reporting this problem. Some are on AWS P2, some on their own systems (mainly GTX 1070, some Titan X).

When they interrupt training, the end of the stack trace shows:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

We have num_workers=4, pin_memory=False. I've asked them to check their shared memory settings, but is there anything I can do (or that could be done in Pytorch) to make this problem go away? (Other than reducing num_workers, since that would slow things down considerably.)

I'm in the class that @jph00 (thanks Jeremy! :)) mentioned. I tried "num_workers=0" as well. I still get the same error, where resnet34 loads very slowly. The fitting is also very slow. The weird thing, though, is that this only happens once in the lifetime of a notebook session.

In other words, once the data is loaded and the fitting has run once, I can move around and keep repeating the steps, even with num_workers set to 4, and everything seems to run as fast as expected on the GPU.

I'm on PyTorch 0.2.0_4, Python 3.6.2, Torchvision 0.1.9, Ubuntu 16.04 LTS. Running "df -h" in my terminal says I have 16GB on /dev/shm, although utilization was very low.

Here is a screenshot of where the loading fails (note that I used num_workers=0 for the data).
(Sorry about the small letters. I had to zoom out to capture everything...)

screenshot 2017-11-01 13 55 46

@apiltamang I'm not sure that's the same issue. It doesn't sound like the same symptoms. Best to diagnose that on the fast.ai forum, not here.

Looking into this ASAP!

@soumith I've given @apaszke access to the course's private forum, and I've asked students with the issue to give us access to log in to their box.

@jph00 Hi Jeremy, did any of the students try increasing the shared memory as mentioned above? Did it help?

@SsnL One of the students increased the shared memory…

@jph00 Thanks! I successfully reproduced the hang caused by low shared memory. If the issue lies elsewhere, I'll have to dig deeper! Would you mind sharing the script with me?

Sure, here's the notebook we're using: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb . The students have noticed that the problem only occurs when they run all the cells in the order they appear in the notebook. Hopefully the notebook is self-explanatory, but let me know if you have any trouble running it. It includes a link to download the necessary data.

Based on the shared memory issue you were able to replicate, is there any workaround I could add to our library or notebook?

@jph00 Looking into the code right now…

I'll also send a PR to show a nice error message when the shm limit is hit, rather than just letting it hang.

OK, I've replicated the problem on a fresh AWS P2 instance, using their CUDA 9 AMI with the latest Pytorch conda install. If you provide your public key, I can give you access to try it out directly. My email is the first letter of my first name at fast.ai.

@jph00 Just sent you an email :) Thanks!

@jph00 And FYI, the script used 400MB of shared memory on my box. So it would be good for students who have this issue to check that they have enough free shm.

OK, I've figured out the basic issue, which is that opencv and Pytorch multiprocessing sometimes don't play well together. There are no problems on our box at the university, but lots of problems on AWS (on the new deep learning CUDA 9 AMI with a P2 instance). Adding locking around all cv2 calls doesn't fix it, and neither does cv2.setNumThreads(0). This seems to fix it:

from multiprocessing import set_start_method
set_start_method('spawn')

However, that costs about 15% in performance. The recommendation in the opencv github issue is to use https://github.com/tomMoral/loky . I've used that module before and found it rock solid. It's not urgent, since we have a solution that works well enough for now, but might it be worth considering Loky for Dataloader?
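
As a side note on the 'spawn' workaround: `set_start_method` changes the default for the whole process. A hedged alternative sketch, using only the standard library, is to ask for a spawn-based context and leave the global default alone (`make_spawn_context` is an illustrative name, and whether a given DataLoader version can be pointed at such a context is not covered here):

```python
import multiprocessing as mp


def make_spawn_context():
    """Return a multiprocessing context that starts children with 'spawn'.

    Unlike set_start_method(), get_context() does not change the
    process-wide default, so other code keeps whatever it was using.
    """
    return mp.get_context("spawn")


if __name__ == "__main__":
    # The __main__ guard matters with 'spawn': children re-import this module.
    ctx = make_spawn_context()
    print(ctx.get_start_method())
    # ctx.Process, ctx.Pool and ctx.Queue now create spawn-based objects.
```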

Perhaps more importantly, it would be nice if pytorch's queue at least had some kind of timeout, so that these infinite hangs get caught.
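
Until such a timeout exists, one way to at least see where a hang sits is Python's standard `faulthandler` module. The sketch below is a diagnostic aid only: it does not add a timeout to any PyTorch queue and does not interrupt the hung call. The wrapper name `run_with_hang_dump` is made up for illustration.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout_s, out=sys.stderr):
    """Call fn(); if it runs longer than timeout_s, dump all thread stacks."""
    # dump_traceback_later starts a watchdog thread that writes the
    # tracebacks of every thread to `out` once the timeout expires.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping one epoch of training in this, with a timeout well above the normal epoch time, would print a stack trace similar to the ones posted in this thread without anyone having to press Ctrl-C.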

FYI, I tried a different fix, since 'spawn' was making some parts 2-3x slower. That also fixed the problem, although it's not ideal!

Thanks for digging into this! Glad to hear you've found two workarounds. Indeed, it would be good to add a timeout when indexing into the dataset. We'll discuss it and get back to you on that route tomorrow.

cc @soumith is loky something we want to investigate?

For people who come to this thread for the discussion above, the opencv issue is discussed in more depth at https://github.com/opencv/opencv/issues/5150

OK, I seem to have a proper fix for this now. I've rewritten Dataloader to use ProcessPoolExecutor.map() and moved the creation of the tensor into the parent process. The result is faster than what I was seeing with the original Dataloader, and it has been stable on every computer I've tried it on. The code is also a lot simpler.

If anyone is interested in using it, you can get it from https://github.com/fastai/fastai/blob/master/fastai/dataloader.py .

The API is the same as the standard version, except that your Dataset must not return a Pytorch tensor. It should return numpy arrays or python lists. I haven't made any attempt to make it work on older Pythons, so I wouldn't be surprised if there are some issues there.
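
The general shape of that approach can be sketched with `concurrent.futures`. This is not the fastai implementation, only an illustration of mapping per-sample loading through an executor and assembling the batch in the parent; `load_sample` and `iter_batches` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def load_sample(index):
    """Hypothetical stand-in for Dataset.__getitem__; returns plain Python data."""
    return [index, index * index]


def iter_batches(indices, batch_size, executor):
    """Yield lists of samples, loading the items of each batch via executor.map."""
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        # Workers return lists (or numpy arrays); converting the batch to a
        # tensor would happen here, in the parent, after map() returns.
        yield list(executor.map(load_sample, chunk))


if __name__ == "__main__":
    # ThreadPoolExecutor keeps the demo dependency-free; ProcessPoolExecutor
    # exposes the same map() interface and is what the comment above describes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for batch in iter_batches(list(range(5)), 2, pool):
            print(batch)
```

Because workers only ever hand back plain lists or arrays, nothing tensor-related has to cross the process boundary, which is the design point the comment above makes.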

(The reason I went down this path is that, when doing a lot of image processing/augmentation on recent GPUs, I found I couldn't finish the processing fast enough to keep the GPU busy if I did the preprocessing with Pytorch CPU operations. Using opencv was much faster, and as a result I was able to fully utilize the GPU.)

Oh, if it's an opencv issue then there's not a lot we can do about it. It's true that forking is dangerous when you have thread pools. I don't think we want to add a runtime dependency, especially since it won't handle PyTorch tensors nicely. It would be better to figure out what's causing the deadlocks, and @SsnL is on it.

@jph00 Have you tried Pillow-SIMD…

Yes, I know pillow-SIMD well. It only speeds up resize, blur and RGB conversion.

I don't agree that there's not a lot you can do here. It's not exactly an opencv issue (they don't claim to support this type of python multiprocessing in general, let alone pytorch's special-cased multiprocessing module), and it's not exactly a Pytorch issue either. But the fact that Pytorch silently waits forever without giving any kind of error is (IMO) something you can fix, and more generally, a lot of smart people have worked hard over the last few years to create improved multiprocessing approaches that avoid problems just like this one. You could borrow from the approaches they use without bringing in an external dependency.

Olivier Grisel, one of the people behind Loky, has a great slide deck summarizing the state of multiprocessing in Python: http://ogrisel.github.io/decks/2017_euroscipy_parallelism/

I don't mind either way, since I've now written a new Dataloader that doesn't have the problem. But FWIW, I do suspect that interactions between pytorch's multiprocessing and other systems will be an issue for other people in the future too.

For what it's worth, I had this issue with Python 2.7 on Ubuntu 14.04. My data loader read from a sqlite database and worked perfectly with num_workers=0, sometimes seemed OK with num_workers=1, and deadlocked very quickly for any higher value. Stack traces showed the process hung in recv_bytes.

Things that didn't work:

  • Passing --shm-size 8G or --ipc=host when launching docker
  • Running echo 16834 | sudo tee /proc/sys/kernel/shmmni to increase the number of shared memory segments (the default was 4096 on my machine)
  • Setting pin_memory=True or pin_memory=False; neither helped

What reliably fixed my issue was porting my code to Python 3. Launching the same version of Torch inside a Python 3.6 instance (from Anaconda) completely fixed my issue, and data loading no longer hangs.

@apaszke FYI, here's why working well with opencv is important (and why torchsample isn't a great option: it can handle rotation of <200 images/sec!):
image

Has anybody found a solution to this problem?

@iqbalu Try the script above: https://github.com/fastai/fastai/blob/master/fastai/dataloader.py
It solved my issue, but it doesn't support num_workers=0.

@elbaro Actually I tried it, and in my case it didn't use multiple workers at all. Did you change anything there?

@iqbalu The fast.ai data loader never spawns worker processes. It only uses threads, so they may not show up in some tools.

@apaszke @elbaro @jph00 The data loader from fast.ai slowed down data reading by more than 10x. I am using num_workers=8. Any hint as to what the reason could be?

It's likely that the data loader uses packages that don't release the GIL.

@apaszke Shared memory usage over a few…

@iqbalu Not really. That shouldn't be happening.

I tried many things, and cv2.setNumThreads(0) finally solved my issue.

Thanks @jph00

I have been struggling with this problem recently. cv2.setNumThreads(0) doesn't work for me. I even changed all the cv2 code to use scikit-image instead, but the problem still exists. Besides, I have 16G for /dev/shm. I only have this problem when using multiple GPUs. Everything works fine on a single GPU. Does anyone have new thoughts on a solution?

Same error. I have this problem when using a single GPU.

For me, disabling opencv threads solved the problem:
cv2.setNumThreads(0)

I hit it too with pytorch 0.3, cuda 8.0, ubuntu 16.04.
No opencv used.

I am using pytorch 0.3, cuda 8.0, ubuntu 14.04. I observed this hang after I started using cv2.resize().

cv2.setNumThreads(0) solved my problem.

I am using python 3.6, pytorch 0.3.0, cuda 8.0 and ubuntu 17.04 on a system with two 1080Ti cards and 32GB of RAM.

I frequently get deadlocks when I use 8 workers for my own dataset (it happens in the first epoch). When I reduce the workers to 4, it disappears (I ran 80 epochs).

When the deadlock happens, I still have ~10GB of free RAM.

screenshot from 2018-03-02 19-57-47

Here you can see the logs after terminating the script: https://gist.github.com/milani/42f50c023cdca407115b309237d29c70

Update: I can confirm that increasing SHMMNI solves the issue for me. On Ubuntu 17.04, I added kernel.shmmni=8192 to /etc/sysctl.conf.
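
To see what the limit currently is before changing it, the value can be read back from `/proc`, the same path used with `tee` in an earlier comment. A small sketch, assuming Linux (the helper name `read_shmmni` is made up):

```python
import os


def read_shmmni(path="/proc/sys/kernel/shmmni"):
    """Return the kernel's limit on shared memory segments, or None if unavailable."""
    if not os.path.exists(path):
        return None  # not Linux, or /proc is not mounted
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    print("kernel.shmmni =", read_shmmni())
```

A line added to /etc/sysctl.conf normally only takes effect after a reboot or after reloading the file (for example with `sudo sysctl -p`), so reading the value back is a quick way to confirm the change was applied.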

I am also experiencing this issue on Ubuntu 17.10, Python 3.6, Pytorch 0.3.1, CUDA 8.0. There is plenty of RAM left when the deadlock happens, and the timing appears to be inconsistent: it can happen after the first epoch or after the 200th.

The combination of kernel.shmmni=8192 and cv2.setNumThreads(0) seems to have fixed it, while neither worked individually.

Same in my case. I got a deadlock when I set num_workers=4. I use Ubuntu 17.10, Pytorch 0.3.1, CUDA 9.1, python 3.6. There are 4 python threads, each occupying 1.6GB of memory, while the CPU (4 cores) stays idle. Setting num_workers=0 helps to work around this issue.

I have the same issue. It freezes after exactly one epoch, but isn't really reproducible on smaller datasets. I'm using CUDA 9.1, Pytorch 0.3.1, Python 3.6 in a Docker environment.
I tried @jph00's Dataloader, but in my use…

I've experienced exactly the same issue with Ubuntu 17.10, CUDA 9.1, Pytorch master (compiled on the morning of 19/04). I also use OpenCV in my Dataset subclass.

I was then able to avoid the deadlock by changing the multiprocessing start method from 'forkserver' to 'spawn'.

# Set multiprocessing start method - deadlock
set_start_method('forkserver')

# Set multiprocessing start method - runs fine
set_start_method('spawn')

I tried almost all of the approaches above! None of them worked!
This problem might be related to some incompatibility with the hardware architecture, and I have no idea how Pytorch could trigger it! It may or may not be a Pytorch issue!

Here is how my problem got solved:
Update your BIOS!

Give it a go. At least it solved my problem.

Same here. Ubuntu, Pytorch 0.4, python 3.6.

The problem still seems to exist in pytorch 0.4 and python 3.6. I'm not sure whether it is a pytorch problem. I use opencv and set num_workers=8, pin_memory=True. I tried all the tricks mentioned above, and setting cv2.setNumThreads(0) solved my problem.

(1) Setting num_workers=0 in PyTorch data loading solves the problem (see above), OR
(2) cv2.setNumThreads(0) solves the problem, even with a reasonably large num_workers.

This looks like some kind of thread locking issue.

I put cv2.setNumThreads(0) near the beginning of my main python file, and I haven't had this problem since.
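
A minimal sketch of that workaround, written so that it can also be reused as a per-worker initializer. The helper name `disable_opencv_threads` is made up, and passing it as `worker_init_fn` assumes a PyTorch version whose DataLoader accepts that argument; several comments above report that `cv2.setNumThreads(0)` did not help in their setup, so treat it as one thing to try rather than a guaranteed fix.

```python
def disable_opencv_threads(worker_id=0):
    """Turn off OpenCV's internal thread pool; returns False if cv2 is absent."""
    try:
        import cv2
    except ImportError:
        return False
    cv2.setNumThreads(0)
    return True


if __name__ == "__main__":
    # In the main process, before any DataLoader is created:
    disable_opencv_threads()
    # Optionally also inside each worker (assumption: this PyTorch version's
    # DataLoader supports worker_init_fn, which is called with the worker id):
    #   loader = DataLoader(dataset, num_workers=4,
    #                       worker_init_fn=disable_opencv_threads)
```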

Yes, most of these problems are caused by third-party libraries not being fork-safe. One alternative is to use the spawn start method.

In my case, the deadlock shows up when I wrap the model in nn.DataParallel and use num_workers > 0 in the data loader. After removing the nn.DataParallel wrapper, I can run my script without any locking.
CUDA_VISIBLE_DEVICES=0 python myscript.py --split 1
CUDA_VISIBLE_DEVICES=1 python myscript.py --split 2

Without multiple GPUs my script runs slower, but I can run several experiments at the same time on different splits of the dataset.
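
The two shell commands above can also be driven from a small launcher. This is only a sketch: `myscript.py` and `--split` come from the comment above, while `env_for_gpu` and `launch_splits` are made-up helper names.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and restrict CUDA to a single device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_splits(script, n_gpus):
    """Start one process per GPU, passing --split 1..n_gpus, and wait for all."""
    procs = [
        subprocess.Popen(
            [sys.executable, script, "--split", str(gpu + 1)],
            env=env_for_gpu(gpu),
        )
        for gpu in range(n_gpus)
    ]
    return [p.wait() for p in procs]
```

Calling `launch_splits("myscript.py", 2)` would correspond to the two commands shown above, with each process seeing exactly one GPU.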

I have the same problem with Python 3.6.2 / Pytorch 0.4.0.
I tried all of the above: toggling pin_memory, changing the shared memory size, and I use the skimage library (I'm not using cv2!!), but I still have the problem.

The problem occurs randomly. The only way I can keep it under control is to watch the console and restart training.

@jinh574 I just set the number of data loader workers to 0

@Shuailong I have to use large images, so I can't use that parameter because of speed. I need to investigate this problem further.

I have the same problem with Python 3.6 / Pytorch 0.4.0. Does the pin_memory option have an effect?

If you are using collate_fn and num_workers>0 with PyTorch version < 0.4:

DO NOT RETURN ZERO DIM tensors from your __getitem__() function.
OR return them as NUMPY arrays.

I still have this issue even after setting num_workers=0 or cv2.setNumThreads(0).

It fails with one of these two traces. Is anybody else facing the same thing?

Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 209, in <module>
    main()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 205, in main
    process.wait()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 96, in _worker_loop
    r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt

I am using version '0.5.0a0+f57e4ce' and ran into the same problem. Either turning off the parallel data loader (num_workers=0) or setting cv2.setNumThreads(0) works.

I'm fairly confident that #11985 should eliminate all hangs (unless you interrupt at an unlucky moment that we can't control). Closing this now that it's merged.

Hangs with cv2 are also out of our control, since cv2 doesn't play well with multiprocessing.

I am still experiencing this as of torch_nightly-1.0.0.dev20181029. Has the PR not been merged there yet?

@Evpok It has been merged there. You must have this patch. I wonder whether there are further deadlocks still possible. Do you have a simple repro we could look at?

I actually tracked it down to an unrelated multiprocessing mess on my side. Sorry for the inconvenience.

Hi @Evpok
I use torch_nightly-1.0.0 and hit this problem. Did you solve it?

If you are using collate_fn and num_workers>0 with PyTorch version < 0.4:

DO NOT RETURN ZERO DIM tensors from your __getitem__() function.
OR return them as NUMPY arrays.

I fixed the bug of returning a zero-dim tensor, and the problem still exists.

@zimenglan-sysu-512 The main problem was a limitation of multiprocessing. When using spawn or forkserver (which is needed for CPU-GPU communication), sharing objects between processes is quite limited, and it is not suitable for the kind of objects I have to manipulate.

None of this worked for me. However, the latest opencv works (3.4.0.12 to 3.4.3.18, nothing else to change).
sudo pip3 install --upgrade opencv-python

@see-- Glad to know that opencv fixed their issue :)

I am using OpenCV 3.4.3.18 with python2.7 and still see the deadlock happening. :/

Try the following:

from torch.utils.data.dataloader import DataLoader

instead of

from torch.utils.data import DataLoader

I think there is a problem with type checking here:
https://github.com/pytorch/pytorch/blob/656b565a0f53d9f24547b060bd27aa67ebb89b88/torch/utils/data/dataloader.py#L816

Try the following:

from torch.utils.data.dataloader import DataLoader

instead of

from torch.utils.data import DataLoader

I think there is a problem with type checking here:

pytorch/torch/utils/data/dataloader.py

Line 816 in 656b565
super(DataLoader, self).__setattr__(attr, val)

Isn't this just an alias? torch.utils.data.__init__ imports dataloader.DataLoader.

I also had hangs with num_workers > 0. My code has no opencv, and memory usage in /dev/shm is not an issue. The suggestions above didn't work for me. My fix was to update numpy from 1.14.1 to 1.14.5:
conda install numpy=1.14.5
Hope it helps.

Hmm, my numpy version is 1.15.4, so newer than 1.14.5... Should that be OK then?

Idk, my numpy update also updated mkl.

Which mkl version do you have? Mine is 2019.1 (build 144), and the other packages with mkl in their name are:

mkl-service 1.1.2 py37he904b0f_5
mkl_fft 1.0.6 py37hd81dba3_0
mkl_random 1.0.2 py37hd81dba3_0

conda list | grep mkl
mkl                       2018.0.1             h19d6760_4
mkl-service               1.1.2            py36h17a0993_4

If you still see hangs with the latest pytorch, it would help a lot if you could provide a short script that reproduces the issue. Thanks!

I'm still seeing this deadlock. I'll see if I can put together a script that reproduces it.

pin_memory=True solved the problem for me.

It does not seem to work for me with pin_memory=True. It still gets stuck after 70 epochs. The only thing that has worked for me so far is setting num_workers=0, but it is noticeably slower.

I'm also experiencing deadlocks (they happen quite randomly). I tried pin_memory and updated Numpy. I'll try running on a different machine.

If you are using multiple threads with a data loader in them, try using multiprocessing instead of multithreading. This completely solved the issue for me (by the way, it's also better for compute-intensive tasks in Python because of the GIL).

Same error with Pytorch1.0, Pillow5.0.0, numpy1.16.1, python3.6.

I am also getting the same error. I tried pin_memory=True and num_workers=0. I noticed that the error does not occur when I use a small portion of the dataset. It only occurs when I use the whole dataset.

Edit: simply restarting the system fixed it for me.

I had a similar issue. In some code, this function would (almost always) hang at d_iter.next():

def get_next_batch(d_iter, loader):
    try:
        data, label = d_iter.next()
    except StopIteration:
        d_iter = iter(loader)
        data, label = d_iter.next()
    return data, label

๋‚˜๋ฅผ ์œ„ํ•ด ์ผํ•œ ํ•ดํ‚น์€์ด ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœ ํ•œ ํ›„ ์•ฝ๊ฐ„์˜ ์ง€์—ฐ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ์—ˆ์Šต๋‹ˆ๋‹ค.

trn_X, trn_y = get_next_batch(train_data_iter, train_loader)
time.sleep(0.003)
val_X, val_y = get_next_batch(valid_data_iter, valid_loader)

I guess the delay helped avoid some deadlock?
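
A side note on get_next_batch as posted: assigning `d_iter = iter(loader)` inside the function only rebinds the local name, so the caller keeps holding the exhausted iterator and every later call builds a fresh one (with num_workers > 0 that would presumably mean starting a new set of workers each time). A minimal sketch that hands the refreshed iterator back to the caller, assuming only that `loader` can be iterated repeatedly, as a DataLoader can; the name `next_batch` is made up for illustration:

```python
def next_batch(d_iter, loader):
    """Return (batch, iterator); the iterator is replaced once it is exhausted."""
    try:
        batch = next(d_iter)
    except StopIteration:
        # Start a new pass and return the new iterator so the caller can keep it.
        d_iter = iter(loader)
        batch = next(d_iter)
    return batch, d_iter


# Usage with a plain list standing in for a DataLoader.
loader = [("x0", "y0"), ("x1", "y1")]
it = iter(loader)
seen = []
for _ in range(5):
    batch, it = next_batch(it, loader)
    seen.append(batch[0])
```

The caller reassigns its iterator on every call, so a new pass is started only when the previous one has actually finished.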

I'm still having this problem, using pytorch 1.0 and python 3.7. The bug shows up when I use multiple data_loaders. With fewer than 3 data_loaders, or with a single GPU, it does not show up. Tried:

  1. time.sleep(0.003)
  2. pin_memory=True/False
  3. num_workers=0/1
  4. from torch.utils.data.dataloader import DataLoader
  5. writing 8192 to /proc/sys/kernel/shmmni
    None of them works. Does anyone know of a solution?

My solution is to add cv2.setNumThreads(0) in the preprocessing code.
I have 2 dataloaders, one for train and one for val.
Without it, I could run the evaluator only once.

I just ran into this bug on pytorch 1.1. It hung twice at the same place: the end of the 99th epoch. pin_memory was set to False.

Same problem when using workers > 0; pin memory did not fix it.

This solution works for me. Thanks.

The dataloader stops when I finish an epoch and start a new one.

I met the same problem. In my case it appeared when I installed opencv-python (I had installed opencv3 before). After removing opencv-python, training no longer stops.

That is a good idea too.

Still trying to find a workaround. I agree that the problem only seems to appear when running 2 parallel processes on different GPUs at the same time. One keeps going and the other hangs.

With num_workers=4, my program stalls for several seconds (or minutes) every 4 batches, which wastes a lot of time. Any ideas on how to fix it?

Adding the flags pin_memory=True and num_workers=0 to the data loader is the solution!

@ArturoDeza
That might be a solution. But setting num_workers=0 slows down overall data fetching on the CPU, and GPU utilization becomes very low.

For me the cause was that the system did not have enough CPUs, or that the num_workers given to the Dataloader was not sufficient. If the __getitem__ method of your dataset uses a threaded library such as numpy, librosa or opencv, it may also be a good idea to disable threading inside the dataloader workers (see below for why this matters). You can do that by launching the training script as OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py. To make the explanation below clear: each Dataloader batch is handled by a single worker. Each worker processes batch_size samples to complete one batch and then starts on a new batch of data.

You should set num_workers lower than the number of CPUs in the machine (or in the pod, if you are using Kubernetes), but high enough that data is always ready for the next iteration. If the GPU runs each iteration in t seconds and each dataloader worker takes N*t seconds to load/process a single batch, then num_workers should be at least N, which in turn means you need at least N CPUs.

Unfortunately, if the Dataloader uses a library that runs K threads, the number of threads spawned becomes num_workers*K = N*K. That can be far more than the number of CPUs in the system. The pod then gets throttled and the Dataloader becomes very slow. As a result the Dataloader may fail to return a batch every t seconds, and the GPU stalls.
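
The sizing argument above can be written down as a small calculation. This is only a sketch of the rule of thumb in this comment; the function names and the example timings are invented for illustration:

```python
import math
import os


def min_workers(step_time_s, batch_load_time_s):
    """Smallest worker count that keeps one batch ready per GPU step.

    If the GPU needs t seconds per iteration and a worker needs N*t seconds
    per batch, then N workers are required (rounded up).
    """
    return math.ceil(batch_load_time_s / step_time_s)


def oversubscribed(num_workers, threads_per_worker, cpus=None):
    """True if workers times library threads exceeds the available CPUs."""
    if cpus is None:
        cpus = os.cpu_count() or 1
    return num_workers * threads_per_worker > cpus


n = min_workers(step_time_s=0.25, batch_load_time_s=1.0)
busy = oversubscribed(num_workers=n, threads_per_worker=8, cpus=16)
calm = oversubscribed(num_workers=n, threads_per_worker=1, cpus=16)
```

With these made-up numbers, four workers are enough to feed the GPU, but eight library threads per worker would ask for 32 threads on a 16-CPU machine, while one thread per worker stays within budget.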

One way to avoid the K threads is to launch the main script as OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py. This restricts each Dataloader worker to a single thread and keeps the system from being overloaded. You still need enough num_workers to keep the GPU fed.
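
If changing the launch command is inconvenient, the same limits can presumably be set from inside the script. A minimal sketch, assuming the variables are set before numpy or any other OpenMP/MKL-backed library is imported (these libraries generally read them when they initialise); `limit_library_threads` is a made-up helper name:

```python
import os


def limit_library_threads(n=1):
    """Set the thread-count variables named in the comment above.

    Call this at the very top of train.py, before importing numpy etc.
    Child processes inherit the parent's environment.
    """
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)


limit_library_threads(1)
```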

You should also optimize the code in __getitem__ so that each worker can finish its batch quickly. Make sure the time a worker needs to preprocess a batch is not dominated by the time to read training data from disk (especially when reading from network storage) or by network bandwidth (when reading from a network disk). If the dataset is small and you have enough RAM, consider moving the dataset into RAM (or /tmpfs) and reading it from there for fast access. On Kubernetes you can create a RAM disk (search for emptyDir in Kubernetes).

If you have optimized your __getitem__ code and made sure disk/network access is not the culprit, but you still see stalls, you will need to request more CPUs (for a Kubernetes pod) or move your GPU to a machine with more CPUs.

Another option is to reduce batch_size so that each worker has less work to do and finishes preprocessing sooner. This option is undesirable in some cases, because GPU memory is left idle.

You can also do some of the preprocessing offline and take that load off the workers. For example, if each worker reads a wav file and computes a spectrogram for it, you can precompute the spectrograms offline and have the worker simply read the computed spectrogram from disk. That reduces the amount of work each worker has to do.
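
The offline-preprocessing idea could look roughly like this. It is a toy sketch under stated assumptions: `expensive_transform` is a stand-in for the costly step (such as computing a spectrogram), and the one-pickle-per-sample layout is just one possible choice:

```python
import pickle
import tempfile
from pathlib import Path


def expensive_transform(raw):
    """Stand-in for costly preprocessing (e.g. computing a spectrogram)."""
    return [v * v for v in raw]


def precompute(samples, cache_dir):
    """Run the transform once, offline, and store one file per sample."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for idx, raw in enumerate(samples):
        with open(cache_dir / f"{idx}.pkl", "wb") as f:
            pickle.dump(expensive_transform(raw), f)


def load_cached(idx, cache_dir):
    """What __getitem__ would do at training time: just read the result."""
    with open(Path(cache_dir) / f"{idx}.pkl", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    precompute([[1, 2], [3, 4]], tmp)
    first = load_cached(0, tmp)
    second = load_cached(1, tmp)
```

At training time each worker then pays only for a file read instead of the full transform.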

Met the same problem with horovod.

Met a similar problem... The deadlock happened when an epoch finished and data loading for validation started...

@jinhou @jackroos I got random hangs at the start of validation with horovod. My current workaround is to set a timeout and skip validation. Is there a solution?

No. I just turn off distributed training in that case.

I ran into a similar problem: the dataloader stops when I finish one epoch and a new epoch starts.

Why does this have so many likes?

Setting num_workers to 0 worked for me. You have to make sure it is 0 everywhere you use it.

Some other potential solutions:

  1. from multiprocessing import set_start_method
    set_start_method('spawn')
  2. cv2.setNumThreads(0)

It seems like 3 to 7 are the way to go.
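
For reference, a minimal sketch of selecting the spawn start method with the standard library only. `set_start_method` is meant to be called once per program, so it sits under the `__main__` guard; `get_context` is an alternative that does not change the global default. The `main` function is a placeholder, not code from this thread:

```python
import multiprocessing as mp


def spawn_context():
    """Return a context whose processes are started with 'spawn', not fork."""
    return mp.get_context("spawn")


def main():
    # Placeholder for the real training entry point.
    pass


if __name__ == "__main__":
    # Must run before any worker process is created.
    mp.set_start_method("spawn")
    main()
```

With spawn, child processes start from a fresh interpreter instead of inheriting the parent's state through fork, which is why it is often suggested when a library is not fork-safe.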

I experienced this problem on pytorch 1.3, ubuntu16. None of the suggestions above worked except workers=0, which slows the run down. It only happens when running from the terminal; inside a Jupyter notebook everything is fine, even with workers=32.

It looks like the problem has not been solved. Should the issue be reopened? I see many other people reporting the same problem...

I modified train.py as follows:

from __future__ import division

import cv2
cv2.setNumThreads(0)

import argparse

...

And it works for me.

Hey guys, in case this helps:
I also had a problem similar to this one, but it only occurred every 100 epochs or so.

I noticed that it only happened with CUDA enabled, and that dmesg had this log entry every time it crashed:

python[11240]: segfault at 10 ip 00007fabdd6c37d8 sp 00007ffddcd64fd0 error 4 in libcudart.so.10.1.243[7fabdd699000+77000]

It is gibberish to me, but it told me that CUDA and python multithreading were not playing well together.

My fix was to disable cuda in the data threads. Here is a snippet from my python entry file:

from multiprocessing import set_start_method
import os

if __name__ == "__main__":
  set_start_method('spawn')
else:
  os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
import application

Hopefully this helps someone who lands here, the way I needed it at the time.

After updating to PyTorch 1.4, I met a similar problem in distributed training, without using OpenCV.
Now I have to run validation once before the training and validation loop.

I've had a lot of trouble with this. It seems to persist across pytorch versions, python versions, and different physical systems (which may well have been set up the same way).

I get the same error every time:

File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/bicep/loops.py", line 73, in __call__
    for data, target in self.dataloader:
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 830, in _next_data
    self._shutdown_workers()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 942, in _shutdown_workers
    w.join()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)

There is clearly something wrong with the way processes are handled on the machines I'm using. Apart from setting num_workers=0, none of the solutions above seem to work.

I would really like to get to the bottom of this. Does anyone have an idea where to start, or how to investigate it?

Same here.

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.

One interesting thing is this:

If I just read the data and split it line by line, the problem does not occur:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

But if I add JSON parsing logic after reading line by line, this error is reported:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break
        return json_data

I understand that JSON has some memory overhead, but even when I reduce the number of workers to 2 and use a very small dataset, I still have the same problem. I suspect it is related to shm. Any clue?
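
One detail in the parsing loop above: the bare `except: break` stops at the first line that fails to parse and silently drops everything after it. That does not explain a segfault, but a variant that skips blank lines and catches only `json.JSONDecodeError` makes bad input visible while debugging. A sketch, with `parse_json_lines` as a made-up name:

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON, skipping blanks and collecting bad lines."""
    records, bad_lines = [], []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # e.g. the empty string after the final newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(lineno)  # keep going instead of stopping silently
    return records, bad_lines


records, bad_lines = parse_json_lines('{"a": 1}\nnot json\n{"a": 2}\n')
```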

Should we reopen this issue?

We should. BTW, I did some GDB debugging and found nothing, so I'm not sure whether it is a shared memory problem.

(gdb) run
Starting program: /home/miniconda/bin/python performance.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa60a6700 (LWP 61963)]
[New Thread 0x7fffa58a5700 (LWP 61964)]
[New Thread 0x7fffa10a4700 (LWP 61965)]
[New Thread 0x7fff9e8a3700 (LWP 61966)]
[New Thread 0x7fff9c0a2700 (LWP 61967)]
[New Thread 0x7fff998a1700 (LWP 61968)]
[New Thread 0x7fff970a0700 (LWP 61969)]
[New Thread 0x7fff9489f700 (LWP 61970)]
[New Thread 0x7fff9409e700 (LWP 61971)]
[New Thread 0x7fff8f89d700 (LWP 61972)]
[New Thread 0x7fff8d09c700 (LWP 61973)]
[New Thread 0x7fff8a89b700 (LWP 61974)]
[New Thread 0x7fff8809a700 (LWP 61975)]
[New Thread 0x7fff85899700 (LWP 61976)]
[New Thread 0x7fff83098700 (LWP 61977)]
[New Thread 0x7fff80897700 (LWP 61978)]
[New Thread 0x7fff7e096700 (LWP 61979)]
[New Thread 0x7fff7d895700 (LWP 61980)]
[New Thread 0x7fff7b094700 (LWP 61981)]
[New Thread 0x7fff78893700 (LWP 61982)]
[New Thread 0x7fff74092700 (LWP 61983)]
[New Thread 0x7fff71891700 (LWP 61984)]
[New Thread 0x7fff6f090700 (LWP 61985)]
[Thread 0x7fff7e096700 (LWP 61979) exited]
[Thread 0x7fff6f090700 (LWP 61985) exited]
[Thread 0x7fff74092700 (LWP 61983) exited]
[Thread 0x7fff7b094700 (LWP 61981) exited]
[Thread 0x7fff80897700 (LWP 61978) exited]
[Thread 0x7fff83098700 (LWP 61977) exited]
[Thread 0x7fff85899700 (LWP 61976) exited]
[Thread 0x7fff8809a700 (LWP 61975) exited]
[Thread 0x7fff8a89b700 (LWP 61974) exited]
[Thread 0x7fff8d09c700 (LWP 61973) exited]
[Thread 0x7fff8f89d700 (LWP 61972) exited]
[Thread 0x7fff9409e700 (LWP 61971) exited]
[Thread 0x7fff9489f700 (LWP 61970) exited]
[Thread 0x7fff970a0700 (LWP 61969) exited]
[Thread 0x7fff998a1700 (LWP 61968) exited]
[Thread 0x7fff9c0a2700 (LWP 61967) exited]
[Thread 0x7fff9e8a3700 (LWP 61966) exited]
[Thread 0x7fffa10a4700 (LWP 61965) exited]
[Thread 0x7fffa58a5700 (LWP 61964) exited]
[Thread 0x7fffa60a6700 (LWP 61963) exited]
[Thread 0x7fff71891700 (LWP 61984) exited]
[Thread 0x7fff78893700 (LWP 61982) exited]
[Thread 0x7fff7d895700 (LWP 61980) exited]
total_files = 5040.  //customer comments
[New Thread 0x7fff6f090700 (LWP 62006)]
[New Thread 0x7fff71891700 (LWP 62007)]
[New Thread 0x7fff74092700 (LWP 62008)]
[New Thread 0x7fff78893700 (LWP 62009)]
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 62005) is killed by signal: Segmentation fault.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "performance.py", line 62, in <module>
    main()
  File "performance.py", line 48, in main
    for i,batch in enumerate(rl_data_loader):
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 62005) exited unexpectedly
[Thread 0x7fff78893700 (LWP 62009) exited]
[Thread 0x7fff74092700 (LWP 62008) exited]
[Thread 0x7fff71891700 (LWP 62007) exited]
[Thread 0x7fff6f090700 (LWP 62006) exited]
[Inferior 1 (process 61952) exited with code 01]
(gdb) backtrace
No stack.

And I think there is enough shared memory. At the very least I would expect it to last for quite a while before a segfault, but the segmentation fault happens right after the data loader starts working.

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

Hi @soumith @apaszke, can this issue be reopened? I have tried all the proposed solutions, such as increasing the shm size and the number of segments, but nothing works. I'm not using opencv or anything like it, just simple JSON parsing, and I still have the problem. I don't think it is related to shm, since I checked that all memory is open for use as shared memory. The stack trace shows nothing either, as posted above.

@apaszke, regarding your suggestion

"Yes, most of these problems are caused by third-party libraries not being fork-safe. One alternative could be to use the spawn start method."

I'm using multiple data loader workers. How do I change the start method? I'm calling set_start_method('spawn') in main.py, but it does not seem to help.

I also have a general question for the case where I enable a multi-worker (multi-process) data loader and also start multiple processes in the main training, as suggested in https://pytorch.org/docs/stable/notes/multiprocessing.html.

How does pytorch manage the data loader processes and the main training processes? Do they share all the available processes/threads on a multi-core GPU? And is the shared memory for multiple processes "shared" between the data loader and the main training processes? Also, if I have some data cooking work such as JSON parsing, CSV parsing, pandas feature extraction, etc., where is the best place to put it? In the data loader, so that it produces fully prepared data, or in the main training, so that the data loader's __getitem__ stays as simple as possible, as suggested above?

@zhangruiskyline Your problem is not actually a deadlock. It is about workers being killed by a segfault. A sigbus is what would suggest a shm problem. You should check your dataset code and debug there.

To answer your other questions:

  1. Using the kwarg multiprocessing_context='spawn' in DataLoader sets spawn. So does set_start_method.
  2. In multi-process training, each process generally has its own DataLoader and therefore its own DataLoader workers. Nothing is shared across processes unless that is done explicitly.

@SsnL Thanks. I tried multiprocessing_context='spawn', but it fails in the same way.

As I pointed out earlier in the thread, my code is very simple:

  • This piece of code works
        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))
  • But if I add JSON parsing logic after reading line by line, this error is reported
        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break
        return json_data

So I doubt that it is a problem in my code. I also tried splitting the strings directly instead of using JSON parsing, with the same result. The problem seems to occur whenever the data loader contains data processing logic that takes some time.

Also, regarding

"multi-process training, each process has its own DataLoader and therefore its own DataLoader workers. Nothing is shared across processes unless that is done explicitly."

So say I have 4 training processes, each with an 8-worker data loader. Does that make 32 processes in total underneath?

@zhangruiskyline Without a self-contained script that reproduces the problem, we cannot help you. Yes, there will be 32 processes.

Thanks. I have also seen similar issues:
https://github.com/pytorch/pytorch/issues/4969
https://github.com/pytorch/pytorch/issues/5040

Both are closed, but I don't see a clear solution or fix. Is this still a widespread problem?

I'll see whether I can provide a self-contained repro script. It is heavily integrated with our platform and data sources, but I'll try.

@zhangruiskyline The problem you are hitting is not similar to the linked issues. Those were closed because the original/most common problems reported in those threads have already been addressed.

@SsnL Thanks. I'm not familiar with Pytorch so I may be wrong, but I have gone through all of them, and some seem to have been resolved by:

  • Reducing the number of workers to 0. That is not acceptable because it is too slow.

  • Increasing the shm size. But I think I have enough shm: the problem occurs almost immediately after start, and I get the same problem with a much smaller dataset.

  • Some libs such as opencv do not work well with multiple processes. We only use JSON/CSV, so nothing really fancy.

Our code is very simple. The training dataset has more than 10,000 files, and each file contains multiple lines of JSON strings. In the data loader we define __getitem__ to take one of those 10,000+ files and read all of its content.

In solution 1, we first read the file and split it line by line into a list of JSON strings. If we return right away, it works and performance is good:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))
            return all_data

Now, since the returned values are still JSON strings and we want to use the multi-process data loader to speed things up, we put the JSON parsing logic here, and it fails:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break
        return json_data

Later we thought that JSON parsing was too heavy and used too much memory, so we chose to parse the JSON strings and convert them into feature lists by hand, but it fails in the same way. I did some stack trace analysis, but there was nothing there.

BTW, we are running the code in a Linux Docker env with a 24-core CPU and 1 V100.

I'm not sure where to start investigating next. Do you have any ideas?

Hi,

I found an interesting comment in https://github.com/open-mmlab/mmcv, which is used by https://github.com/open-mmlab/mmdetection.

The following code is used at the start of both the train epoch and the val epoch:
time.sleep(2)  # Prevent possible deadlock during epoch transition

https://github.com/open-mmlab/mmcv/blob/1cb3e36a1ea33caf272d2365c7d406123122b8d0/mmcv/runner/epoch_based_runner.py#L26

๋‹น์‹ ์ด ๊ทธ๊ฒƒ์„ ์‹œ๋„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

BTW, ๋‹ค์ค‘ ์ž‘์—…์ž ๋ฐ์ดํ„ฐ ๋กœ๋”๊ฐ€ ์žˆ๋Š” ๊ฐ ํ”„๋กœ์„ธ์Šค ๋ฐ ๋‹ค์ค‘ ํ”„๋กœ์„ธ์Šค๋กœ ์ด๋™ํ•˜๋ฉด ํ•ด๋‹น ๋ฐ์ดํ„ฐ ๋กœ๋”๊ฐ€ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค์˜ ๋ฐ์ดํ„ฐ ๋กœ๋”์™€ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์ง€ ์•Š๋„๋ก ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค๊ฐ€ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ๊ฐ€์š”? ti๋Š” ์ด๋ฏธ pytorch ๋ฐ์ดํ„ฐ ๋กœ๋” __get_item__ ์˜ํ•ด ์ฒ˜๋ฆฌ๋ฉ๋‹ˆ๊นŒ?

Hi @SsnL , thanks for your help. I'd like to follow up on this thread. We refactored our training code to use PyTorch multiprocessing to speed up some of the data processing on the CPU side (to feed the GPU faster): https://pytorch.org/docs/stable/notes/multiprocessing.html#multiprocessing-best-practices

In each process function we also use a multi-worker data loader to cut the data loading time. https://pytorch.org/docs/stable/data.html

I moved my heavy CPU-side JSON parsing into the main training process instead of the data loader, and the problem seems to have gone away. I don't know why, but it appears to work. I do have a follow-up question, though. Suppose there are N processes and each of them has M loader workers, so there are N x M workers in total underneath.

If I want to fetch all the data by index through my data loader, can __getitem__(self, idx) in the M loader workers of the N different processes cooperate so that each one handles different indices, without duplicating any index or missing some?

I had the same problem: at the start of a new training or validation epoch, the data loader crashed after complaining that it could not allocate memory. None of the solutions above worked for me: (i) my /dev/shm is 32GB and never had more than 2.5GB in use, and (ii) setting pin_memory=False made no difference.

Could this be related to garbage collection? My code looks roughly like the following. I need an infinite iterator, hence the try / except around next() below :-)

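def train():
    # Create the iterator explicitly so it can be deleted at the end.
    train_iter = iter(train_loader)
    for i in range(max_batches):
        try:
            x, y = next(train_iter)
        except StopIteration:
            # Loader exhausted: restart it and fetch the first batch again.
            train_iter = iter(train_loader)
            x, y = next(train_iter)
        ...
    del train_iter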

train_loader is a DataLoader object. Without the explicit del train_iter line at the end of the function, the process always crashes after 2-3 epochs ( /dev/shm still shows 2.5GB). Hope this helps!

I am using 4 workers (version 0.1.12_2 with CUDA 8.0 on Ubuntu 16.04).

์ด๊ฒƒ์€ ๋ช‡ ์ฃผ ๋™์•ˆ ๊ณ ๊ตฐ๋ถ„ํˆฌ ํ•œ ํ›„ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ์Šต๋‹ˆ๋‹ค. ๋กœ๋”๋ฅผ ์ง์ ‘ ๋ฐ˜๋ณตํ•˜๋Š” ๋Œ€์‹  ๋กœ๋” ๋ฐ˜๋ณต์ž๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋ฉฐ ์—ํฌํฌ๊ฐ€ ๋๋‚  ๋•Œ del loader_iterator ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋งˆ์นจ๋‚ด ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ–ˆ์Šต๋‹ˆ๋‹ค.

๋‚˜๋Š” ๊ฐ™์€ ๋ฌธ์ œ์— ์ง๋ฉดํ•˜๊ณ  ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค. 8๊ฐœ์˜ ๋ฐ์ดํ„ฐ ๋กœ๋”(MNIST, MNISTM, SVHN, USPS, ๊ฐ๊ฐ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ์šฉ)๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. 6(๋ชจ๋“  6)์„ ์‚ฌ์šฉํ•˜๋ฉด ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. 8์„ ์‚ฌ์šฉํ•˜๋ฉด 6๋ฒˆ์งธ MNIST-M ํ…Œ์ŠคํŠธ๋ฅผ ๋กœ๋“œํ•  ๋•Œ ํ•ญ์ƒ ์ฐจ๋‹จ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰์„ ์‹œ๋„ํ•˜๊ณ , ์‹คํŒจํ•˜๊ณ , ์กฐ๊ธˆ ๊ธฐ๋‹ค๋ ธ๋‹ค๊ฐ€ ๋‹ค์‹œ ์‹œ๋„ํ•˜๋Š” ๋์—†๋Š” ๋ฃจํ”„์— ๊ฐ‡ํ˜€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋“  batch_size์— ๋Œ€ํ•ด ์˜ค๋ฅ˜๊ฐ€ ์ง€์†๋˜๊ณ  ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋งŽ์ด ๋‚จ์•„ ์žˆ์œผ๋ฉฐ num_workers๋ฅผ 0์œผ๋กœ ์„ค์ •ํ•œ ๊ฒฝ์šฐ์—๋งŒ ์‚ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์–‘์ด ๋ฌธ์ œ์˜ ์›์ธ์ž…๋‹ˆ๋‹ค.

I got a hint from https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method .
After adding cv2.setNumThreads(0) it works fine.
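One possible way to apply this hint is to call cv2.setNumThreads(0) inside every loader worker through the DataLoader's worker_init_fn argument, rather than only at the top of the script. The sketch below assumes OpenCV may or may not be installed; disable_opencv_threads is a hypothetical helper name.

```python
def disable_opencv_threads(worker_id=0):
    """Turn off OpenCV's internal threading in the calling process.

    Returns True if OpenCV was found and configured, False otherwise.
    """
    try:
        import cv2
    except ImportError:
        return False  # OpenCV is not installed, so there is nothing to do
    cv2.setNumThreads(0)
    return True


if __name__ == "__main__":
    print(disable_opencv_threads())
```

With PyTorch this would be passed as DataLoader(dataset, num_workers=4, worker_init_fn=disable_opencv_threads), where dataset is a placeholder; the loader calls the function with the worker id in each worker process and ignores the return value.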

Hi, I had the same problem. It turned out to be related to ulimit -n , and simply increasing it fixed the issue. I used ulimit -n 500000.

@SebastienEske ulimit -n fixed this problem for me on Ubuntu 20.04 as well.
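For setups where changing the shell's ulimit is inconvenient, the same open-file limit can be inspected and raised from inside the Python process with the standard resource module (Unix only). This is a sketch, not something the commenters describe doing: raise_open_file_limit is a hypothetical helper, and the value 500000 is simply the one quoted above. An unprivileged process cannot go beyond its hard limit, so the target is capped.

```python
import resource


def raise_open_file_limit(target=500000):
    """Raise the soft RLIMIT_NOFILE toward `target`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at least as high as requested
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused the change; keep the old limit
    return target


if __name__ == "__main__":
    print("soft limit is now", raise_open_file_limit())
```

This has to run in the main process before the DataLoader workers are started, since child processes inherit the limit at creation time.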

์•„๋งˆ๋„ set ulimit -n ์ด ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•์ผ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๊ต์ฐฉ ์ƒํƒœ๊ฐ€ ์ ์  ๋” ์ž์ฃผ ๋ฐœ์ƒํ•˜๊ณ  cv2.setNumThreads(0) ํ…Œ์ŠคํŠธ๋„ ์ˆ˜ํ–‰ํ•˜์ง€๋งŒ ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

๊ธฐ๋ก์„ ์œ„ํ•ด cv2.setNumThreads(0) ๊ฐ€ ์ €์—๊ฒŒ ํšจ๊ณผ์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰