The bug is described in pytorch/examples#148. The sample code looks fine to me, so I suspect it's a bug in PyTorch itself. Also, this may be related to #1120.
How much free memory do you have when the loader stops?
@apaszke Checking with `top`, the remaining memory (counting cached memory as used) is usually 2GB. If cached memory is not counted as used, it's always much more, e.g. over 30GB.
Also, it always stops at the start of validation, and never anywhere else.
Probably a separate loader is used for validation, and its shared memory usage pushes things over the limit.
@ngimel
I ran the program once more, and it got stuck.
`top` output:
~~~
top - 17:51 up 182 days, 21:05,  2 users,  load average: 0.49, 3.00, 5.41
Tasks: 357 total,   2 running, 355 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.1 sy,  0.7 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  65863816 total, 60115084 used,  5748732 free,  1372688 buffers
KiB Swap:  5917692 total,      620 used,  5917072 free. 51154784 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3067 aalreja   20   0  143332 101816  21300 R  46.1  0.2   1631:44 Xvnc
16613 aalreja   30  10   32836   4880   3912 S  16.9  0.0   1:06.92 fiberlamp
 3221 aalreja   20   0 8882348 1.017g 110120 S   1.3  1.6 579:06.87 MATLAB
 1285 root      20   0 1404848  48252  25580 S   0.3  0.1   6:00.12 dockerd
16597 yimengz+  20   0   25084   3252   2572 R   0.3  0.0   0:04.56 top
    1 root      20   0   33616   4008   2624 S   0.0  0.0   0:01.43 init
~~~
`free` output:
~~~
yimengzh_everyday@yimengzh:~$ free
             total       used       free     shared    buffers     cached
Mem:      65863816   60122060    5741756    9954628    1372688   51154916
-/+ buffers/cache:    7594465   58269360
Swap:      5917692        620    5917072
~~~
`nvidia-smi` output:
~~~
yimengzh_everyday@yimengzh:~$ nvidia-smi
Tue Apr 25 17:52:38 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 30%   42C    P8    14W / 250W |   3986MiB /  6082MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:81:00.0     Off |                  Off |
|  0%   46C    P0    57W / 235W |      0MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16509    C   python                                        3970MiB |
+-----------------------------------------------------------------------------+
~~~
I don't think it's a memory issue.
Shared memory has separate limits. Can you try `ipcs -lm` or `cat /proc/sys/kernel/shmall` and `cat /proc/sys/kernel/shmmax`? Also, does the deadlock also occur when you use fewer workers (e.g. testing the extreme case of a single worker)?
@apaszke
~~~
yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmall
18446744073692774399
yimengzh_everyday@yimengzh:~$ cat /proc/sys/kernel/shmmax
18446744073692774399
~~~
How do these look to you?
As for using fewer workers, I think the hang happens less often that way (I can try it now), but I do actually need this many workers.
You're allowed a maximum of 4096 shared memory segments, so that could be the problem. You can try increasing that limit by writing to `/proc/sys/kernel/shmmni` (maybe try 8192). You may need superuser privileges.
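For reference, a minimal sketch of that change (8192 is just the suggested trial value; requires root):

```shell
# Raise the maximum number of System V shared memory segments (needs root).
echo 8192 | sudo tee /proc/sys/kernel/shmmni
# Verify the new limit:
cat /proc/sys/kernel/shmmni
```

To make it survive a reboot, the equivalent `kernel.shmmni = 8192` line can go in `/etc/sysctl.conf`.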
@apaszke Also, is that the Ubuntu default setting? And while the training program is running, `ipcs -a` actually shows no shared memory in use. Is that expected?
@apaszke Yes:
~~~
yimengzh_everyday@yimengzh:~$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1
~~~
I didn't try a single worker. First, it would be slow; second, if the problem really is a deadlock, it would certainly disappear with one worker.
@zym1010 Fair point, the default settings probably weren't chosen with this kind of workload in mind, so it could have been the problem. `ipcs` is for System V shared memory, which we don't use, but I wanted to make sure the same limits don't apply to POSIX shared memory.
If the problem does exist, the deadlock is probably happening between the workers and the main process, and a single worker could still trigger it, so it wouldn't reliably disappear. Anyway, I can't fix the problem until I can reproduce it. What parameters are you using to run the example, and did you modify the code in any way? Also, what is the value of `torch.__version__`? Are you running in Docker?
@apaszke Thanks. I understand your analysis now.
All the results shown so far were obtained on an Ubuntu 14.04 machine with 64GB RAM, dual Xeons, and a Titan Black (there's also a K40, but I didn't use it).
The command that triggers the problem is `CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 22 --batch-size 256 /mnt/temp_drive_3/cv_datasets/ILSVRC2015/Data/CLS-LOC`. I didn't modify the code at all.
I installed pytorch via pip on Python 3.5. The pytorch version is `0.1.11_5`. It is not running in Docker.
By the way, I did try a single worker, but on a different machine (128GB RAM, dual Xeons, 4 Pascal Titan X, CentOS 6). I ran it with `CUDA_VISIBLE_DEVICES=0 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 1 --lr 0.01 --workers 1 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC`, and the error log is as follows.
~~~
Epoch: [0][5003/5005] Time 2.463 (2.955) Data 2.414 (2.903) Loss 5.9677 (6.6311) Prec@1 3.516 (0.545) Prec@5 8.594 (2.262)
Epoch: [0][5004/5005] Time 1.977 (2.955) Data 1.303 (2.903) Loss 5.9529 (6.6310) Prec@1 1.399 (0.545) Prec@5 7.692 (2.262)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
~~~
`top` showed the following when it got stuck with a single worker:
~~~
top - 08:34:33 up 15 days, 20:03,  0 users,  load average: 0.37, 0.39, 0.36
Tasks: 894 total,   1 running, 892 sleeping,   0 stopped,   1 zombie
%Cpu(s):  7.2 us,  2.8 sy,  0.0 ni, 89.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  132196824 total, 131461528 used,   735296 free,   347448 buffers
KiB Swap:   2047996 total,    22656 used,  2025340 free. 125226796 cached Mem
~~~
One more thing I found: if I modify the training code so that it doesn't go through all the batches, e.g. training on only 50 batches with

~~~
if i >= 50:
    break
~~~

then the deadlock seems to go away.
Your tests seem to suggest that the freeze happens much more often if you run the program right after rebooting the computer, and less often once the computer has things cached.
I tried, but I couldn't reproduce this bug.
I ran into a similar problem: the data loader stops when an epoch ends and a new epoch begins.
Setting num_workers=0 works, but the program slows down.
@apaszke Did you try rebooting the computer first and then running the program? For me, that guarantees the freeze. I tried version 0.12 and it's still the same.
One thing I'd point out is that I have numpy installed linked against OpenBLAS, and pytorch installed with `pip` using MKL from @soumith's anaconda cloud.
So basically pytorch is using MKL and numpy is using OpenBLAS. That may not be ideal, but I think it should have nothing to do with the problem here.
I looked into it but couldn't reproduce. MKL/OpenBLAS should be unrelated to this problem; it's probably an issue with the system configuration.
@apaszke Thanks. I tried Python from the official anaconda repo plus an MKL-based pytorch. Still the same problem.
Tried running the code in Docker. Still stuck.
Same issue here. I was running the pytorch/examples imagenet training example (resnet18, 4 workers) inside nvidia-docker, using one GPU out of four, and tried to attach gdb to the processes to collect backtraces.
At least OpenBLAS is known to have a deadlock issue in matrix multiplication that occurs relatively rarely: https:
@jsainio I tried a pure MKL-based PyTorch (numpy is linked against MKL as well), and the same problem occurs.
Also, using `pin_memory` with the data loader works around the problem (at least for me).
It looks like two of the workers died.
During normal operation:
~~~
root@b06f896d5c1d:~/mnt# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user+ 1 33.2 4.7 91492324 3098288 ? Ssl 10:51 1:10 python -m runne
user+ 58 76.8 2.3 91079060 1547512 ? Rl 10:54 1:03 python -m runne
user+ 59 76.0 2.2 91006896 1484536 ? Rl 10:54 1:02 python -m runne
user+ 60 76.4 2.3 91099448 1559992 ? Rl 10:54 1:02 python -m runne
user+ 61 79.4 2.2 91008344 1465292 ? Rl 10:54 1:05 python -m runne
~~~
After the deadlock:
~~~
root@b06f896d5c1d:~/mnt# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user+ 1 24.8 4.4 91509728 2919744 ? Ssl 14:25 13:01 python -m runne
user+ 58 51.7 0.0 0 0 ? Z 14:27 26:20 [python] <defun
user+ 59 52.1 0.0 0 0 ? Z 14:27 26:34 [python] <defun
user+ 60 52.0 2.4 91147008 1604628 ? Sl 14:27 26:31 python -m runne
user+ 61 52.0 2.3 91128424 1532088 ? Sl 14:27 26:29 python -m runne
~~~
For one of the workers that is still alive, the start of the gdb stack trace looks like this:
root@b06f896d5c1d:~/mnt# gdb --pid 60
GNU gdb (GDB) 8.0
Attaching to process 60
[New LWP 65]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f36f52af827 in do_futex_wait.constprop ()
from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007f36f52af827 in do_futex_wait.constprop ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f36f52af8d4 in __new_sem_wait_slow.constprop.0 ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f36f52af97a in sem_wait@@GLIBC_2.2.5 ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f36f157efb1 in semlock_acquire (self=0x7f3656296458,
args=<optimized out>, kwds=<optimized out>)
at /home/ilan/minonda/conda-bld/work/Python-3.5.2/Modules/_multiprocessing/semaphore.c:307
#4 0x00007f36f5579621 in PyCFunction_Call (func=
<built-in method __enter__ of _multiprocessing.SemLock object at remote 0x7f3656296458>, args=(), kwds=<optimized out>) at Objects/methodobject.c:98
#5 0x00007f36f5600bd5 in call_function (oparg=<optimized out>,
pp_stack=0x7f36c7ffbdb8) at Python/ceval.c:4705
#6 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
at Python/ceval.c:3236
#7 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0,
closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#8 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#9 0x00007f36f5557542 in function_call (
func=<function at remote 0x7f36561c7d08>,
arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
at Objects/funcobject.c:627
#10 0x00007f36f5524236 in PyObject_Call (
func=<function at remote 0x7f36561c7d08>, arg=<optimized out>,
kw=<optimized out>) at Objects/abstract.c:2165
#11 0x00007f36f554077c in method_call (
func=<function at remote 0x7f36561c7d08>,
arg=(<Lock(release=<built-in method release of _multiprocessing.SemLock object at remote 0x7f3656296458>, acquire=<built-in method acquire of _multiprocessing.SemLock object at remote 0x7f3656296458>, _semlock=<_multiprocessing.SemLock at remote 0x7f3656296458>) at remote 0x7f3656296438>,), kw=0x0)
at Objects/classobject.c:330
#12 0x00007f36f5524236 in PyObject_Call (
func=<method at remote 0x7f36556f9248>, arg=<optimized out>,
kw=<optimized out>) at Objects/abstract.c:2165
#13 0x00007f36f55277d9 in PyObject_CallFunctionObjArgs (
callable=<method at remote 0x7f36556f9248>) at Objects/abstract.c:2445
#14 0x00007f36f55fc3a9 in PyEval_EvalFrameEx (f=<optimized out>,
throwflag=<optimized out>) at Python/ceval.c:3107
#15 0x00007f36f5601166 in fast_function (nk=<optimized out>, na=1,
n=<optimized out>, pp_stack=0x7f36c7ffc418,
func=<function at remote 0x7f36561c78c8>) at Python/ceval.c:4803
#16 call_function (oparg=<optimized out>, pp_stack=0x7f36c7ffc418)
at Python/ceval.c:4730
#17 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
at Python/ceval.c:3236
#18 0x00007f36f5601b49 in _PyEval_EvalCodeWithName (_co=<optimized out>,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=4, kws=0x7f36f5b85060, kwcount=0, defs=0x0, defcount=0,
kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4018
#19 0x00007f36f5601cd8 in PyEval_EvalCodeEx (_co=<optimized out>,
globals=<optimized out>, locals=<optimized out>, args=<optimized out>,
argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0,
defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4039
#20 0x00007f36f5557661 in function_call (
func=<function at remote 0x7f36e14170d0>,
arg=(<ImageFolder(class_to_idx={'n04153751': 783, 'n02051845': 144, 'n03461385': 582, 'n04350905': 834, 'n02105056': 224, 'n02112137': 260, 'n03938244': 721, 'n01739381': 59, 'n01797886': 82, 'n04286575': 818, 'n02113978': 268, 'n03998194': 741, 'n15075141': 999, 'n03594945': 609, 'n04099969': 765, 'n02002724': 128, 'n03131574': 520, 'n07697537': 934, 'n04380533': 846, 'n02114712': 271, 'n01631663': 27, 'n04259630': 808, 'n04326547': 825, 'n02480855': 366, 'n02099429': 206, 'n03590841': 607, 'n02497673': 383, 'n09332890': 975, 'n02643566': 396, 'n03658185': 623, 'n04090263': 764, 'n03404251': 568, 'n03627232': 616, 'n01534433': 13, 'n04476259': 868, 'n03495258': 594, 'n04579145': 901, 'n04266014': 812, 'n01665541': 34, 'n09472597': 980, 'n02095570': 189, 'n02089867': 166, 'n02009229': 131, 'n02094433': 187, 'n04154565': 784, 'n02107312': 237, 'n04372370': 844, 'n02489166': 376, 'n03482405': 588, 'n04040759': 753, 'n01774750': 76, 'n01614925': 22, 'n01855032': 98, 'n03903868': 708, 'n02422699': 352, 'n01560419': 1...(truncated), kw={}) at Objects/funcobject.c:627
#21 0x00007f36f5524236 in PyObject_Call (
func=<function at remote 0x7f36e14170d0>, arg=<optimized out>,
kw=<optimized out>) at Objects/abstract.c:2165
#22 0x00007f36f55fe234 in ext_do_call (nk=1444355432, na=0,
flags=<optimized out>, pp_stack=0x7f36c7ffc768,
func=<function at remote 0x7f36e14170d0>) at Python/ceval.c:5034
#23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>)
at Python/ceval.c:3275
--snip--
I have a similar error log, with the main process stuck at `self.data_queue.get()`.
For me the problem was that I was using opencv as the image loader, and the cv2.imread function would hang indefinitely, without any error, on one particular image of imagenet ("n01630670/n01630670_1010.jpeg").
If you say it works with num_workers=0, then that's not it. But I thought it might help some people who see a similar error trace.
I'm currently running a test with num_workers=0, and it hasn't hung yet. I'm running the sample code from https://github.com/pytorch/examples/blob/master/imagenet/main.py. The `pytorch/vision` ImageFolder appears to use `PIL` or `pytorch/accimage` internally to load images, so no OpenCV is involved.
With `num_workers = 4`, I can sometimes get through the first epoch of training and validation completely, and it then locks up in the middle of the second epoch. So the problem is unlikely to be in the dataset/loading function.
This looks like a race condition in ImageLoader that is triggered relatively rarely by a particular hardware/software combination.
@zym1010 Thanks for the pointer, I'll try setting `pin_memory = False`.
Interesting. On my setup, with `pin_memory = False` and `num_workers = 4`, the imagenet example hangs almost immediately and three of the workers end up as zombie processes:
~~~
root@034c4212d022:~/mnt# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
user+ 1 6.7 2.8 92167056 1876612 ? Ssl 13:50 0:36 python -m runner
user+ 38 1.9 0.0 0 0 ? Z 13:51 0:08 [python] <defunct>
user+ 39 4.3 2.3 91069804 1550736 ? Sl 13:51 0:19 python -m runner
user+ 40 2.0 0.0 0 0 ? Z 13:51 0:09 [python] <defunct>
user+ 41 4.1 0.0 0 0 ? Z 13:51 0:18 [python] <defunct>
~~~
On my setup, the dataset is read over NFS from a network disk. With `pin_memory = False` and `num_workers = 4`, I can get the system to fail fairly quickly:
=> creating model 'resnet18'
- training epoch 0
Epoch: [0][0/5005] Time 10.713 (10.713) Data 4.619 (4.619) Loss 6.9555 (6.9555) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
Traceback (most recent call last):
--snip--
imagenet_pytorch.main.main([data_dir, "--transient_dir", context.transient_dir])
File "/home/user/mnt/imagenet_pytorch/main.py", line 140, in main
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/user/mnt/imagenet_pytorch/main.py", line 168, in train
for i, (input, target) in enumerate(train_loader):
File "/home/user/anaconda/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 206, in __next__
idx, batch = self.data_queue.get()
File "/home/user/anaconda/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "/home/user/anaconda/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
fd = df.detach()
File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/user/anaconda/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/user/anaconda/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
@zym1010 Is your data on a network disk or a conventional spinning disk, where latency etc. could be slow?
@jsainio I'm using a local SSD on a compute node of a cluster. The code sits on an NFS drive, but the data is on the local SSD, for maximum loading speed. I've never tried loading the data from the NFS drive.
@zym1010 Thanks for the info. I'm also running on a compute node of a cluster.
Actually, while trying the `num_workers = 4` variations, I'm running a num_workers=0 experiment on the same node at the same time. The first experiment may be generating enough load that the race condition shows up faster in the latter.
@apaszke When you tried to reproduce this earlier, did you try running two instances side by side, or with other significant load on the system?
@jsainio Thanks for looking into this! It's strange; the workers should only exit together, once the main process has finished reading the data. Could you look into why, or how, they exit prematurely? Maybe check the kernel log (`dmesg`)?
No, I never tried that, but IIRC it appeared even without other load.
@apaszke OK. Good to know the workers aren't supposed to exit.
I've tried, but I don't know a good way to check why they exit. `dmesg` shows nothing relevant. (I'm running in a Docker image derived from Ubuntu 16.04, using Anaconda packages.)
One way would be to add a number of prints inside the worker loop. I don't know why they exit silently. It's probably not an exception, because that would be printed to stderr; they either break out of the loop or get killed by the OS (perhaps by a signal).
@jsainio Just to check, are you running docker with --ipc=host? (You don't mention it.) Can you check the size of your shared memory segment (`df -h | grep shm`)?
@ngimel I'm using `--shm-size=1024m`, and `df -h | grep shm` reports accordingly:
~~~
root@db92462e8c19:~/mnt# df -h | grep shm
shm 1.0G 883M 142M 87% /dev/shm
~~~
That usage looks quite high. This was while two of the workers were zombies, inside Docker.
Try making the shm size bigger. When I checked the server where we tried to reproduce this problem, it was 16GB. Either change the docker flag or run `mount -o remount,size=8G /dev/shm`.
I tried reducing the size to 512MB, and I got a clear error instead of a deadlock. Still can't reproduce.
With docker there seems to be a tendency to get deadlocks instead of clear error messages when shm is insufficient, I don't know why. But it usually goes away when you increase shm (I was getting deadlocks with 1G).
I see. It seemed to error out with 10 workers, but with 4 workers I get a deadlock at 58% usage of /dev/shm. I'll keep trying to reproduce.
It's great that you can reproduce some form of this issue. I posted a script that triggered hangs in #1579, and got replies that it didn't hang on other systems. I was actually testing it on a MacBook; I tried Linux and it didn't hang. So if you've only tried it on Linux, it may be worth trying it on a Mac too.
OK, so after investigating, this looks like a weird problem. Even if I limit the /dev/shm size to 128MB, Linux happily lets us create 147MB files there and mmap them fully in memory, but it sends a fatal SIGBUS to the worker once it actually tries to access the pages... I can't think of a mechanism to check the validity of the pages other than registering a SIGBUS handler, iterating over the pages, and touching each one...
For now, the workaround is to expand `/dev/shm` with the `mount` command as shown above. Try 16GB (if you have enough RAM, that's probably fine).
It's hard to find mentions of this behavior, but here is one.
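Concretely, the remount workaround looks like this (16G is just an example size; pick one that fits your RAM):

```shell
# Grow the tmpfs backing /dev/shm in place, without unmounting it (needs root).
sudo mount -o remount,size=16G /dev/shm
# Confirm the new size:
df -h /dev/shm
```

Note this only lasts until reboot; a permanent size would go in the `/dev/shm` entry in `/etc/fstab`.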
Thanks for spending time on this issue! It has been bothering me for a long time. If I understand correctly, I should expand `/dev/shm` to 16G instead of 8G? That makes sense, but when I run `df -h`, I see that all of my RAM is effectively already allocated like this (I have 16G):

~~~
tmpfs 7,8G 393M 7,4G 5% /dev/shm
tmpfs 5,0M 4,0K 5,0M 1% /run/lock
tmpfs 7,8G 0 7,8G 0% /sys/fs/cgroup
tmpfs 1,6G 60K 1,6G 1% /run/user/1001
~~~

This is the output of `df -h` on my laptop. As I understand it, if I have a 16G swap partition, I can mount up to 32G of tmpfs, so expanding `/dev/shm` shouldn't be a problem?
More importantly, I'm confused by the cgroup partition and its purpose, given that it occupies nearly half of my RAM. Apparently it's designed to manage multi-process tasks efficiently, but I don't know what it does or why it's needed. Would it change anything to allocate all the physical RAM to shm (set its size to 16G) and push cgroup into swap (though I suppose both would partly sit in RAM and in swap at the same time)?
@apaszke Thanks! It's great that you found the root cause. Depending on what other load there was on the machine, I've gotten both various "ConnectionReset" errors and deadlocks with docker `--shm-size=1024m`. I'm now testing with `--shm-size=16384m` and 4 workers.
@jsainio ConnectionReset may be caused by the same thing. The processes started to exchange some data, but once shm space ran out, a SIGBUS was sent to the worker, killing it.
@ClementPinard As far as I know, you can make it as large as you want, except that the machine may freeze if you run out of RAM (the kernel can't free that memory). You probably don't need to worry about `/sys/fs/cgroup`. `tmpfs` partitions allocate memory lazily, so as long as usage stays at 0B they cost nothing (including the limit). I don't think using swap is a good idea either, because it would make data loading terribly slow. So try increasing the `shm` size to 12GB and limiting the number of workers (and, as I said, don't use all your RAM for shm!). This is from the kernel docs.
I don't know why the deadlock happens even when `/dev/shm` usage is very low (it happens at 20kB on my machine). Maybe the kernel is overly optimistic, but instead of waiting until everything fills up, it kills the process as soon as it starts using anything from that region.
Tested with 12G and half the workers I had, and it failed :(
It was working fine with the lua torch version (same speed, same number of workers), which makes me wonder whether the problem is `/dev/shm`-related at all, rather than something closer to python multiprocessing...
The weird thing about it (as you said) is that `/dev/shm` never comes close to being full. During the first training epoch it never went above 500Mo. And it never locks up during the first epoch; the train loader never fails on any epoch when I shut down the test. The deadlock only seems to appear when starting a test epoch. So I'd have to track `/dev/shm` when going from train to test; maybe there is a peak in usage during the data loader change.
@ClementPinard It can still fail even with high shared memory, and even without Docker.
If torch == Lua Torch, then it can still be `/dev/shm`-related. Lua Torch can use threads (there is no GIL), so it doesn't need to go through shared memory (everything shares a single address space).
I had the same problem, where the data loader crashes after complaining that it cannot allocate memory at the start of a new training or validation epoch. None of the solutions above worked for me: (i) my `/dev/shm` is 32GB and its usage never exceeded 2.5GB; (ii) setting pin_memory=False didn't help.
Is this perhaps related to garbage collection? My code looks roughly like the following. I need an infinite iterator, hence the try/except around the `next()` below :-)
~~~
def train():
    train_iter = train_loader.__iter__()

    for i in xrange(max_batches):
        try:
            x, y = next(train_iter)
        except StopIteration:
            train_iter = train_loader.__iter__()
        ...
    del train_iter
~~~
`train_loader` is a `DataLoader` object. Without the explicit `del train_iter` at the end of the function, the process always crashes after 2-3 epochs (with `/dev/shm` still showing 2.5 GB). Hope this helps!
I'm using 4 workers (version `0.1.12_2` with CUDA 8.0 on Ubuntu 16.04).
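The same idea (restart on StopIteration plus explicit cleanup) can be packaged in one place with a generator. This is just a sketch; `infinite_batches` is my name, not a PyTorch API, and a plain list stands in for the `DataLoader`:

```python
def infinite_batches(loader, max_batches):
    """Yield up to max_batches items, restarting the loader when it runs dry."""
    it = iter(loader)
    try:
        for _ in range(max_batches):
            try:
                yield next(it)
            except StopIteration:
                it = iter(loader)   # restart for an "infinite" stream
                yield next(it)
    finally:
        del it                      # drop the iterator (and its workers) promptly

# Usage, with a plain list standing in for a DataLoader:
batches = list(infinite_batches([1, 2, 3], max_batches=5))
print(batches)  # [1, 2, 3, 1, 2]
```

The `finally` ensures the iterator reference is released even if the consumer abandons the generator early.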
I also ran into deadlocks, especially when the worker count is large. Is there any possible solution for this problem? My /dev/shm size is 32GB, with cuda 7.5, pytorch 0.1.12 and python 2.7.13. The following is related info collected after it died. It seems memory-related. @apaszke
@zhengyunqq Try `pin_memory=False` if you have it set to `True`. Otherwise, I don't know of a solution.
I also ran into deadlocks when num_workers is large.
For me, the problem was that `index_queue.put` would hang forever if a worker thread died for some reason. One reason a live worker thread dies is an unpickler failure during initialization. In that case, until this Python bugfix landed in master in May 2017, the worker thread would die and an infinite hang ensued. In my case the hang occurred rarely, during the warm-up stage.
Perhaps replacing the `SimpleQueue` used in `DataLoaderIter` with a `Queue` would allow timing out with a proper exception message. UPD: I was wrong, since that bugfix applies to `Queue`, not `SimpleQueue`. It's still a fact that `SimpleQueue` locks up if no worker threads are alive. An easy way to verify this is to replace these lines with `self.workers = []`.
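The timeout idea can be sketched with a multiprocessing `Queue`, whose `get` accepts a timeout (unlike `SimpleQueue.get`). `get_with_timeout` is a hypothetical helper, not PyTorch API:

```python
import multiprocessing
import queue

def get_with_timeout(data_queue, timeout=5.0):
    """Fetch the next item, raising loudly instead of hanging if workers died."""
    try:
        return data_queue.get(timeout=timeout)
    except queue.Empty:
        # multiprocessing.Queue.get raises queue.Empty on timeout
        raise RuntimeError(
            "no data arrived within %.1fs; the worker processes may have died"
            % timeout)

q = multiprocessing.Queue()
q.put((0, "batch-0"))
print(get_with_timeout(q))  # (0, 'batch-0')
```

A loop around this call could also re-check worker liveness (`Process.is_alive()`) between timeouts rather than failing on the first one.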
I have the same problem, and I can't change shm (no permission). Would it be better to use a queue or something else?
Same issue here.
This code freezes and never prints anything. If I set num_workers=0, it works.
~~~
dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)
model.cuda()
for i, batch in enumerate(dataloader):
    print(i)
~~~
If `model.cuda()` is placed after the loop, everything runs fine:
~~~
dataloader = DataLoader(transformed_dataset, batch_size=2, shuffle=True, num_workers=2)
for i, batch in enumerate(dataloader):
    print(i)
model.cuda()
~~~
Does anyone have a solution for this problem?
I ran into a similar problem while training on ImageNet. It hangs consistently at the first iteration of evaluation on a certain server with a certain architecture (it doesn't hang on other servers with the same architecture, or on the same server with different architectures), and always at the first iteration of validation. I know nccl can cause deadlocks like this when used with torch; is there a way to turn it off?
I'm facing the same problem; it randomly gets stuck at the start of the first epoch. None of the workarounds above work for me. When I press Ctrl-C, it outputs the following:
Traceback (most recent call last):
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 44, in _worker_loop
data_queue.put((idx, samples))
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/queues.py", line 354, in put
self._writer.send_bytes(obj)
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
self._send(buf)
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
KeyboardInterrupt
Traceback (most recent call last):
File "scripts/train_model.py", line 640, in <module>
main(args)
File "scripts/train_model.py", line 193, in main
train_loop(args, train_loader, val_loader)
File "scripts/train_model.py", line 341, in train_loop
ee_optimizer.step()
File "/home/zhangheng_li/applications/anaconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 74, in step
p.data.addcdiv_(-step_size, exp_avg, denom)
KeyboardInterrupt
I had a similar issue with a deadlock with one worker inside Docker, and in my case I could confirm it was a shared memory problem. By default docker allocates only 64MB of shared memory, but I needed 440MB for one worker, which probably triggered the behavior @apaszke described.
I've been bitten by the same issue, but I'm in a different environment from most others in this thread, so my input may help pinpoint the root cause. My pytorch is installed from peterjc123's excellent conda package built for Windows 10.
I'm running a cnn on the cifar10 dataset, with num_workers set to 1 for the data loader. I know num_workers > 0 is said to cause BrokenPipeError and is advised against in #494, but what I'm experiencing is not BrokenPipeError but a memory allocation error. The error always occurred right after validation of an epoch and before training of the next epoch started, at around 50 epochs in. It was exactly 50 epochs with about 90% probability, otherwise off by one or two epochs. Everything else is fairly consistent. Setting num_workers=0 makes the problem go away.
@paulguerrero is right. I solved this problem by increasing the shared memory from 64M to 2G. Maybe this is useful for docker users.
@berzjackson That's a known bug in the conda package. It's fixed in the latest CI builds.
We have about 600 people starting a new course that uses Pytorch on Monday. Many people on our forum are reporting this issue. Some are on AWS P2, some on their own systems (mainly GTX 1070, some Titan X).
When training pauses, the end of the stack trace shows:
~~~
~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

~/anaconda2/envs/fastai/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:
~~~
We have num_workers=4 and pin_memory=False. I've asked people to check their shared memory settings. Is there anything we can do (or that Pytorch can do) to resolve this? (Other than reducing num_workers, since that would slow things down quite a bit.)
It happened to me in @jph00's class exactly once.
That is, once data loading/caching has run through once, I can move on and keep repeating the steps... With 4 num_workers, everything seems to run as fast as expected on the GPU.
I'm using PyTorch 0.2.0_4, Python 3.6.2, Torchvision 0.1.9 on Ubuntu 16.04 LTS. Running `df -h` in a terminal shows I have 16GB on /dev/shm, although usage is very low.
Here's a screenshot of where the loading fails (note that this was using num_workers=0 for the data).
(Sorry about the small text; I had to zoom out to capture everything...)
@apiltamang I'm not sure that's the same problem; let's look into it as soon as possible!
I've of course given @apaszke access to our private forum, and I've given @soumith login access to one of our boxes so they can work on the problem.
@jph00 Hi Jeremy! Sorry your students are running into this. Did they try increasing SHM as @apaszke described above? Did it help?
@SsnL One of the students confirmed that increasing shared memory didn't make the problem go away. I'm asking others to confirm as well.
@jph00 Thanks! I could reproduce the hang with low shared memory. If the problem lies elsewhere, we need to dig deeper. Could you share the script with me?
確ãã«-ãããç§ãã¡ã䜿çšããŠããããŒãããã¯ã§ãïŒ https ïŒ
è€è£œã§ããå ±æã¡ã¢ãªã®åé¡ã«åºã¥ããŠãã©ã€ãã©ãªãŸãã¯ããŒãããã¯ã«è¿œå ããŠåé¿ã§ããåé¿çã¯ãããŸããïŒ
@ jph00ä»ããã³ãŒãã«é£ã³èŸŒã¿ãŸãã å ±æã¡ã¢ãªã®äœ¿çšéãæžããæ¹æ³ãèŠã€ããããšããŸãã ã¹ã¯ãªããã§å€§éã®shmã䜿çšããå¿ èŠã¯ãªãããã§ãã®ã§ãåžæããããŸãã
ãŸããPRãéä¿¡ããŠãshmå¶éã«éãããšãã«ãåã«ãã³ã°ãããã®ã§ã¯ãªããé©åãªãšã©ãŒã¡ãã»ãŒãžã衚瀺ããŸãã
OK, I've reproduced the problem on a fresh AWS P2 instance using the CUDA 9 AMI with the latest Pytorch conda install. If you provide your public key, I can let you try it directly. My email address is my first name at fast.ai
@jph00 sent you an email :) Thanks!
@jph00 and?

OK, so I understand the basic problem now: opencv and Pytorch multiprocessing sometimes don't play well together. No problems on our university box, but lots of them on AWS (a P2 instance with the new deep learning CUDA 9 AMI). Adding locks around all the cv2 calls didn't fix it, but adding cv2.setNumThreads(0)
did. This also seems to fix it:
from multiprocessing import set_start_method
set_start_method('spawn')
However, that one affects performance by about 15%. What's recommended in the opencv github issue is to use https://github.com/tomMoral/loky. I've used that module before and found it very solid. It's not urgent since I have a solution that works well enough for now, but it may be worth considering Loky for the dataloader.

Perhaps more importantly though, it would be nice if pytorch's queue had some kind of timeout, so that these infinite hangs could be caught.
FYI, since 'spawn' made some loads 2-3x slower, I tried another fix: adding a few random sleeps to the section that briefly iterates through the dataloader. That also fixes the problem - though it's hardly ideal!
Thanks for digging into this! Good to know you found two workarounds. Indeed, we should probably add a timeout to dataset indexing. We'll discuss these loops tomorrow and I'll get back to you.

cc @soumith - seems like something we should look into?
For people who come to this thread for the discussion above: the opencv issue is discussed in more depth at https://github.com/opencv/opencv/issues/5150
OK, I think I've now fixed this properly - I rewrote the Dataloader to use ProcessPoolExecutor.map()
and moved tensor creation into the parent process. The result is faster than the original dataloader and has been stable on every computer I've tried. The code is also much simpler.

If anyone is interested in using it, you can get it from https://github.com/fastai/fastai/blob/master/fastai/dataloader.py

The API is the same as the standard version, except that your dataset shouldn't return Pytorch tensors - it should return numpy arrays or python lists. I haven't tried to make it work on older Pythons, so I wouldn't be surprised if there are some issues there.

(FYI, the reason I went down this path is that I found that on recent GPUs, when doing a lot of image processing/augmentation, preprocessing with Pytorch on the CPU couldn't complete fast enough to keep the GPU busy; opencv operations are much faster, and as a result I can fully utilize the GPU.)
Yeah, if it's an opencv problem there isn't much we can do about it. Forking is dangerous when you have thread pools. I don't think we want to add runtime dependencies (we have none right now), especially since they can't handle PyTorch tensors properly. We should probably get to the root cause of these deadlocks instead.
@jph00 have you tried Pillow-SIMD? It should work out of the box with torchvision, and I've heard a lot of good things about it.
Yeah, I know Pillow-SIMD well - it only speeds up resize, blur, and RGB conversion.
I agree there's not much that can be done here. It's not an opencv problem (they don't claim to support this type of python multiprocessing, let alone pytorch's special multiprocessing module), and it's not exactly a Pytorch problem either. But the fact that Pytorch waits silently instead of raising an error is (IMO) fixable. More generally, a lot of smart people have worked hard over the years to create multiprocessing approaches that avoid and improve on these problems, so there may be an opportunity to borrow from the approaches they use, without bringing in external dependencies.

Olivier Grisel, one of the people behind Loky, has a great slide deck summarizing the state of multiprocessing in Python: http

Since I've written a new dataloader that doesn't have the problem, either way works for me. But FWIW, I think the interaction between pytorch's multiprocessing and other systems will be a problem for others in the future too.
For what it's worth, I hit this problem with Python 2.7 on ubuntu 14.04. My dataloader loads data from an sqlite database and worked perfectly with num_workers=0
, occasionally seemed OK with num_workers=1
, and deadlocked very quickly with any higher value. Stack traces showed the process hanging in recv_bytes
.

Things that did not work: passing --shm-size 8G
or --ipc=host
, running echo 16834 | sudo tee /proc/sys/kernel/shmmni
to increase the number of shared-memory segments (the default on my machine is 4096), and neither pin_memory=True
nor pin_memory=False
helped. What definitively fixed my problem was porting the code to Python 3: launching the same version of Torch inside a Python 3.6 instance (from Anaconda) fixed the problem completely, and data loading no longer hangs.
@apaszke FYI, that's why it was important to us that it works well with opencv (and why torchsample isn't a great option - it handles fewer than 200 images/sec for rotations!)

Has anyone found a solution to this problem?
@iqbalu try the script above: https:
It solved my problem, but it doesn't support num_workers=0
.

@elbaro I actually tried it, and in my case it wasn't using multiple workers at all. Did you change anything there?
@iqbalu the fast.ai dataloader doesn't spawn worker processes. It only uses threads, so it may not show up in some tools.
@apaszke @elbaro @jph00 the fast.ai dataloader slowed data reading down by more than 10x for me. I'm using num_workers=8. What could be the reason?

Probably because you're using a package that doesn't release the GIL.
@apaszke any thoughts on why shared-memory usage keeps growing after a number of epochs? In my case it starts at ~400MB and then grows by roughly 400MB every ~20 epochs. Thanks!

@iqbalu no, not really. That shouldn't be happening.
I tried many things, and cv2.setNumThreads(0)
finally solved my problem.

Thanks @jph00
I've been bothered by this problem recently. cv2.setNumThreads(0)
doesn't work for me. I even changed all the cv2 code to use scikit-image instead, but the problem persists. Moreover, I have 16G of /dev/shm
. The problem only occurs when using multiple GPUs; everything works fine on a single GPU. Does anyone have any new thoughts on a solution?
Same error. I get this problem when using a single GPU.
For me, disabling opencv threads solved the problem:
cv2.setNumThreads(0)

Same with pytorch 0.3, cuda 8.0, ubuntu 16.04,
and opencv is not being used at all.

I'm using pytorch 0.3, cuda 8.0, ubuntu 14.04. I observed this hang after I started using cv2.resize().
cv2.setNumThreads(0) solved my problem.
I'm using python 3.6, pytorch 0.3.0, cuda 8.0, ubuntu 17.04 on a system with two 1080Ti and 32GB of RAM.
When I use 8 workers for my own dataset, this deadlock happens frequently (within the first epoch). Reducing the workers to 4 made it disappear (ran 80 epochs).
Even when the deadlock happens, there's still up to 10GB of free RAM.
Here you can see the log after the script terminates: https:
Update: I confirmed that increasing SHMMNI solves the problem. On Ubuntu 17.04 I added kernel.shmmni=8192
to /etc/sysctl.conf
.
I'm experiencing this problem on Ubuntu 17.10, Python 3.6, Pytorch 0.3.1, CUDA 8.0. The time at which the deadlock occurs seems inconsistent, and there's plenty of RAM left when it happens; it can occur after the 1st epoch or after the 200th.
The combination of kernel.shmmni=8192
and cv2.setNumThreads(0)
seems to have improved things, but neither worked individually.
Same for me. The deadlock happens when I set num_workers=4. I'm using Ubuntu 17.10, Pytorch 0.3.1, CUDA 9.1, python 3.6. I observe 4 python threads, each occupying 1.6GB of memory, while the CPU (4 cores) stays idle. Setting num_workers=0 helps resolve this.

Same problem here: it freezes right after exactly one epoch, and I can't actually reproduce it with a smaller dataset. I'm using CUDA 9.1, Pytorch 0.3.1 and Python 3.6 in a Docker environment.
I tried @jph00's dataloader, but it turned out to be quite slow for my use case. My current workaround is to recreate the Pytorch DataLoader before every epoch. It works, but it's really slow.
I ran into exactly the same problem on Ubuntu 17.10, CUDA 9.1, Pytorch master (compiled on the morning of 19/04). I also use OpenCV in my dataset subclass.
I was then able to work around the deadlock by changing the multiprocessing start method from 'forkserver' to 'spawn':
# Set multiprocessing start method - deadlock
set_start_method('forkserver')
# Set multiprocessing start method - runs fine
set_start_method('spawn')
I tried almost all of the approaches above! None of them worked!
This problem may be related to some incompatibility with the hardware architecture, and I don't know how Pytorch triggers it. It may or may not be a Pytorch problem.
Here's how I solved my problem:
_Update the BIOS!_
Give it a try. At least it solved my problem.
Same here. Ubuntu, PyTorch 0.4, python 3.6.
The problem still seems to exist with pytorch 0.4 and python 3.6. I don't know whether it's a pytorch issue. I use opencv, num_workers=8
and pin_memory=True
. I tried all the tricks above, and setting cv2.setNumThreads(0)
solved it for me.
(1) Setting num_workers=0 in the PyTorch dataloader solves the problem (see above); or
(2) cv2.setNumThreads(0) solves the problem even when num_workers is reasonably large.
This looks like some kind of thread-locking issue.
I set cv2.setNumThreads(0) towards the top of my main python file, and I've never run into this problem since.
Yes, many of these problems arise because third-party libraries are not fork-safe. One alternative workaround is to use the spawn start method.
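As a minimal stdlib illustration of that workaround (the function names here are mine, not from the thread), a "spawn" context can be requested per pool instead of globally:

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # A "spawn" context starts fresh interpreter processes instead of fork(),
    # so workers do not inherit thread pools or locks from already-imported
    # libraries (the fork-safety issue described above).
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```

The same idea is what `set_start_method('spawn')` applies globally; a context keeps the choice local to one pool.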
In my case, the deadlock occurred when I used num_workers > 0 in the dataloader with the model wrapped in nn.DataParallel. Removing the nn.DataParallel wrapper let me run the script:
CUDA_VISIBLE_DEVICES=0 python myscript.py --split 1
CUDA_VISIBLE_DEVICES=1 python myscript.py --split 2
Without multiple GPUs the script runs slower, but I can run multiple experiments simultaneously on different splits of the dataset.
Same problem here with Python 3.6.2 / Pytorch 0.4.0.
And the problem remains even after toggling pin_memory, changing the shared-memory size, and switching to the skimage library (no cv2 in use!!).
The problem occurs randomly; all I can do about it is monitor the console and restart training.
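For the "monitor and restart" situation described above, one option is to fetch items through a helper thread with a timeout, so a silent hang becomes an exception; a sketch (the helper name and the 5-second limit are arbitrary choices, not from the thread):

```python
import queue
import threading

def iterate_with_timeout(iterable, timeout=5.0):
    # Pull items on a background thread; raise instead of hanging forever.
    q = queue.Queue(maxsize=1)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no item produced within %.1fs" % timeout)
        if item is sentinel:
            return
        yield item

print(list(iterate_with_timeout(range(3))))  # [0, 1, 2]
```

Wrapping a DataLoader this way would turn an indefinite worker hang into a `TimeoutError` that a supervising script can catch and restart from.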
@jinh574 I just set the number of dataloader workers to 0, and it works.

@Shuailong I have to use large images, so I can't use that setting because of speed. I need to look into this problem more.
Same problem here with Python 3.6 / Pytorch 0.4.0. Does the pin_memory
option affect anything?
If you're using a collate_fn and num_workers > 0 with a PyTorch version < 0.4: make sure everything is returned from your
__getitem__()
function as Numpy arrays.
I get this problem even after setting num_workers=0 or cv2.setNumThreads(0).
It fails with one of these two issues every time. Anyone else facing the same thing?
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 209, in <module>
    main()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 205, in main
    process.wait()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 96, in _worker_loop
    r = index_queue.get(timeout=MANAGER_STATUS_CHECK_INTERVAL)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
I'm using version 0.5.0a0+f57e4ce and had the same problem. Either cancelling the parallel dataloader (num_workers=0) or setting cv2.setNumThreads(0) works.
I'm fairly sure #11985 should fix all the hangs (unless we're interrupted at unfortunate times we can't control). It's been merged, so I'm closing this.

We can't control hangs inside cv2, since cv2 doesn't work well with multiprocessing.
I'm still experiencing this as of torch_nightly-1.0.0.dev20181029
- is the PR not merged there yet?
@Evpok it should be merged there. We certainly need that patch. I do wonder whether longer waits can still deadlock. Do you have a simple repro we could look at?
Sorry for the inconvenience; it actually traced back to some unrelated multiprocessing mess on my side.
Hi @Evpok,
I'm hitting this problem with torch_nightly-1.0.0
. Did you solve it?
If you're using a collate_fn and num_workers > 0 with a PyTorch version < 0.4: make sure everything is returned from your
__getitem__()
function as Numpy arrays.

I changed it to return zero dummy tensors, but the problem still exists.
@zimenglan-sysu-512 the main problem lies in the limitations of multiprocessing: when using spawn
or forkserver
(required for CPU-GPU communication), sharing objects between processes is quite restricted, and it doesn't suit the kinds of objects I have to manipulate.
None of this worked for me. However, the latest opencv does work (you just need to go from 3.4.0.12
to 3.4.3.18
):
sudo pip3 install --upgrade opencv-python
@see-- good to know opencv fixed things on their side :)

I'm using OpenCV 3.4.3.18 with python 2.7, and the deadlock is still happening :/
Try the following:
from torch.utils.data.dataloader import DataLoader
instead of
from torch.utils.data import DataLoader
I think there's a problem with this type check here:
https://github.com/pytorch/pytorch/blob/656b565a0f53d9f24547b060bd27aa67ebb89b88/torch/utils/data/dataloader.py#L816
Try the following:
from torch.utils.data.dataloader import DataLoader
instead of
from torch.utils.data import DataLoader
I think there's a problem with this type check here:
pytorch/torch/utils/data/dataloader.py
Line 816 in 656b565
super(DataLoader, self).__setattr__(attr, val)
Isn't that just an alias? torch.utils.data.__init__ imports dataloader.DataLoader
I was also hanging with num_workers > 0, but there's no opencv in my code, and memory usage in /dev/shm
wasn't an issue. The suggestions above didn't work for me. My fix was to update numpy from 1.14.1 to 1.14.5:
conda install numpy=1.14.5
Hope that helps.
Hmm, my numpy version is 1.15.4, which is newer than 1.14.5... should that be fine?
Idk, the numpy update also updated mkl for me.
Which mkl version do you have? Mine is 2019.1 (build 144), and the other packages with mkl in their name are:
mkl-service 1.1.2 py37he904b0f_5
mkl_fft 1.0.6 py37hd81dba3_0
mkl_random 1.0.2 py37hd81dba3_0
conda list | grep mkl
mkl 2018.0.1 h19d6760_4
mkl-service 1.1.2 py36h17a0993_4
That said, if anyone still sees hangs with the latest pytorch, it would be very helpful if you could provide a short script that reproduces the problem. Thanks!
This deadlock is still happening. I'll see whether I can create a script that reproduces it.
pin_memory=True
solved the problem for me.

pin_memory=True
doesn't seem to work for me; it still stalls after 70 epochs. So far the only thing that works for me is setting num_workers=0
, but that is noticeably slower.
I'm also experiencing this deadlock (it happens fairly randomly). I tried pin_memory
and updated Numpy. I'll try running on another machine.
If you're using multiple threads that contain dataloaders, try multiprocessing instead of multithreading. That completely solved the problem for me (by the way, because of the GIL it's also better suited to compute-heavy tasks in Python).
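The process-vs-thread suggestion above can be illustrated with the stdlib alone; this sketch (function names are illustrative) farms CPU-bound work out to a process pool, which the GIL would otherwise serialize across threads:

```python
from multiprocessing import Pool

def heavy(n):
    # CPU-bound work: the GIL would serialize this across Python threads,
    # but separate processes run it truly in parallel.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(heavy, [10, 20]))  # [285, 2470]
```

Threads remain fine for I/O-bound loading; it's Python-level computation in the loader that benefits from processes.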
Same error with Pytorch 1.0, Pillow 5.0.0, numpy 1.16.1, python 3.6.
I get the same error with pin_memory=True
and num_workers=0
. I noticed that the error doesn't occur when I use only a smaller part of the dataset; it happens only when I use the whole dataset.
Edit: simply rebooting the system fixed it for me.
I had a similar problem. In some code, this function hangs (most of the time) at d_iter.next():
def get_next_batch(d_iter, loader):
try:
data, label = d_iter.next()
except StopIteration:
d_iter = iter(loader)
data, label = d_iter.next()
return data, label
What worked for me was adding a small delay after calling this function:
trn_X, trn_y = get_next_batch(train_data_iter, train_loader)
time.sleep(0.003)
val_X, val_y = get_next_batch(valid_data_iter, valid_loader)
I think the delay helped avoid the deadlock!
I'm still facing this problem. I use pytorch 1.0 and python 3.7. The bug occurs when I use multiple data_loaders. With fewer than 3 data_loaders, or with a single GPU, the bug doesn't occur. Tried:
My solution is to add cv2.setNumThreads(0) in the preprocessing code.
I have two dataloaders, one for train and one for val,
and the evaluator can only run once.
I ran into this bug with pytorch 1.1. It stalled twice at the same place (the end of epoch 99). pin_memory
was set to False
.
Same problem when using workers > 0, and pinning memory didn't solve it either.
My solution is to add cv2.setNumThreads(0) in the preprocessing code.
I have two dataloaders, one for train and one for val,
and the evaluator can only run once.

This solution works for me, thanks.
The dataloader stalls when an epoch finishes and a new epoch is about to start.

Dealing with the same issue. In my case the problem appeared when I installed opencv-python (I had previously installed opencv3). After removing opencv-python, training no longer stalls.
Good idea.

On 2019-06-20 10:51:02, hongzhenwang [email protected] wrote:
The dataloader stalls when an epoch finishes and a new epoch is about to start.
Dealing with the same issue. In my case the problem appeared when I installed opencv-python (I had previously installed opencv3). After removing opencv-python, training no longer stalls.
I'm still facing this problem. I use pytorch 1.0 and python 3.7. The bug occurs when I use multiple data_loaders. With fewer than 3 data_loaders, or with a single GPU, the bug doesn't occur. Tried:
1. time.sleep(0.003)
2. pin_memory=True/False
3. num_workers=0/1
4. from torch.utils.data.dataloader import DataLoader
5. writing 8192 to /proc/sys/kernel/shmmni

None of them works. Don't know whether there is any solutions?
Still trying to find a workaround. I agree that this only seems to happen when running two parallel processes simultaneously on different GPUs; one keeps going while the other stalls.

When I set num_workers=4, the program wastes a lot of time, stalling for several seconds (or minutes) every 4 batches. Any ideas on how to solve this?
Adding to this: pin_memory=True and num_workers=0 in the dataloader is a solution!
@ArturoDeza
That may be a solution, but setting num_workers=0 makes the whole CPU data-feeding pipeline slow, and GPU utilization gets very low.
In my case, the reason was that the system didn't have enough CPUs for the num_workers
specified in the dataloader. If your dataloader's __get_item__
method uses threaded libraries such as numpy
, librosa
, or opencv
, it's advisable to disable the dataloader workers' threads. This can be done by running the training script as OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py
. To make the explanation below clear, note that each dataloader batch is handled by a single worker: each worker processes batch_size
samples to complete a single batch, and then starts processing a new batch of data.

num_workers
needs to be set lower than the number of CPUs on the machine (or the pod, if you're using Kubernetes), but high enough that the data is always ready for the next iteration. If the GPU runs each iteration in t
seconds and each dataloader worker takes N*t
seconds to load/process a single batch, you should set num_workers
to at least N
to avoid GPU stalls; you need N
CPUs for that.

Unfortunately, if the Dataloader uses a library that uses K
threads, the number of spawned processes becomes num_workers*K = N*K
. This can be significantly larger than the number of CPUs on the machine. That throttles the workers and makes the dataloader very slow; as a result, the dataloader may not return a batch every t seconds, and the GPU stalls.

One way to avoid the K
threads is to call the main script with OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 python train.py
. This constrains each Dataloader worker to use a single thread and avoids overwhelming the machine. You still need enough num_workers
to keep the GPU fed.

You should also optimize your code in __get_item__
so that each worker completes its batch in a short time. Make sure the time for a worker to finish preprocessing a batch is not dominated by the time to read the training data from disk (especially when reading from network storage) or by network bandwidth (when reading over the network). If your dataset is small and you have enough RAM, consider moving the dataset to RAM (or /tmpfs
) and reading from there for fast access. For Kubernetes, you can create a RAM disk (search for emptyDir
in Kubernetes).

If you've optimized your __get_item__
code and made sure disk access/network access isn't the culprit, but still see stalls, you'll need to request more CPUs (if using Kubernetes) or a machine with more CPUs per GPU.

Another option is to reduce batch_size
so each worker
has less work to do and finishes preprocessing faster. The latter option is undesirable in some cases, since idle GPU memory goes unused.

You can also consider doing some of the preprocessing offline to take the load off each worker. For example, if each worker reads a wav file and computes a spectrogram of the audio file, you can precompute the spectrograms offline and have the workers just read the computed spectrogram from disk. That reduces the amount of work each worker needs to do.
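The worker-count reasoning above (GPU step time t, batch preparation time N*t) can be written down as a tiny helper; this is only an illustration of the arithmetic, with made-up timings:

```python
import math

def required_workers(batch_prep_seconds, gpu_step_seconds):
    # The GPU consumes one batch every gpu_step_seconds. With each worker
    # taking batch_prep_seconds per batch, N workers keep up when
    # N * gpu_step_seconds >= batch_prep_seconds.
    return max(1, math.ceil(batch_prep_seconds / gpu_step_seconds))

# A batch that takes 1.0s to prepare against a 0.25s GPU step needs 4 workers.
print(required_workers(1.0, 0.25))  # 4
```

The result is a lower bound: fewer workers than this and the GPU stalls waiting on data; many more than the machine has CPUs and the workers throttle each other instead.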
Dealing with the same issue with horovod.

Dealing with a similar issue... the deadlock occurs when an epoch ends and loading the data for validation begins...
@jinhou @jackroos Same here.

@jinhou @jackroos Same here.

Yeah, in that case I just switch off distributed training.
I ran into a similar problem: the dataloader stalls when an epoch finishes and a new epoch is about to start.

Why does this keep happening?
I'm still facing this problem. I use pytorch 1.0 and python 3.7. The bug occurs when I use multiple data_loaders. With fewer than 3 data_loaders, or with a single GPU, the bug doesn't occur. Tried:

- time.sleep(0.003)
- pin_memory=True/False
- num_workers=0/1
- from torch.utils.data.dataloader import DataLoader
- writing 8192 to /proc/sys/kernel/shmmni

None of them works. Does anyone know of a solution?
Setting num_workers to 0 worked for me. You need to make sure it's 0 everywhere it's used.

Some other potential solutions:

- from multiprocessing import set_start_method
  set_start_method('spawn')
- cv2.setNumThreads(0)

It seems 3 or 7 are the way to go.
I experienced this problem with pytorch 1.3 on ubuntu 16. None of the suggestions above worked, except workers=0, which makes the run slow. It only happens when running from a terminal; inside a Jupyter notebook everything is fine, even with workers=32.

It looks like this problem isn't solved and should be reopened. I see many other people reporting the same issue...
I'm still facing this problem. I use pytorch 1.0 and python 3.7. The bug occurs when I use multiple data_loaders. With fewer than 3 data_loaders, or with a single GPU, the bug doesn't occur. Tried:

- time.sleep(0.003)
- pin_memory=True/False
- num_workers=0/1
- from torch.utils.data.dataloader import DataLoader
- writing 8192 to /proc/sys/kernel/shmmni

None of them works. Does anyone know of a solution? Setting num_workers to 0 worked for me. You need to make sure it's 0 everywhere it's used.

Some other potential solutions:

- from multiprocessing import set_start_method
  set_start_method('spawn')
- cv2.setNumThreads(0)

It seems 3 or 7 are the way to go.
I changed train.py
to this:
from __future__ import division
import cv2
cv2.setNumThreads(0)
import argparse
...
And it works for me.
Hope that helps everyone.
I've had a similar problem to this, but it happens roughly every 100 epochs.

I noticed that it only happens when CUDA is enabled, and dmesg has this log entry every time it crashes:

python[11240]: segfault at 10 ip 00007fabdd6c37d8 sp 00007ffddcd64fd0 error 4 in libcudart.so.10.1.243[7fabdd699000+77000]

That's over my head, but it tells me that CUDA and Python multithreading aren't playing well together.
My fix was to disable cuda in the data threads. Here's a snippet from my python entry file:
from multiprocessing import set_start_method
import os
if __name__ == "__main__":
set_start_method('spawn')
else:
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import torch
import application
It works well, and maybe it will help someone who lands here needing it as much as I did.
@jinhou @jackroos Same here.
Yeah, in that case I just switch off distributed training.

After updating to PyTorch 1.4, I'm hitting a similar problem with distributed training, without using OpenCV.
So I have to run validation once before the training and validation loop.
I'm having a lot of problems with this. It seems to persist across pytorch versions, python versions, and different physical machines (which are probably set up identically).

It's the same error every time:
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/bicep/loops.py", line 73, in __call__
for data, target in self.dataloader:
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 830, in _next_data
self._shutdown_workers()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 942, in _shutdown_workers
w.join()
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/<me>/miniconda2/envs/<my-module>/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
There are clearly some issues with how processes are handled on the machines I'm using, since none of the solutions above seem to work except setting num_workers=0.

I'd really like to get to the bottom of this. Does anyone have any thoughts on where to start, or how to pin it down?

Same here.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.
One interesting thing:
if I only parse the data line by line, the following problem doesn't occur:
with open(current_file, mode='rb') as f:
text = f.read().decode('utf-8')
all_data.extend(text.split('\n'))
But if I add JSON parsing logic after reading each line, it reports this error:
with open(current_file, mode='rb') as f:
text = f.read().decode('utf-8')
all_data.extend(text.split('\n'))
json_data = []
for line in all_data:
try:
json_data.append(json.loads(line))
except:
break
return json_data
I understand JSON can cause memory overhead, but even after reducing the number of workers to 2, and with a very small dataset, the same problem occurs. So I doubt it's shm-related. Any thoughts?
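For what it's worth, a lower-memory variant of the snippet above would parse while streaming instead of materializing all lines first; a sketch (the helper name is mine, and whether this avoids the crash is untested):

```python
import io
import json

def load_jsonl(f):
    # Stream line by line instead of reading the whole file into memory,
    # so each worker holds at most one undecoded line at a time.
    out = []
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            out.append(json.loads(line))
        except json.JSONDecodeError:
            break
    return out

sample = io.StringIO('{"a": 1}\n{"b": 2}\n')
print(load_jsonl(sample))  # [{'a': 1}, {'b': 2}]
```

With a real file, `open(current_file, encoding='utf-8')` can be passed directly instead of the `StringIO` used here for illustration.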
Shall we reopen this issue?

I think we should. By the way, I did some GDB debugging but found nothing. So I'm not sure whether it's a shared-memory problem either.
(gdb) run
Starting program: /home/miniconda/bin/python performance.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa60a6700 (LWP 61963)]
[New Thread 0x7fffa58a5700 (LWP 61964)]
[New Thread 0x7fffa10a4700 (LWP 61965)]
[New Thread 0x7fff9e8a3700 (LWP 61966)]
[New Thread 0x7fff9c0a2700 (LWP 61967)]
[New Thread 0x7fff998a1700 (LWP 61968)]
[New Thread 0x7fff970a0700 (LWP 61969)]
[New Thread 0x7fff9489f700 (LWP 61970)]
[New Thread 0x7fff9409e700 (LWP 61971)]
[New Thread 0x7fff8f89d700 (LWP 61972)]
[New Thread 0x7fff8d09c700 (LWP 61973)]
[New Thread 0x7fff8a89b700 (LWP 61974)]
[New Thread 0x7fff8809a700 (LWP 61975)]
[New Thread 0x7fff85899700 (LWP 61976)]
[New Thread 0x7fff83098700 (LWP 61977)]
[New Thread 0x7fff80897700 (LWP 61978)]
[New Thread 0x7fff7e096700 (LWP 61979)]
[New Thread 0x7fff7d895700 (LWP 61980)]
[New Thread 0x7fff7b094700 (LWP 61981)]
[New Thread 0x7fff78893700 (LWP 61982)]
[New Thread 0x7fff74092700 (LWP 61983)]
[New Thread 0x7fff71891700 (LWP 61984)]
[New Thread 0x7fff6f090700 (LWP 61985)]
[Thread 0x7fff7e096700 (LWP 61979) exited]
[Thread 0x7fff6f090700 (LWP 61985) exited]
[Thread 0x7fff74092700 (LWP 61983) exited]
[Thread 0x7fff7b094700 (LWP 61981) exited]
[Thread 0x7fff80897700 (LWP 61978) exited]
[Thread 0x7fff83098700 (LWP 61977) exited]
[Thread 0x7fff85899700 (LWP 61976) exited]
[Thread 0x7fff8809a700 (LWP 61975) exited]
[Thread 0x7fff8a89b700 (LWP 61974) exited]
[Thread 0x7fff8d09c700 (LWP 61973) exited]
[Thread 0x7fff8f89d700 (LWP 61972) exited]
[Thread 0x7fff9409e700 (LWP 61971) exited]
[Thread 0x7fff9489f700 (LWP 61970) exited]
[Thread 0x7fff970a0700 (LWP 61969) exited]
[Thread 0x7fff998a1700 (LWP 61968) exited]
[Thread 0x7fff9c0a2700 (LWP 61967) exited]
[Thread 0x7fff9e8a3700 (LWP 61966) exited]
[Thread 0x7fffa10a4700 (LWP 61965) exited]
[Thread 0x7fffa58a5700 (LWP 61964) exited]
[Thread 0x7fffa60a6700 (LWP 61963) exited]
[Thread 0x7fff71891700 (LWP 61984) exited]
[Thread 0x7fff78893700 (LWP 61982) exited]
[Thread 0x7fff7d895700 (LWP 61980) exited]
total_files = 5040. //customer comments
[New Thread 0x7fff6f090700 (LWP 62006)]
[New Thread 0x7fff71891700 (LWP 62007)]
[New Thread 0x7fff74092700 (LWP 62008)]
[New Thread 0x7fff78893700 (LWP 62009)]
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 62005) is killed by signal: Segmentation fault.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "performance.py", line 62, in <module>
main()
File "performance.py", line 48, in main
for i,batch in enumerate(rl_data_loader):
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
File "/home/zhrui/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 62005) exited unexpectedly
[Thread 0x7fff78893700 (LWP 62009) exited]
[Thread 0x7fff74092700 (LWP 62008) exited]
[Thread 0x7fff71891700 (LWP 62007) exited]
[Thread 0x7fff6f090700 (LWP 62006) exited]
[Inferior 1 (process 61952) exited with code 01]
(gdb) backtrace
No stack.
And I think I have enough shared memory; at least I'd expect shared memory to hold out for quite a while before a segfault. The segfault happens right after the dataloader workers start.
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398509481980
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767
Hi @soumith @apaszke, can we reopen this issue? I tried all the proposed solutions, like increasing the shm size and segments, and nothing works. I'm not using opencv or the like, just simple JSON parsing, and I still have the problem. I checked, and all my memory is open as shared memory, so I don't think it's shm-related. My stack trace shows nothing, as above.
@apaszke regarding your suggestion:

"Yes, many of these problems arise because third-party libraries are not fork-safe. Another workaround is to use the spawn start method."

I use multiple workers in the dataloader; how should I change the start method? I set set_start_method('spawn')
in main.py, but it doesn't seem to help.
Also a general question here: if I enable the multi-worker (multi-process) dataloader, and in the main training I also start multiprocessing as suggested at https://pytorch.org/docs/stable/notes/multiprocessing.html,
how does pytorch manage both the dataloader and the main training multiprocessing? Do they share all available processes/threads on a multi-core GPU machine? And is multiprocessing shared memory also shared between the dataloader and the main training processes? Also, where is the best place to put data work such as JSON parsing, CSV parsing, and pandas feature extraction? In the dataloader, so that __get_item__
generates ready-to-use, polished data while being kept as simple as possible, as suggested above, or in the main training?
@zhangruiskyline Your problem is actually not a deadlock. It is about workers being killed by segmentation faults. The SIGBUS suggests an shm problem. I'd need to see your dataset code to debug it.
To answer your other questions: using multiprocessing_context='spawn' selects spawn; set_start_method does it as well.

Thanks @SsnL. I added multiprocessing_context='spawn', but it fails the same way.
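For reference, here is a minimal stdlib-only sketch of selecting the spawn start method; the DataLoader-specific form (passing multiprocessing_context='spawn' to torch.utils.data.DataLoader, available since PyTorch 1.2) is shown only in a comment:

```python
import multiprocessing as mp

# Ask for a "spawn" context explicitly instead of changing the global default.
# Unlike fork, spawn starts workers from a fresh interpreter, which sidesteps
# fork-safety problems in third-party libraries.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # -> spawn

# With PyTorch, the per-loader equivalent is (not run here):
# loader = torch.utils.data.DataLoader(dataset, num_workers=4,
#                                      multiprocessing_context="spawn")
```

Passing the context per loader avoids touching the process-wide default, so other multiprocessing code in the program keeps its own start method.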
I mentioned in my previous thread that my code is quite simple:
~~~
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))
json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except:
        break
return json_data
~~~
So I don't think the problem is in my code. I also tried plain string splitting instead of JSON parsing, with the same result. It seems that as soon as the data processing in the data loader takes any real time, this problem appears.
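As an aside, the bare except: break in the snippet above silently stops at the first line that fails to parse (including the trailing empty string produced by split('\n')), so truncated output can masquerade as a loader problem. A stdlib-only sketch that skips blank or malformed lines instead (load_json_lines is a hypothetical helper, not from this thread):

```python
import json

def load_json_lines(path):
    """Parse a file of newline-delimited JSON, skipping blank or bad lines."""
    records = []
    with open(path, mode="rb") as f:
        text = f.read().decode("utf-8")
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines, including the trailing one
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole file
    return records
```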
Also, regarding:

"In multiprocess training, each process has its own DataLoader, so nothing is shared between processes unless you share it explicitly, including DataLoader workers."

So if I have 4 training processes and each has an 8-worker data loader, there are 32 processes in total underneath?
@zhangruiskyline We can't help you without a self-contained script that reproduces the problem. And yes, there are 32 processes.
Thanks. I also see similar issues here:
https://github.com/pytorch/pytorch/issues/4969
https://github.com/pytorch/pytorch/issues/5040
Both are closed, but I can't find a clear solution or fix in them. Is this still a widespread open problem?
I'll check whether I can provide a self-contained reproduction script; my code is tightly integrated with our platform and data source, but I'll try.
@zhangruiskyline If you read those threads, you'll see that your issue doesn't really resemble either of the linked ones. The original, most common problems reported in those threads have already been addressed, which is why they are closed.
Thanks @SsnL. I'm not yet very familiar with PyTorch, so I may be wrong, but I went through all of them, and some seem to be solved by one of the following:

reducing the number of workers to 0, but that is too slow to be acceptable;

increasing the shm size, but I believe I already have enough shm: the problem occurs right after startup, and a much smaller dataset fails the same way;

avoiding libraries such as opencv that misbehave under multiprocessing, but I only use JSON/CSV, nothing fancy.

My code is fairly simple. The training dataset has 10,000+ files, and each file holds multiple lines of JSON strings. In the data loader I define __get_item__, and:
in solution 1, I first read the file line by line and split it into a list of JSON strings. Returning at this point works, and performance is good:
~~~
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))
return all_data
~~~
The return value is still raw JSON strings, so to speed things up with the multiprocess data loader I moved the JSON-parsing logic in there, and it fails:
~~~
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))
json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except:
        break
return json_data
~~~
Later, suspecting that JSON parsing is slow or that JSON's memory footprint is too large, I parsed the JSON strings and converted them to feature lists by hand; it fails the same way. I ran a stack-trace analysis and got nothing.

By the way, I'm running the code in a Linux Docker environment with a 24-core CPU and one V100.

I'm not sure where to investigate next. Any thoughts?
Hi,

I found an interesting comment in https://github.com/open-mmlab/mmcv, which is used by https://github.com/open-mmlab/mmdetection:

the following code is used at the beginning of both the train epoch and the val epoch:

time.sleep(2)  # prevent possible deadlock during epoch transition

Maybe you could give it a try.
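The mmcv workaround mentioned above amounts to pausing briefly before each epoch; a minimal sketch (run_epochs and the delay parameter are illustrative, only the time.sleep(2) line comes from mmcv):

```python
import time

def run_epochs(n_epochs, run_one_epoch, delay=2.0):
    """Run epochs with a short pause in between, as mmcv does.

    The sleep between epochs is mmcv's guard against a possible deadlock
    while DataLoader workers are torn down and respawned at epoch boundaries.
    """
    for epoch in range(n_epochs):
        time.sleep(delay)  # prevent possible deadlock during epoch transition
        run_one_epoch(epoch)
```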
By the way, when using multiprocessing together with a multi-worker data loader, how can I make each process's data loader avoid reading the same data as the other processes' loaders? Or is this already handled by the pytorch data loader's __get_item__?
Hi @SsnL, thanks for your help. I'd like to follow up on this thread a bit. I refactored my training code to use pytorch multiprocessing so the CPU-side data processing is faster (and the GPU is fed faster): https://pytorch.org/docs/stable

In the preprocessing part I use the multi-worker data loader to cut the data loading and processing time: https://pytorch.org/docs/stable/data.html

I moved the CPU-hogging JSON parsing into the main training process instead of the data loader. I'm not sure why, but this seems to solve the problem; in any case it works now. I do have a follow-up question, though: if there are N processes and each has M loader workers, there are N×M workers in total underneath.

If all data is fetched by index in the data loader, i.e. the __get_item__(self, idx) of the M loaders in the N different processes could coordinate to handle different indices, how do I make sure nothing is processed twice? And how many processes should I open?
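On the duplicate-work question above: the usual approach is to give each training process a disjoint shard of the dataset indices, which is what torch.utils.data.distributed.DistributedSampler does for you in distributed training. A stdlib-only sketch of the round-robin sharding idea (shard_indices is a hypothetical helper, not a pytorch API):

```python
def shard_indices(num_samples, num_shards, shard_id):
    """Round-robin shard of dataset indices for one process.

    Every index lands in exactly one shard, so N processes never
    fetch (and parse) the same sample twice.
    """
    return list(range(shard_id, num_samples, num_shards))
```

Within each process's shard, the DataLoader then divides the work among its M workers automatically, so nothing extra is needed at the __get_item__ level.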
I had the same problem: the data loader crashes after complaining that it cannot allocate memory at the start of a new training or validation epoch. The solutions above did not work for me: (i) my /dev/shm is 32GB and its usage never exceeded 2.5GB; (ii) setting pin_memory=False did not help either. Could this be related to garbage collection? My code roughly looks like the following. I need an infinite iterator, hence the try/except around next() below :-)

~~~
def train():
    train_iter = train_loader.__iter__()
    for i in xrange(max_batches):
        try:
            x, y = next(train_iter)
        except StopIteration:
            train_iter = train_loader.__iter__()
        ...
    del train_iter
~~~

train_loader is a DataLoader object. Without the explicit del train_iter at the end of the function, the process always crashes after two or three epochs (/dev/shm still shows 2.5GB). Hope this helps! I'm using 4 workers (version 0.1.12_2 with CUDA 8.0 on Ubuntu 16.04).
After weeks of struggle, this solved the problem for me. Instead of looping over the loader directly, I had to use the loader iterator explicitly, and calling del loader_iterator at the end of each epoch finally cleared the deadlock.
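The workaround described above can be sketched loader-agnostically; take_batches is an illustrative name, and loader stands in for any iterable, including a DataLoader:

```python
def take_batches(loader, max_batches):
    """Draw max_batches items from loader, restarting it when exhausted."""
    it = iter(loader)
    out = []
    for _ in range(max_batches):
        try:
            out.append(next(it))
        except StopIteration:
            it = iter(loader)  # loader exhausted: build a fresh iterator
            out.append(next(it))
    # Drop the iterator explicitly so worker processes and their shared-memory
    # buffers are released promptly rather than lingering until GC runs.
    del it
    return out
```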
I think I'm facing the same problem. I'm trying to use 8 data loaders (MNIST, MNISTM, SVHN, USPS, one each for training and testing). With 6 of them (any 6) everything works fine. With 8 it always deadlocks while loading the 6th, the MNIST-M test set. It gets stuck in an endless loop of trying to fetch an image, failing, waiting a bit, and retrying. The error persists for any batch_size, and there is plenty of free memory. It only goes away when num_workers is set to 0; any other value triggers the problem.
I got a hint from https://stackoverflow.com/questions/54013846/pytorch-dataloader-stucked-if-using-opencv-resize-method: after adding cv2.setNumThreads(0), it works fine.
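For completeness, that fix is a single call made before the DataLoader workers are created; the import guard here is only so this sketch runs even without OpenCV installed:

```python
try:
    import cv2
    # Disable OpenCV's internal thread pool; its threads do not survive
    # fork() cleanly and can leave DataLoader workers deadlocked.
    cv2.setNumThreads(0)
    configured = True
except ImportError:
    configured = False  # OpenCV not installed, nothing to configure
print("cv2 threads pinned:", configured)
```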
Hi, I had the same problem, and it turned out to be related to ulimit -n; simply increasing it solved the issue. I use ulimit -n 500000.
@SebastienEske ulimit -n could be it. Setting ulimit -n does seem to be the right fix; as the model grows, the deadlocks happen more and more often. I tested cv2.setNumThreads(0), but it did not work for me.
For the record, cv2.setNumThreads(0) worked for me.
Most helpful comment
I ran into a similar problem: the data loader stops when one epoch finishes and a new epoch starts.