Pytorch: RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487346124464/work/torch/lib/THC/generic/THCStorage.cu:66

์— ๋งŒ๋“  2017๋…„ 03์›” 08์ผ  ยท  41์ฝ”๋ฉ˜ํŠธ  ยท  ์ถœ์ฒ˜: pytorch/pytorch

์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค.

THCudaCheck FAIL file=/data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487346124464/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "main_snli.py", line 293, in <module>
    experiment=BaseExperiment()
  File "main_snli.py", line 74, in __init__
    self.model.cuda()
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 143, in cuda
    return self._apply(lambda t: t.cuda(device_id))
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 114, in _apply
    module._apply(fn)
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 114, in _apply
    module._apply(fn)
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 120, in _apply
    param.data = fn(param.data)
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 143, in <lambda>
    return self._apply(lambda t: t.cuda(device_id))
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 51, in _cuda
    return self.type(getattr(torch.cuda, self.__class__.__name__), async)
  File "/home/bbbian/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 24, in _type
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487346124464/work/torch/lib/THC/generic/THCStorage.cu:66

์ด ์˜ค๋ฅ˜๋ฅผ ์–ด๋–ป๊ฒŒ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?

๊ฐ€์žฅ ์œ ์šฉํ•œ ๋Œ“๊ธ€

GPU์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค. ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค.

๋ชจ๋“  41 ๋Œ“๊ธ€

GPU์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค. ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค.

@apaszke
If I write a simple test like the code below, I get the 'out of memory' error; the test input dimension is 49200.
However, when I try a lower data dimension such as 1000 instead of 49200, the code runs fine.
Is there any PyTorch parameter setting I need to change?

import torch
import torch.nn as nn
from torch.autograd import Variable

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.dropout = nn.Dropout(p=0.2)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(49200, 49200)
        self.fc2 = nn.Linear(49200, 49200)
        self.fc3 = nn.Linear(49200, 3)
        self.out = nn.Sequential(
            self.fc1,
            self.relu,
            self.dropout,
            self.fc1,
            self.relu,
            self.dropout,
            self.fc3
            )
    def forward(self, premise, hypothesis):
        return self.out(torch.cat([premise, hypothesis], 1))

net = Net().cuda()
print (net)
premise = Variable(torch.randn(64, 82, 300))
hypothesis = Variable(torch.randn(64, 82, 300))
premise = premise.cuda()
hypothesis = hypothesis.cuda()
out = net(premise.contiguous().view(64,-1), hypothesis.contiguous().view(64,-1))
print(out)

๋‘ ๊ฐœ์˜ ํฐ FC ๋ ˆ์ด์–ด์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๊ธฐ์šธ๊ธฐ ์‚ฌ์ด์—์„œ ๋„คํŠธ์›Œํฌ(ํฌ๊ธฐ 49200)์—๋Š” 40GB์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค...

@jekbradbury ๊ณ„์‚ฐ์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ? ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ฐ ๊ทธ๋ผ๋””์–ธํŠธ์™€ ๊ด€๋ จํ•˜์—ฌ ๊ฐ ๋ ˆ์ด์–ด๊ฐ€ ์ฐจ์ง€ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ๋Š” ์–ผ๋งˆ์ž…๋‹ˆ๊นŒ? ๊ฐ์‚ฌ ํ•ด์š”.

ํ•ด๋‹น ๋ชจ๋ธ์—์„œ ๋‹จ์ผ ์„ ํ˜• ๋ ˆ์ด์–ด์˜ ๊ฐ€์ค‘์น˜๋งŒ ๊ณ ๋ คํ•˜๋Š” ๊ฒฝ์šฐ. ๋‹น์‹ ์€ ์–ป์„

49200^2 = 2ย 420ย 640ย 000

์š”์†Œ + ๊ฐ ์š”์†Œ๋Š” 4๋ฐ”์ดํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ

2ย 420ย 640ย 000 * 4 / 1024^3 = 9,01GB

๋ฌด๊ฒŒ๋งŒ์„ ์œ„ํ•ด. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๊ทธ๋ผ๋””์–ธํŠธ๋ฅผ ์ €์žฅํ•˜๋ ค๋ฉด ์ด ํฌ๊ธฐ์˜ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ฒญํฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ทธ๋ผ๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋„๋ก ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•ˆ๋…•ํ•˜์„ธ์š”, ๋™์ผํ•œ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์ง€๋งŒ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ๋ฅผ ์œ„ํ•ด์„œ๋งŒ ์˜ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ „์ฒด ๊ต์œก ๊ณผ์ •์€ ์™„๋ฒฝํ•˜๊ฒŒ ์ž˜ ์ž‘๋™ํ–ˆ์Šต๋‹ˆ๋‹ค. Inception v3๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ด ํ•™์Šต์„ ํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ฌด๋„ ๋‚˜๋ฅผ ๋„์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ? ๊ฐ์‚ฌ ํ•ด์š”

@tabibusairam ๋‚˜๋„ ๊ฐ™์€ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ต์œก ํ”„๋กœ์„ธ์Šค๋Š” ์ž˜ ์ž‘๋™ํ–ˆ์ง€๋งŒ(6G cuda ๋ฉ”๋ชจ๋ฆฌ์™€ ๋‚ด GPU์—๋Š” 12G ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์žˆ์Œ) ๋™์ผํ•œ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ต๊ณผํ•˜๋Š” ํ‰๊ฐ€ ํ”„๋กœ์„ธ์Šค๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์˜ค๋ฅ˜ ์ •๋ณด๋ฅผ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "evaluate.py", line 132, in <module>
    evaluate(pnet, args)
  File "evaluate.py", line 94, in evaluate
    predictions = pnet(X_test, initial_states)
  File "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zcrwind/workspace/pro/predict/zcr/pnet.py", line 497, in forward
    output, hidden_states = self.step(A0, hidden_states)
  File "/home/zcrwind/workspace/pro/predict/zcr/pnet.py", line 377, in step
    forget_gate = hard_sigmoid(self.conv_layers['f'][lay](inputs))
  File "/home/zcrwind/workspace/pro/predict/zcr/pnet.py", line 28, in hard_sigmoid
    x = F.threshold(-x, 0, 0)
  File "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/functional.py", line 459, in threshold
    return _functions.thnn.Threshold.apply(input, threshold, value, inplace)
  File "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 174, in forward
    getattr(ctx._backend, update_output.name)(ctx._backend.library_state, input, output, *args)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu:66

์šด๋™ํ•˜์…จ๋‚˜์š”? ๊ฐ์‚ฌ ํ•ด์š”.

์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋™์•ˆ์˜ ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„๋Š” ๊ธฐ์ฐจ์—์„œ์™€ ๊ฐ™์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค.
๋งค๊ฐœ๋ณ€์ˆ˜๋Š” ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ์—์„œ ํ›ˆ๋ จ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ช…๋ น์„ ์‚ฌ์šฉํ•ด๋ณด์‹ญ์‹œ์˜ค -
nvidia-smi๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ์ค‘์— GPU ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ ์‚ฌํ•ญ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ ์ค„์—ฌ๋ณด์‹ญ์‹œ์˜ค(๋‹จ์ผ GPU์—์„œ๋งŒ ์ž‘์—…ํ•˜๋Š” ๊ฒฝ์šฐ).
๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ ์‚ฌํ•ญ์€ ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์„์ˆ˜๋ก ์ ์Šต๋‹ˆ๋‹ค.

2018๋…„ 1์›” 14์ผ ์ผ์š”์ผ ์˜ค์ „ 11:17, Chenrui Zhang [email protected]
์ผ๋‹ค:

@tabibusairam https://github.com/tabibusairam ๋‚˜๋„ ๋งŒ๋‚ฌ๋‹ค
๋™์ผํ•œ ๋ฌธ์ œ: ๊ต์œก ํ”„๋กœ์„ธ์Šค๊ฐ€ ์ž˜ ์ž‘๋™ํ–ˆ์Šต๋‹ˆ๋‹ค(6G cuda ๋ฉ”๋ชจ๋ฆฌ ๋ฐ ๋‚ด
GPU์—๋Š” 12G ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์žˆ์Œ) ๊ฐ™์€ ๊ณผ์ •์„ ๊ฑฐ์นœ ํ‰๊ฐ€ ํ”„๋กœ์„ธ์Šค
๋„คํŠธ์›Œํฌ์— ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์˜ค๋ฅ˜ ์ •๋ณด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

THCudaCheck FAIL ํŒŒ์ผ=/opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
์—ญ์ถ”์ (๊ฐ€์žฅ ์ตœ๊ทผ ํ˜ธ์ถœ ๋งˆ์ง€๋ง‰):
ํŒŒ์ผ "evaluate.py", 132ํ–‰,
ํ‰๊ฐ€(prednet, ์ธ์ˆ˜)
ํ‰๊ฐ€์—์„œ ํŒŒ์ผ "evaluate.py", 94ํ–‰
์˜ˆ์ธก = prednet(X_test, initial_states)
ํŒŒ์ผ "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", 224ํ–‰, __call__
๊ฒฐ๊ณผ = self.forward( ์ž…๋ ฅ, * kwargs)
ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", 497ํ–‰, ์•ž์œผ๋กœ
์ถœ๋ ฅ, hidden_states = self.step(A0, hidden_states)
ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", 377ํ–‰, ๋‹จ๊ณ„์ ์œผ๋กœ
forget_gate = hard_sigmoid(self.conv_layers['f'][lay](์ž…๋ ฅ))
ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", 28ํ–‰, hard_sigmoid
x = F.threshold(-x, 0, 0)
ํŒŒ์ผ "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/functional.py", ๋ผ์ธ 459, ์ž„๊ณ„๊ฐ’
return _functions.thnn.Threshold.apply(์ž…๋ ฅ, ์ž„๊ณ„๊ฐ’, ๊ฐ’, ์ œ์ž๋ฆฌ)
ํŒŒ์ผ "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", 174ํ–‰, ์•ž์œผ๋กœ
getattr(ctx._backend, update_output.name)(ctx._backend.library_state, ์ž…๋ ฅ, ์ถœ๋ ฅ, *args)
RuntimeError: cuda ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜(2): /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu:66์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ

์šด๋™ํ•˜์…จ๋‚˜์š”? ๊ฐ์‚ฌ ํ•ด์š”.

โ€”
๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ณ  GitHub์—์„œ ํ™•์ธํ•˜์„ธ์š”.
https://github.com/pytorch/pytorch/issues/958#issuecomment-357490369 ,
๋˜๋Š” ์Šค๋ ˆ๋“œ ์Œ์†Œ๊ฑฐ
https://github.com/notifications/unsubscribe-auth/AMHzdCQ_jJ9ogDm1jaNSLB6wCbfP08XOks5tKZT8gaJpZM4MW6we
.

@tabibusairam ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ ์ค„์˜€์œผ๋ฉฐ ํ‰๊ฐ€ ์ฝ”๋“œ๋Š” ์ด์ œ ์•„์ฃผ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

@tabibusairam Did you write your transfer learning code from the example on pytorch.org? If so, I have another idea for solving it.

์˜ˆ, ๊ทธ ํ˜•์‹์œผ๋กœ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.
๋˜ํ•œ ๋ชจ๋ธ์— nn.DataParallel์„ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.
๋‹ค๋ฅธ ์•„์ด๋””์–ด๋Š” ๋ฐ˜๋“œ์‹œ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค

2018๋…„ 1์›” 22์ผ ์˜ค์ „ 5์‹œ 32๋ถ„์— "Tommeychang" [email protected] ์ด ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

@tabibusairam https://github.com/tabibusairam
pytorch.org์˜ ์˜ˆ์ œ๋กœ ๊ธฐ์šธ๊ธฐ ์ฝ”๋“œ๋ฅผ ์ „์†กํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ? ๊ทธ๋ ‡๋‹ค๋ฉด ๋‚˜๋Š”
๊ทธ๊ฒƒ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋˜ ๋‹ค๋ฅธ ์•„์ด๋””์–ด.

โ€”
๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ณ  GitHub์—์„œ ํ™•์ธํ•˜์„ธ์š”.
https://github.com/pytorch/pytorch/issues/958#issuecomment-359294050 ,
๋˜๋Š” ์Šค๋ ˆ๋“œ ์Œ์†Œ๊ฑฐ
https://github.com/notifications/unsubscribe-auth/AMHzdN8mRJKNr_0czrXDd-p66-iJImubks5tM9AggaJpZM4MW6we
.

@tabibusairam ๊ฐ™์€ ์ƒํ™ฉ์—์„œ ๊ฐ™์€ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค. ์ถ”๋ก  ์‹œ Variable()์—์„œ "volatile"์„ ๋ณ€๊ฒฝํ•˜์—ฌ ํ•ด๊ฒฐํ–ˆ์Šต๋‹ˆ๋‹ค. volatile=True๋กœ ์„ค์ •ํ•˜๋ฉด ์ถ”๋ก  ์ค‘์— ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„๊ฐ€ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ถ”๋ก  ์‹œ๊ฐ„์—๋Š” ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ์œ ์ง€ํ•  ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋งŽ์ด ์†Œ๋ชจํ•ฉ๋‹ˆ๋‹ค.
`Variable(x, volatile=True)'์™€ ๊ฐ™์ด volatile ํ”Œ๋ž˜๊ทธ๋ฅผ True๋กœ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
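
For reference, a minimal sketch of that pattern on the old Variable API (PyTorch <= 0.3); the model and batch below are stand-ins, since the real ones come from your own script:

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(300, 3)           # stand-in for the real network
x = torch.randn(8, 300)             # stand-in for an inference batch

model.eval()                         # also switches Dropout/BatchNorm to eval behaviour
x_var = Variable(x, volatile=True)   # PyTorch <= 0.3: no autograd graph is kept for this input
output = model(x_var)
# On PyTorch >= 0.4, volatile is a no-op and `with torch.no_grad():` is used instead.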

์ด ์˜ˆ์—์„œ๋Š” ํ›ˆ๋ จ ๋ฐ ๊ฒ€์ฆ์„ ์œ„ํ•ด ๊ฐ๊ฐ ๋‘ ๊ฐœ์˜ ๋ชจ๋ธ์ด ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค. ์ด ์„ค์ •์„ ์‚ฌ์šฉํ•˜๋ฉด ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ์‹œ GPU์—์„œ ๋‹ค๋ฅธ ๋ชจ๋ธ์ด ์‹คํ–‰๋˜๊ณ  ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋ฐ์ดํ„ฐ๋ฅผ volatile ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๋ž˜ํ•‘ํ•˜๋”๋ผ๋„ GPU์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.
์ €๋Š” ์ด ๋ฌธ์ œ๋ฅผ ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋งŒ ์„ค์ •ํ•˜์—ฌ ํ•ด๊ฒฐํ•˜๊ณ  ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋ฐ์ดํ„ฐ๋ฅผ volatile ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๋ž˜ํ•‘ํ•˜์—ฌ ๊ณ„์‚ฐ์„ ์ค„์ž…๋‹ˆ๋‹ค. @tabibusairam

@TommeyChang๋‹˜, ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ์ „์ด ํ•™์Šต ์ƒ˜ํ”Œ์„ ํ™•์ธํ–ˆ์ง€๋งŒ ๊ฒ€์ฆ์—์„œ๋„ ๋ชจ๋ธ์ด ์„ค์ •๋œ ์œ„์น˜๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์—†์—ˆ์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ์—์„œ ๋ชจ๋ธ์ด ์„ค์ •๋œ ์œ„์น˜๋ฅผ ๋ณด์—ฌ ์ฃผ์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ?

์ด ๋ฌธ์ œ๋Š” ์ฝ”๋“œ๊ฐ€ ์•„๋‹Œ pytorch๋กœ ์ธํ•ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
๋‹จ๊ณ„ == 'ํ›ˆ๋ จ'์ธ ๊ฒฝ์šฐ:
scheduler.step()
model.train(True) # ๋ชจ๋ธ์„ ํ›ˆ๋ จ ๋ชจ๋“œ๋กœ ์„ค์ •
๋˜ ๋‹ค๋ฅธ:
model.train(False) # ๋ชจ๋ธ์„ ํ‰๊ฐ€ ๋ชจ๋“œ๋กœ ์„ค์ •
watch -n 1 -d nvidia-smi๋กœ GPU ํ†ต๊ณ„๋ฅผ ์ถ”์ ํ•˜๋ฉด ์ฒซ ๋ฒˆ์งธ ๊ฒ€์ฆ ์—ํฌํฌ ๋•Œ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ›ˆ๋ จ๊ณผ ๊ฒ€์ฆ ๋ชจ๋‘์— ๋Œ€ํ•ด ๊ฒ€์ฆ์„ ์œ„ํ•ด ๋™์ผํ•œ ๋ชจ๋ธ์„ ์–ด๋–ป๊ฒŒ ์„ ํƒํ–ˆ์Šต๋‹ˆ๊นŒ?

2018๋…„ 1์›” 27์ผ ์˜ค์ „ 11์‹œ 44๋ถ„์— "Tommeychang" [email protected] ์ด ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

์ด ๋ฌธ์ œ๋Š” ์ฝ”๋“œ๊ฐ€ ์•„๋‹Œ pytorch๋กœ ์ธํ•ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค
์•„๋ž˜์—:
๋‹จ๊ณ„ == 'ํ›ˆ๋ จ'์ธ ๊ฒฝ์šฐ:
scheduler.step()
model.train(True) # ๋ชจ๋ธ์„ ํ›ˆ๋ จ ๋ชจ๋“œ๋กœ ์„ค์ •
๋˜ ๋‹ค๋ฅธ:
model.train(False) # ๋ชจ๋ธ์„ ํ‰๊ฐ€ ๋ชจ๋“œ๋กœ ์„ค์ •
watch -n 1 -d nvidia-smi๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ GPU ํ†ต๊ณ„๋ฅผ ์ถ”์ ํ•˜๋ฉด
์ฒซ ๋ฒˆ์งธ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ์—ํฌํฌ ๋•Œ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

โ€”
๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ณ  GitHub์—์„œ ํ™•์ธํ•˜์„ธ์š”.
https://github.com/pytorch/pytorch/issues/958#issuecomment-360963591 ,
๋˜๋Š” ์Šค๋ ˆ๋“œ ์Œ์†Œ๊ฑฐ
https://github.com/notifications/unsubscribe-auth/AMHzdBKY_UCQ3QMtnUhdHoahxUx-oG4eks5tOr6ugaJpZM4MW6we
.

๋ชจ๋ธ์˜ ๋ชจ๋“œ๋ฅผ ์„ค์ •ํ•˜์ง€ ์•Š์œผ๋ฉด ์•”์‹œ์  ํ›ˆ๋ จ ๋ชจ๋“œ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ชจ๋“œ ์„ธํŠธ ๋ผ์ธ์€ ํ•„์š”ํ•˜์ง€ ์•Š์ง€๋งŒ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋‹จ๊ณ„์—์„œ ํœ˜๋ฐœ์„ฑ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…์„œ๋ฅผ ๋ณ€์ˆ˜๋กœ ๋ž˜ํ•‘ํ•ฉ๋‹ˆ๋‹ค. ๋‚ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

if phase == 'train':
    scheduler.step()

........

for data in dataloaders[phase]:  ## Iterate over data.

    inputs, labels = data  ## get the inputs

    if use_gpu:  ## pass them into GPU
        inputs = inputs.cuda()
        labels = labels.cuda()

    if phase == 'train':  ## wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)
    else:
        inputs = Variable(inputs, volatile=True)
        labels = Variable(labels, volatile=True)

๊ฐ์‚ฌ ํ•ด์š”. ํ•˜์ง€๋งŒ validation ์ค‘์—๋„ train flag๋ฅผ False๋กœ ์„ค์ •ํ•˜์ง€ ์•Š์œผ๋ฉด BatchNormalization๊ณผ Dropout์ด train/validation ๋‹จ๊ณ„์—์„œ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ ์ ˆํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์ด ๋‘๋ ต์Šต๋‹ˆ๋‹ค.

๊ทธ๋ž˜ ๋‚˜๋„ ๋„ˆ์™€ ๊ฐ™์€ ์ƒ๊ฐ์ด์•ผ. ๊ทธ๋ฆฌ๊ณ  ๋‚˜๋Š” ๊ธฐ์ฐจ ํ”Œ๋ž˜๊ทธ False๋กœ ๋‚ด ๋ชจ๋ธ์„ ํ…Œ์ŠคํŠธํ–ˆ๊ณ  ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์กฐ์–ธ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

๋‚˜๋Š” volatile=True ๋ฅผ ์‹œ๋„ํ–ˆ๊ณ  ๊ทธ๊ฒƒ์€ ๋‚˜๋ฅผ ์œ„ํ•ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. @jekbradbury ๋ฅผ ๊ฐ€๋ฅด์ณ ์ฃผ์…”์„œ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

@TommeyChang @tabibusairam I'm hitting the same error, but in a different case. I'm adding a new regularization term to my model through this function:

def l2_reg(mdl):
    l2_reg = None
    for W in mdl.parameters():
        if W.ndimension() < 2:
            continue
        else:
            if l2_reg is None:
                l2_reg = (torch.max(torch.abs(W)))**2
            else:
                l2_reg = l2_reg + (torch.max(torch.abs(W)))**2

    return l2_reg

๋‚ด๊ฐ€ ๊ด€์ฐฐํ•œ ๊ฒƒ์€ ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ 128์—์„œ 8๋กœ ๋ณ€๊ฒฝํ•˜๋”๋ผ๋„ ์ฒซ ๋ฒˆ์งธ ์—ํฌํฌ ์ดํ›„์— ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  ๋‹จ์ˆœํžˆ ์ •๊ทœํ™”๋ฅผ ๋ณ€๊ฒฝํ•˜๊ณ  l2 ์ •๊ทœํ™”๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค. ์ด ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
๋ชจ๋“  ์ œ์•ˆ/์˜๊ฒฌ์€ ์ •๋ง ๊ฐ์‚ฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค!

@TommeyChang ์ผ๋ฐ˜์ ์œผ๋กœ ์ •๊ทœํ™” ์šฉ์–ด๋ฅผ ๊ตฌ๋ณ„ํ•˜๊ธฐ๋ฅผ ์›ํ•˜๋ฏ€๋กœ (๊ฒฐ๊ตญ ๊ทธ๋ผ๋””์–ธํŠธ ๊ฐ’์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๊ธฐ ๋•Œ๋ฌธ์—) pu๋Š” ์•„๋งˆ๋„ ์ œ์•ˆํ•œ๋Œ€๋กœ ์ˆ˜ํ–‰ํ•˜๊ณ  ์‹ถ์ง€ ์•Š์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

@apaszke ์•ˆ๋…•ํ•˜์„ธ์š”~ ์ €๋„ ๊ฐ™์€ ์งˆ๋ฌธ์„ ๋ฐ›์•˜์ง€๋งŒ ์ฒ˜์Œ์—๋Š” ๋ชจ๋ธ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ํ›ˆ๋ จ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 600๋‹จ๊ณ„ ํ›„์— "RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.cu:58 " ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค. .

ํ›ˆ๋ จํ•˜๋Š” ๋™์•ˆ ๋ฉ”๋ชจ๋ฆฌ ๋น„์šฉ์€ 7G(๋‚ด GPU๋Š” 11G)์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๋‚ด ์˜๊ฒฌ์œผ๋กœ๋Š” ์ฒ˜์Œ์— ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ต์œกํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋‚ด ์ฝ”๋“œ๊ฐ€ ์ •ํ™•ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ๋งž๋‚˜์š”? ์•ž์œผ๋กœ ํ›ˆ๋ จ ๊ณผ์ •์—์„œ ์Œ“์ด๋Š” ๋‹ค๋ฅธ ๊ฒƒ๋“ค์ด ์žˆ์Šต๋‹ˆ๊นŒ? ๋งค์šฐ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค!!

์ผ๋ถ€ ๋ณ€์ˆ˜๊ฐ€ ๋ˆ„์ ๋˜์–ด ๋ชจ๋ธ๋กœ ์ ์  ๋” ๋งŽ์€ ๊ณต๊ฐ„์„ ์ฐจ์ง€ํ•ฉ๋‹ˆ๋‹ค.
๋” ๋งŽ์€ ํ›ˆ๋ จ .. ๊ทธ๋Ÿฌํ•œ ๋ณ€์ˆ˜๋ฅผ ์ฐพ์•„๋ณด๊ณ  ์ €์žฅํ•˜์ง€ ์•Š๋Š”์ง€ ํ™•์ธํ•˜์‹ญ์‹œ์˜ค.
์›์น˜ ์•Š๋Š” ๊ฒƒ๋“ค

2018๋…„ 4์›” 21์ผ ํ† ์š”์ผ ์˜ค์ „ 7์‹œ 35๋ถ„ EricKani [email protected] ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

@apaszke https://github.com/apaszke ์•ˆ๋…•ํ•˜์„ธ์š”~ ์ €๋„ ๊ฐ™์€ ์งˆ๋ฌธ์„ ๋ฐ›์•˜์ง€๋งŒ
์ฒ˜์Œ์—๋Š” ๋ชจ๋ธ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ํ›ˆ๋ จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋งˆ 600๊ฑธ์Œ ํ›„์—, ๋‚˜๋Š”
"RuntimeError: cuda ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜(2): ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ" ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค.
/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.cu:58
".

โ€”
๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ณ  GitHub์—์„œ ํ™•์ธํ•˜์„ธ์š”.
https://github.com/pytorch/pytorch/issues/958#issuecomment-383259455 ,
๋˜๋Š” ์Šค๋ ˆ๋“œ ์Œ์†Œ๊ฑฐ
https://github.com/notifications/unsubscribe-auth/AMHzdN8KuyZIjewB6gkY1MvswGWuF1QMks5tqpPegaJpZM4MW6we
.

@tabibusairam ๋จผ์ € ์ •๋ง ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๋งŽ์€ oom ๋ฌธ์ œ์˜ ์ฃผ์š” ์›์ธ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋‚ด ๋„คํŠธ์›Œํฌ์˜ ๋ฌธ์ œ๋ฅผ ์ฐพ์ง€ ๋ชปํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฒ€์ฆ(์ด๋ฏธ์ง€ ๋ณ€ํ™˜ ๋„คํŠธ์›Œํฌ) ์—†์ด ๋„คํŠธ์›Œํฌ๋ฅผ ํ›ˆ๋ จํ•  ๋•Œ GPU์˜ ๋ฉ”๋ชจ๋ฆฌ๋Š” ํ•ญ์ƒ ์•ˆ์ •์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ๋‹จ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ฒซ ๋ฒˆ์งธ GPU(ํ•ด๋‹น GPU์— ๋Œ€ํ•œ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ)์˜ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋‘ ๋ฒˆ ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด, ์ฒซ ๋ฒˆ์งธ Epoch๊ฐ€ ์‹œ์ž‘๋  ๋•Œ ๋‚ด GPU ๋ฉ”๋ชจ๋ฆฌ๋Š” 7G๋ฅผ ์†Œ๋น„ํ•œ ๋‹ค์Œ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ์‹œ์ž‘๊ณผ ํ•จ๊ป˜ ์ฒซ ๋ฒˆ์งธ Epoch ์ดํ›„์— 9G๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๋ฒˆ์งธ Epoch์— ๋Œ€ํ•œ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ํ›„ ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋น„๋Š” 10G๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ ์ดํ›„๋กœ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์•ˆ์ •ํ™”๋ฉ๋‹ˆ๋‹ค. ๋‚˜ ์—„์ฒญ ํ˜ผ๋ž€์Šค๋Ÿฌ์›Œ...

volatile ๋ณ€์ˆ˜(0.3) ๋˜๋Š” torch.no_grad ์ปจํ…์ŠคํŠธ(๋งˆ์Šคํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ)๋กœ ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๊นŒ?

@apaszke @tabibusairam ์•ˆ๋…•ํ•˜์„ธ์š”, pytorch๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ GP๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ GAN์„ ๋นŒ๋“œํ•  ๋•Œ ์ด ์˜ค๋ฅ˜๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ณ  2์ผ ๋™์•ˆ ์—ฌ๊ธฐ์— ๋ฉˆ์ท„์Šต๋‹ˆ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ด๋ฏธ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์‹œ๋„ํ–ˆ์ง€๋งŒ ๋‘˜ ๋‹ค ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ •๋ง ๋„์›€์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค plz.
์˜ค๋ฅ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
_RuntimeError: cuda ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜(2): xx\torch\lib\thc\generic/THCStorage.cu:66_ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
๋‚ด๊ฐ€ ๊ฑฐ๊พธ๋กœ ํ•  ๋•Œ

_ํŒŒ์ผ "xxx/train_extractor.py", 128ํ–‰, in
gradient_penalty.backward()
ํŒŒ์ผ "xxx\lib\site-packages\torch\autograd\variable.py", 156ํ–‰, ์—ญ๋ฐฉํ–ฅ
torch.autograd.backward(self, gradient,retain_graph,create_graph,retain_variables)
ํŒŒ์ผ "xxx\lib\site-packages\torch\autograd__init__.py", ์ค„ 98, ์—ญ๋ฐฉํ–ฅ
๋ณ€์ˆ˜, grad_variables, ์œ ์ง€_๊ทธ๋ž˜ํ”„)_

๋งค๋ฒˆ ํ›ˆ๋ จ ๊ณผ์ •์˜ 12๋ฒˆ์งธ ์—ํฌํฌ์— ๋ฐœ์ƒํ•˜๋ฉฐ ์ด๋ฏธ batch_size์™€ ๋„คํŠธ์›Œํฌ ํฌ๊ธฐ๋ฅผ ์ค„์˜€์Šต๋‹ˆ๋‹ค.
๊ฒ€์ฆ ์ ˆ์ฐจ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.
๋‹ค์Œ์€ ๋‚ด ์ฝ”๋“œ์˜ ์ž‘์€ ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค.
์•ŒํŒŒ = ํ† ์น˜.rand(conf.batch_size,1).expand(X.size())
x_hat = autograd.Variable(alpha real.data.cpu()+(1-alpha) (real.data.cpu()+0.5 real.data.std() torch.rand(real.size())), require_grad = ์‚ฌ์‹ค)
x_hat = x_hat.cuda() if conf.cuda else x_hat
pred_hat,_ = Dis(x_hat)
๋ ˆ์ด๋ธ” = ํ† ์น˜.ones(pred_hat.size())
label = label.cuda() if conf.cuda else ๋ ˆ์ด๋ธ”
๊ธฐ์šธ๊ธฐ = autograd.grad(์ถœ๋ ฅ = pred_hat, ์ž…๋ ฅ = x_hat, grad_outputs=label, create_graph=True, ์œ ์ง€_๊ทธ๋ž˜ํ”„=True,only_inputs=True)[0]
gradient_penalty = conf.gp_lambda ((gradients.norm(2,dim=1)-1) 2).mean()* gradient_penalty.backward()

๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ 64์—์„œ 32๋กœ ์ค„์ด๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

@lyakaap ๋ณ€์ˆ˜(x, volatile=True). ๋‚˜๋ฅผ ์œ„ํ•œ ์ผ์ด์•ผ.๊ณ ๋งˆ์›Œ.

@์—๋ฆญ์นด๋‹ˆ ,
์•ˆ๋…•ํ•˜์„ธ์š”, ๋‹น์‹ ์€์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐ ํ–ˆ์Šต๋‹ˆ๊นŒ?
๋‚˜๋Š” ๋˜ํ•œ ๊ฐ™์€ ์งˆ๋ฌธ์„ ๋ฐ›๋Š”๋‹ค.
๋ฐฉ๋ฒ•์„ ์•Œ๋ ค์ฃผ์‹ค ์ˆ˜ ์žˆ๋‚˜์š”?

@qlwang25 @EricKani The most likely situation is that gradients are accumulated unintentionally while computing the loss, like this:

loss = criterion(y_, y)
loss.backward()
loss_meter += loss  # incorrect
# loss_meter += loss.item()  # correct
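
As a side note, on 0.3 and earlier (where tensors have no .item()), the same detached accumulation is usually written by indexing into .data, continuing the snippet above:

running_loss = 0.0
# ...
running_loss += loss.data[0]  # pulls out a plain Python number, so no graph is kept alive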

@lyakaap
First of all, thank you very much.
I am already writing it the way you describe.
GPU memory consumption increases after each validation batch, so the next training epoch hits the error:

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    main()
  File "train.py", line 263, in main
    train(i)
  File "train.py", line 152, in train
    loss, num_total, num_correct = model.train_model(src, src_len, src_sent_len, tgt, tgt_len, optim)
  File "/home/wangqianlong/model/bytecup/models/seq2seq.py", line 110, in train_model
    loss.backward()
  File "/home/wangqianlong/.local/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangqianlong/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

์ผ๋ฐ˜์ ์ธ ์ฝ”๋“œ ํ๋ฆ„:

def train_model(self, data):
    outputs = self(data)
    loss = self.criterion(outputs, y)
    loss.backward()
    optim.step()
    return loss

def sample(self, data):
    src, src_len = data
    with torch.no_grad():
        bos = torch.ones(src.size(0)).long().fill_(dict.BOS)
        if self.use_cuda:
            src = src.cuda()
            src_len = src_len.cuda()
            bos = bos.cuda()

        contexts = other_function(src, src_len)
        samples = self.decoder.sample([bos], contexts)
        return samples

def train(i):
    model.train()
    global train_dataloader
    for data in train_dataloader:
        model.zero_grad()
        loss = model.train_model(data)

        count_loss += loss.item()
        if ...:
            # not important
            print(count_loss)

def eval(i):
    model.eval()
    for batch in eval_dataloader:
        samples = model.sample(batch)
        print(samples)

def main():
    global train_dataloader
    for i in range(epoch):
        train_dataloader = load(data_(i%9)) 
        train(i)

        eval(i)

trainset์€ ๋น„๊ต์  ์ปค์„œ 8๊ฐœ๋กœ ๋‚˜๋ˆ„์—ˆ์Šต๋‹ˆ๋‹ค(data_0, data_1, ...., data_8).
๋‹น์‹ ์€ ๋‚˜์—๊ฒŒ ๋ช‡ ๊ฐ€์ง€ ์ œ์•ˆ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๊นŒ?
๋งค์šฐ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

@qlwang25 ๊ท€ํ•˜์˜ ์ฝ”๋“œ๋ฅผ ํ™•์ธํ–ˆ์ง€๋งŒ ์–ด๋–ค ๋ถ€๋ถ„์ด ์ž˜๋ชป๋œ ๊ฒƒ์ธ์ง€ ์•Œ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.
๋‘ ๊ฐ€์ง€ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค.

  1. model.zero_grad() ๋Œ€์‹  optimizer.zero_grad() ์‚ฌ์šฉ
  2. GPU์˜ ์ผ๋ถ€ ๋ณ€์ˆ˜์—๋Š” ์˜๊ตฌ ์ฐธ์กฐ๊ฐ€ ์žˆ์œผ๋ฏ€๋กœ ์ด๋Ÿฌํ•œ ๋ณ€์ˆ˜๋Š” GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ•ด์ œํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ƒ˜ํ”Œ()์„ ๊ฒ€ํ† ํ•˜๋Š” ๊ฒƒ์€ ์–ด๋–ป์Šต๋‹ˆ๊นŒ?

@lyakaap
First of all, thanks for answering so quickly.
I can understand your first point.
But "these variables never release GPU memory" confuses me.
Which variables? Could you give me an example?
And how do I release such variables? Is torch.cuda.empty_cache() useful?

@qlwang25
I don't know which variables they are, but src and bos are likely candidates.
AFAIK, torch.cuda.empty_cache() doesn't release variables that are still referenced. You should find the variables causing it and write del {var_name} before calling this function.
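
A minimal sketch of that suggestion (on 0.4+), with src and bos standing in for whichever tensors turn out to be keeping GPU memory alive:

import torch

src = torch.randn(64, 100).cuda()   # stand-ins for tensors that are still referenced
bos = torch.ones(64).long().cuda()

# ... use them ...

del src, bos                # drop the Python references first
torch.cuda.empty_cache()    # then return the now-unreferenced cached blocks to the driver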

@lyakaap
Thank you very much!
I've taken note of your suggestion.
Thanks again for your answer.

@ladyrick
Thank you very much!
I've taken note of your suggestion.
Thanks again for your answer.

@lyakaap ๋ง์”€ ํ•˜์‹œ๋Š” ๊ฒƒ ๊ฐ™์€๋ฐ์š”?
ใ…‹

๋‚˜๋Š” volatile์„ ์‹œ๋„ํ–ˆ์ง€๋งŒ ์ž‘๋™ํ•˜์ง€ ์•Š์•˜๋‹ค(๋‚˜์ค‘์— ๋‚ด๊ฐ€ pytroch 1.01์— ์žˆ๊ณ  "UserWarning: volatile์ด ์ œ๊ฑฐ๋˜์—ˆ์œผ๋ฉฐ ์ง€๊ธˆ์€ ํšจ๊ณผ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋Œ€์‹  with torch.no_grad(): ๋ฅผ ์‚ฌ์šฉํ•˜์‹ญ์‹œ์˜ค.")
๊ทธ๋Ÿฌ๋‚˜ ๊ฐ„๋‹จํ•œ ๋‹ค์‹œ ์‹œ์ž‘์œผ๋กœ๋„ ๋ฌธ์ œ๊ฐ€ ํ•ด๊ฒฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค....

๋‚˜๋Š” ๊ฐ™์€ ๋ฌธ์ œ๊ฐ€ ์žˆ์–ด์„œ ๋‚ด ๋ชจ๋ธ์ด ํ›ˆ๋ จํ•  ์ˆ˜ ์žˆ๋Š” ํ•œ ์ตœ์†Œ๋กœ ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ ์ค„์ด๋ ค๊ณ  ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ •ํ™•๋„ ์ธก๋ฉด์—์„œ ๊ท ํ˜•์„ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ur epoch, learning rate, training sample์„ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋ฌธ์ œ๋Š” ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ž‘์€ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์„ ํƒํ•œ ๋‹ค์Œ ๊ฑฐ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ž…๋ ฅํ•˜์—ฌ ํ…Œ์ŠคํŠธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ๊ฒ€์ฆ์—์„œ ํ›ˆ๋ จ๋˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ๊ฒ€์ฆ ์ค‘ ๊ณ„์‚ฐ ๊ทธ๋ž˜ํ”„๋Š” ๊ธฐ์ฐจ์—์„œ์™€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์œ ํšจ์„ฑ ๊ฒ€์‚ฌ ์ค‘์— GPU ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋ณด๋ ค๋ฉด - nvidia-smi ๋ช…๋ น์„ ์‚ฌ์šฉํ•˜์‹ญ์‹œ์˜ค. ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ ์ค„์—ฌ ๋ณด์‹ญ์‹œ์˜ค(๋‹จ์ผ GPU์—์„œ๋งŒ ์ž‘์—…ํ•˜๋Š” ๊ฒฝ์šฐ). ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์„์ˆ˜๋ก ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ ์‚ฌํ•ญ์ด ์ ์Šต๋‹ˆ๋‹ค.
โ€ฆ
2018๋…„ 1์›” 14์ผ ์ผ์š”์ผ ์˜ค์ „ 11:17, Chenrui Zhang @ . * > ์ผ๋‹ค: @tabibusairam https://github.com/tabibusairam ๋‚˜๋„ ๊ฐ™์€ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค: ํ›ˆ๋ จ ๊ณผ์ •์€ ์ž˜ ์ž‘๋™ํ–ˆ์ง€๋งŒ(6G cuda ๋ฉ”๋ชจ๋ฆฌ์™€ ๋‚ด GPU์—๋Š” 12G ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์žˆ์Œ) ๋™์ผํ•œ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ต๊ณผํ•˜๋Š” ํ‰๊ฐ€ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์˜ค๋ฅ˜ ์ •๋ณด: THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ Traceback(๊ฐ€์žฅ ์ตœ๊ทผ ํ˜ธ์ถœ ๋งˆ์ง€๋ง‰ ): ํŒŒ์ผ "evaluate.py", 132ํ–‰,ํ‰๊ฐ€(prednet, args) ํŒŒ์ผ "evaluate.py", 94ํ–‰, ํ‰๊ฐ€ ์˜ˆ์ธก = prednet(X_test, initial_states) ํŒŒ์ผ "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site- packages/torch/nn/modules/module.py", ๋ผ์ธ 224, __call__ result = self.forward( input, * kwargs) ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", ๋ผ์ธ 497, ์ˆœ๋ฐฉํ–ฅ ์ถœ๋ ฅ์—์„œ โ€‹โ€‹hidden_states = self.step(A0, hidden_states) ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", ์ค„ 377, ๋‹จ๊ณ„์—์„œ forget_gate = hard_sigmoid(self.conv_layers[' f'][lay](inputs)) ํŒŒ์ผ "/home/zcrwind/workspace/ijcai2018/predict/zcrPredNet/prednet.py", 28ํ–‰, hard_sigmoid x = F.threshold(-x, 0, 0) ํŒŒ์ผ " /home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/functional.py", ๋ผ์ธ 459, ์ž„๊ณ„๊ฐ’ ๋ฐ˜ํ™˜ _functions.thnn.Threshold.apply(์ž…๋ ฅ, ์ž„๊ณ„๊ฐ’ , ๊ฐ’, ์ธํ”Œ๋ ˆ์ด์Šค) ํŒŒ์ผ "/home/zcrwind/.conda/envs/condapython3.6/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto. py", 174ํ–‰, ์•ž์œผ๋กœ getattr(ctx._backend, update_output.name)(ctx._backend.library_state, input, output, *args) RuntimeError: cuda ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜(2): /opt/conda/์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCStorage.cu:66 ์šด๋™ํ•˜์…จ๋‚˜์š”? ๊ฐ์‚ฌ ํ•ด์š”. โ€” ๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ฑฐ๋‚˜ GitHub < #958 (comment) >์—์„œ ํ™•์ธํ•˜๊ฑฐ๋‚˜ https://github.com/notifications/unsubscribe-auth/AMHzdCQ_jJ9ogDm1jaNSLB6wCbfP08XOks5tKZT8gaJpZM4MW6we ์Šค๋ ˆ๋“œ๋ฅผ ์Œ์†Œ๊ฑฐํ•˜์‹ญ์‹œ์˜ค.

์ด๊ฒƒ์€ ํšจ๊ณผ๊ฐ€ ์žˆ์—ˆ๋‹ค. ์ •๋ง ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰