Pytorch: RuntimeError: DataLoader worker is killed by signal: Killed.

Created on 28 Jun 2018 · 20 Comments · Source: pytorch/pytorch

Issue description

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/zonstlab0/c3e7052f-24ed-4743-8506-fb7b8c6f0ba7/zonstlab0/myluo/Diagnosis/main_PVC.py", line 161, in
train(epoch)
File "/media/zonstlab0/c3e7052f-24ed-4743-8506-fb7b8c6f0ba7/zonstlab0/myluo/Diagnosis/main_PVC.py", line 104, in train
for batch_idx, (data, label) in enumerate(train_loader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 280, in __next__
idx, batch = self._get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
return self.data_queue.get()
File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()
File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 178, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4161) is killed by signal: Killed.

Code

https://github.com/Lmy0217/MedicalImaging/blob/pve/main_PVC.py#L79

System Info

PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: TITAN Xp
Nvidia driver version: 384.130
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.2
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.14.3)
[pip3] pytorchviz (0.0.1)
[pip3] torch (0.4.0)
[pip3] torchfile (0.1.0)
[pip3] torchvision (0.2.1)
[conda] Could not collect


All 20 comments

Your data loader worker process was killed by a signal.

@SsnL what should I do?

@Lmy0217 You can try running with num_workers=0 and see if it gives you a better error (as it doesn't use subprocesses).
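
For anyone following along, a minimal sketch of that debugging step (train_dataset and the batch size here are placeholders, not taken from the linked script):

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, so the real exception
# from __getitem__ surfaces instead of a worker being killed silently.
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=0)
for batch_idx, (data, label) in enumerate(train_loader):
    pass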

@SsnL I set num_workers=0 and then there are no errors.

Then something in your dataset __getitem__ doesn't like multiprocessing. What's in there?

@SsnL But when I set num_workers=1, my code only works sometimes.

It seems there isn't an isolated bug here, but rather something in the user's Dataset that doesn't like multiprocessing. Closing the issue and directing it to https://discuss.pytorch.org

I've encountered the same problem recently.

If you're using Docker to run the PyTorch program, it's very likely that the shared memory allocated to the Docker container is not big enough to run your program with the specified batch size.

The solutions in this circumstance are:

  1. Use a smaller batch size to train your model.
  2. Exit the current container and re-run it with "--shm-size=16g" or a larger shared-memory size, depending on your machine (a quick way to check the available shared memory from inside the container is sketched below).

Hope this helps those who have the same problem. :+1:
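
As a quick sanity check (my own addition, not part of the comment above), the shared-memory mount that DataLoader worker processes rely on can be inspected from inside the container with a couple of lines of Python:

import shutil

# /dev/shm is the shared-memory mount that worker processes use to pass tensors.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")

Docker's default allocation is only 64 MB, so a tiny number here is usually the culprit.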

I ran it on the CPU and also met an error: RuntimeError: DataLoader worker (pid 6790) is killed by signal: Killed.

I followed @SsnL's suggestion; it was killed and printed this in the terminal:

Count of instances per bin: [85308 31958]
Test: [ 0/5000] eta: 8:26:32 model_time: 5.8156 (5.8156) evaluator_time: 0.1168 (0.1168) time: 6.0785 data: 0.1460
Killed

Please help me. Thanks a lot.

I met the same problem. Even if I set num_workers=0, training ended unexpectedly with only the message "Killed". I would like to know whether it's because of the dataloader, a memory problem, or something else.

I tested the dataloader alone with num_workers=0, and it was killed unexpectedly after several thousand iterations. It was the MS COCO dataset and the system has 64 GB of memory, so I don't think it should be a memory problem.
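
If it really is a slow leak rather than a shared-memory limit, one way to confirm it (a sketch using the third-party psutil package; coco_dataset is a placeholder for your dataset object) is to watch the resident memory of the main process while iterating the loader alone:

import psutil
from torch.utils.data import DataLoader

loader = DataLoader(coco_dataset, batch_size=16, num_workers=0)  # coco_dataset: your dataset
proc = psutil.Process()  # the current (main) process

for i, batch in enumerate(loader):
    if i % 1000 == 0:
        # Steady growth in RSS points at the dataset/transforms, not the DataLoader itself.
        print(i, proc.memory_info().rss / 2**20, "MiB")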

Same error when testing the model.

Is it fixed, or is it still there?

@SystemErrorWang Same situation. I have more than 252 GB of memory but still get the DataLoader killed. I monitored system memory usage with the htop command, and memory usage was always less than 30 GB when I set num_workers=16. I am using Ubuntu 18.04, not Docker. This definitely should not be a memory problem. (By the way, the PyTorch version is 1.4.0 on Python 3.7.4.)

So... how do we fix it?

Increasing CPU memory and decreasing num_workers from 20 to 10 worked for me. It might also be related to using multiple GPUs with nn.DataParallel; I have only seen this when training big models that need 2-4 GPUs.

Has there been an official fix for this? I'm testing a very simple model on 8 cores and 32 GB of memory, and I'm still getting this error no matter how low I set num_workers (other than 0).

@import-antigravity Could you share the dataset code? This is usually caused by a combination of environment and dataset code.

This is usually caused by a combination of environment and dataset code.

@SsnL
What do you mean? Can you give some minimal examples to illustrate the bad cases?

@SsnL in this case the dataset was just drawing samples from a distribution:

from abc import ABC, abstractmethod

from torch import Tensor
from torch.distributions import Distribution
from torch.utils.data import Dataset

class ProceduralDataset(Dataset, ABC):
    @property
    @abstractmethod
    def distribution(self) -> Distribution:
        pass

    def __init__(self, num_samples: int):
        self._n = num_samples
        self._samples = None

    def __getitem__(self, i):
        # Draw all samples lazily on first access.
        if self._samples is None:
            self._samples = self.distribution.sample((self._n,))
        return self._samples[i], Tensor()

    def __len__(self):
        return self._n

    def __iter__(self):
        self._i = 0
        return self

    def __next__(self):
        # Stop after the last sample instead of indexing past the end.
        if self._i >= self._n:
            raise StopIteration
        self._i += 1
        return self[self._i - 1]
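
Because _samples is filled in lazily, each DataLoader worker process ends up drawing its own independent copy of the data, which multiplies memory use and means different workers return different samples. A minimal way to exercise the class with multiple workers might look like the sketch below (the NormalDataset subclass and the numbers are mine, purely illustrative):

import torch
from torch.distributions import Normal
from torch.utils.data import DataLoader

class NormalDataset(ProceduralDataset):
    # Hypothetical concrete subclass, only to make the example runnable.
    @property
    def distribution(self):
        return Normal(torch.zeros(128), torch.ones(128))

loader = DataLoader(NormalDataset(100000), batch_size=64, num_workers=4)
for batch, _ in loader:
    pass  # each worker lazily materializes its own full copy of the samples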

I've encountered the same problem recently. I am running models locally.

However, I created multiple virtual environments with the virtualenv command. The error might be linked to this, since my computer did not have enough memory to run the dataloader with numerous workers.

When I removed the virtual environments, the error disappeared.

I created the virtual environments like this:

virtualenv ~/env
source ~/env/bin/activate

And removed them like this:

rm -rf env