Pytorch: RuntimeError: DataLoader worker is killed by signal: Killed.

Created on 28 Jun 2018 · 20 Comments · Source: pytorch/pytorch

Issue description

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/zonstlab0/c3e7052f-24ed-4743-8506-fb7b8c6f0ba7/zonstlab0/myluo/Diagnosis/main_PVC.py", line 161, in
train(epoch)
File "/media/zonstlab0/c3e7052f-24ed-4743-8506-fb7b8c6f0ba7/zonstlab0/myluo/Diagnosis/main_PVC.py", line 104, in train
for batch_idx, (data, label) in enumerate(train_loader):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 280, in __next__
idx, batch = self._get_batch()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
return self.data_queue.get()
File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()
File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 178, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4161) is killed by signal: Killed.

Code

https://github.com/Lmy0217/MedicalImaging/blob/pve/main_PVC.py#L79

System Info

PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: TITAN Xp
Nvidia driver version: 384.130
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.2
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.14.3)
[pip3] pytorchviz (0.0.1)
[pip3] torch (0.4.0)
[pip3] torchfile (0.1.0)
[pip3] torchvision (0.2.1)
[conda] Could not collect


All 20 comments

Your data loader worker process was killed by a signal.

@SsnL what should I do?

@Lmy0217 You can try running with num_workers=0 and see if it gives you a better error (as it doesn't use subprocesses).
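
For anyone following along, a minimal sketch of that debugging step (train_dataset and the batch size here are placeholders, not taken from the linked script):

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, so the real exception
# from __getitem__ surfaces instead of a worker being killed silently.
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=0)
for batch_idx, (data, label) in enumerate(train_loader):
    pass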

@SsnL I set num_workers=0 and then there are no errors.

Then something in your dataset __getitem__ doesn't like multiprocessing. What's in there?

@SsnL But when I set num_workers=1, my code only works sometimes.

It seems there isn't an isolated bug here, but rather something in the user's Dataset that doesn't like multiprocessing. Closing the issue and directing it to https://discuss.pytorch.org

I've encountered the same problem recently.

If you're using Docker to run the PyTorch program, it's very likely that the shared memory allocated to the Docker container is not big enough to run your program with the specified batch size.

The solutions in this circumstance are:

  1. Use a smaller batch size to train your model.
  2. Exit the current container and re-run it with "--shm-size=16g" or a larger shared-memory size, depending on your machine (a quick way to check the available shared memory from inside the container is sketched below).

Hope this helps those who have the same problem. :+1:
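
As a quick sanity check (my own addition, not part of the comment above), the shared-memory mount that DataLoader worker processes rely on can be inspected from inside the container with a couple of lines of Python:

import shutil

# /dev/shm is the shared-memory mount that worker processes use to pass tensors.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**30:.1f} GiB total, {free / 2**30:.1f} GiB free")

Docker's default allocation is only 64 MB, so a tiny number here is usually the culprit.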

I ran it on the CPU and also met an error: RuntimeError: DataLoader worker (pid 6790) is killed by signal: Killed.

I followed @SsnL's suggestion; it was killed and printed this in the terminal:

Count of instances per bin: [85308 31958]
Test: [ 0/5000] eta: 8:26:32 model_time: 5.8156 (5.8156) evaluator_time: 0.1168 (0.1168) time: 6.0785 data: 0.1460
Killed

Please help me. Thanks a lot.

I met the same problem. Even if I set num_workers=0, training ended unexpectedly with only the message "Killed". I would like to know whether it's because of the dataloader, a memory problem, or something else.

I tested the dataloader alone with num_workers=0, and it was killed unexpectedly after several thousand iterations. It was the MS COCO dataset and the system has 64 GB of memory, so I don't think it should be a memory problem.
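
If it really is a slow leak rather than a shared-memory limit, one way to confirm it (a sketch using the third-party psutil package; coco_dataset is a placeholder for your dataset object) is to watch the resident memory of the main process while iterating the loader alone:

import psutil
from torch.utils.data import DataLoader

loader = DataLoader(coco_dataset, batch_size=16, num_workers=0)  # coco_dataset: your dataset
proc = psutil.Process()  # the current (main) process

for i, batch in enumerate(loader):
    if i % 1000 == 0:
        # Steady growth in RSS points at the dataset/transforms, not the DataLoader itself.
        print(i, proc.memory_info().rss / 2**20, "MiB")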

Same error when testing the model.

Is it fixed, or is it still there?

@SystemErrorWang Same situation. I have more than 252 GB of memory but still get the DataLoader killed. I monitored system memory usage with the htop command, and memory usage was always less than 30 GB when I set num_workers=16. I am using Ubuntu 18.04, not Docker. This definitely should not be a memory problem. (By the way, the PyTorch version is 1.4.0 on Python 3.7.4.)

So... how do we fix it?

Increasing CPU memory and decreasing num_workers from 20 to 10 worked for me. It might also be related to using multiple GPUs with nn.DataParallel; I have only seen this when training big models that need 2-4 GPUs.

Has there been an official fix for this? I'm testing a very simple model on 8 cores and 32 GB of memory, and I'm still getting this error no matter how low I set num_workers (other than 0).

@import-antigravity Could you share the dataset code? This is usually caused by a combination of environment and dataset code.

This is usually caused by a combination of environment and dataset code.

@SsnL
What do you mean? Can you give some minimal examples to illustrate the bad cases?

@SsnL in this case the dataset was just drawing samples from a distribution:

from abc import ABC, abstractmethod

from torch import Tensor
from torch.distributions import Distribution
from torch.utils.data import Dataset

class ProceduralDataset(Dataset, ABC):
    @property
    @abstractmethod
    def distribution(self) -> Distribution:
        pass

    def __init__(self, num_samples: int):
        self._n = num_samples
        self._samples = None

    def __getitem__(self, i):
        # Draw all samples lazily on first access.
        if self._samples is None:
            self._samples = self.distribution.sample((self._n,))
        return self._samples[i], Tensor()

    def __len__(self):
        return self._n

    def __iter__(self):
        self._i = 0
        return self

    def __next__(self):
        # Stop after the last sample instead of indexing past the end.
        if self._i >= self._n:
            raise StopIteration
        self._i += 1
        return self[self._i - 1]
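
Because _samples is filled in lazily, each DataLoader worker process ends up drawing its own independent copy of the data, which multiplies memory use and means different workers return different samples. A minimal way to exercise the class with multiple workers might look like the sketch below (the NormalDataset subclass and the numbers are mine, purely illustrative):

import torch
from torch.distributions import Normal
from torch.utils.data import DataLoader

class NormalDataset(ProceduralDataset):
    # Hypothetical concrete subclass, only to make the example runnable.
    @property
    def distribution(self):
        return Normal(torch.zeros(128), torch.ones(128))

loader = DataLoader(NormalDataset(100000), batch_size=64, num_workers=4)
for batch, _ in loader:
    pass  # each worker lazily materializes its own full copy of the samples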

I've encountered the same problem recently. I am running models locally.

However, I created multiple virtual environments with the virtualenv command. The error might be linked to this, since my computer did not have enough memory to run the dataloader with numerous workers.

When I removed the virtual environments, the error disappeared.

I created the virtual environments like this:

virtualenv ~/env
source ~/env/bin/activate

And removed them like this:

rm -rf env