Pytorch: CPU memory gradually leaks when num_workers > 0 in the DataLoader

Created on 29 Oct 2018  ·  79 Comments  ·  Source: pytorch/pytorch

Editor note: There is a known workaround further down in this issue, which is to NOT use Python lists, but instead to use something else, e.g., a numpy array or a tensor directly.

🐛 Bug

CPU memory will leak if the DataLoader num_workers > 0.

To Reproduce

Run the following snippet:

from torch.utils.data import Dataset, DataLoader
from PIL import Image
from torchvision import transforms
import os

class DataIter(Dataset):
    def __init__(self):
        path = "path/to/data"
        self.data = []

        for cls in os.listdir(path):
            for img in os.listdir(os.path.join(path, cls)):
                self.data.append(os.path.join(path, cls, img))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        with Image.open(self.data[idx]) as img:
            img = img.convert('RGB')
            return transforms.functional.to_tensor(img)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in enumerate(train_loader):
    if i % 200 == 0:
        print(i)

Expected behavior

CPU memory will gradually start increasing, eventually filling up the whole RAM. E.g., the process starts with around 15GB and fills up the whole 128GB available on the system.
When num_workers=0, RAM usage is constant.

Environment

PyTorch version: 1.0.0.dev20181028
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
GPU 2: GeForce GTX 1080 Ti

Nvidia driver version: 390.67
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.4

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

PIL.__version__
'5.3.0'

Additional info

There are around 24 million images in the dataset and all image paths are loaded into a single list as presented in the above code snippet.

I have also tried multiple PyTorch versions (0.4.0 and 0.4.1) and the effect is the same.

cc @ezyang @gchanan @zou3519 @SsnL

Labels: high priority, dataloader, memory usage, molly-guard, multiprocessing, triaged

Most helpful comment

After some more investigation, I have found an exact scenario when the leak occurs. Consider the code example below:

from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch


class DataIter(Dataset):
    def __init__(self):
        self.data_np = np.array([x for x in range(24000000)])
        self.data = [x for x in range(24000000)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([data], dtype=np.int64)
        return torch.tensor(data)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in enumerate(train_loader):
    if i % 1000 == 0:
        print(i)

If we use the self.data variable, which is a standard Python list of ints, the memory leak will occur. However, if the self.data_np variable is used, which holds the same data in the form of a NumPy array, the leak will not occur.
Another observation is that the leakage is significantly less severe with shuffle=False in the DataLoader.

All 79 comments

Do you see memory usage increasing when iterating, or before you even start to iterate?

@SsnL During the iteration only.

When we fix #13243 we should check if this one gets fixed too.

I've been experiencing something similar where memory usage continuously climbs until an OOM is triggered when using a batch_sampler with num_workers > 0.

To Reproduce

import math

from torch.utils.data import DataLoader


class Sampler:
    def __init__(self, n=100000, batch_size=32):
        self.n = n
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(float(self.n)/self.batch_size)

    def __iter__(self):
        batch = []
        for i in range(self.n):
            batch.append(i)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


N = 100000000
train_data = list(range(N))


def ok():
    train_sampler = Sampler(len(train_data))
    train_loader = DataLoader(train_data,
                              num_workers=0,
                              batch_sampler=train_sampler)

    for i, item in enumerate(train_loader):
        if i % 10000 == 0:
            print(i)


def leaky():
    train_sampler = Sampler(len(train_data))
    train_loader = DataLoader(train_data,
                              num_workers=8,
                              batch_sampler=train_sampler)

    for i, item in enumerate(train_loader):
        if i % 10000 == 0:
            print(i)


print('Starting ok')
ok()
print('ok done, starting leaky()')
leaky()
print('leaky done')

Environment

$ python3 collect_env.py
Collecting environment information...
PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 9.1.85

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.5
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce GTX 1050 Ti with Max-Q Design
Nvidia driver version: 390.77
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.2
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

@ezyang

When we fix #13243 we should check if this one gets fixed too.

The issue is still present in 1.0.0.dev20181105, where #13243 is fixed.

After some more investigation, I have found an exact scenario when the leak occurs. Consider the code example below:

from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch


class DataIter(Dataset):
    def __init__(self):
        self.data_np = np.array([x for x in range(24000000)])
        self.data = [x for x in range(24000000)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        data = np.array([data], dtype=np.int64)
        return torch.tensor(data)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in enumerate(train_loader):
    if i % 1000 == 0:
        print(i)

If we use the self.data variable, which is a standard Python list of ints, the memory leak will occur. However, if the self.data_np variable is used, which holds the same data in the form of a NumPy array, the leak will not occur.
Another observation is that the leakage is significantly less severe with shuffle=False in the DataLoader.

I face a similar issue, but in my case it occurs with a numpy array too. I am using Python 3.7 and a PyTorch nightly release.

I don't know how multiprocessing really works under the hood of pytorch, but we have extensively discussed this "Memory Leak" issue (which probably isn't a memory leak!) on the fast.ai forums (https://forums.fast.ai/t/runtimeerror-dataloader-worker-is-killed-by-signal/31277/55?u=marcmuc). Preliminary findings which hopefully add some insight here (if this does NOT apply, please comment!):

Python Multiprocessing: There is no way of storing arbitrary Python objects (even simple lists) in shared memory in Python without triggering copy-on-write behaviour due to the addition of refcounts every time something reads from these objects. The refcounts are written memory-page by memory-page, which is why the consumption grows slowly. The processes (workers) will end up having all/most of the memory copied over bit by bit, which is why we get the memory overflow problem. The best description of this behavior is here (SO).

Possible Solution:
Using multiprocessing like now: in order for Python multiprocessing to work without these refcount effects, the objects have to be made "compatible with" and wrapped in multiprocessing.Array before the process pool is created and workers are forked. This supposedly ensures that the memory will really be shared and no copy-on-write happens. This explains how to do it for numpy arrays and this explains the reasoning behind it again. Don't be confused by some false statements, even by the authors of these otherwise good answers, claiming that copy-on-write makes all of this unnecessary; that is not true. One comment also points to this:

“Just to note, on Python fork() actually means copy on access (because just accessing the object will change its ref-count).”

I am not familiar with the torch.multiprocessing drop-in replacement that I understand pytorch uses, but I would assume it will also not be able to remove the core python refcount issue.
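
For illustration, here is a minimal sketch (my own code, not from this thread; it assumes the default fork start method on Linux) of the multiprocessing.Array idea above: allocate one flat shared buffer before the DataLoader forks its workers, so reads never touch per-item Python refcounts.

import multiprocessing as mp
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SharedArrayDataset(Dataset):
    def __init__(self, n=1_000_000):
        shared = mp.Array('q', n, lock=False)               # raw shared int64 buffer
        self.data = np.frombuffer(shared, dtype=np.int64)   # numpy view onto it, no copy
        self.data[:] = np.arange(n)                         # fill once, before workers fork

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Indexing reads one flat buffer; no per-item refcount is touched,
        # so the workers do not accumulate copy-on-write page duplicates.
        return torch.tensor([self.data[idx]])

loader = DataLoader(SharedArrayDataset(), batch_size=300, shuffle=True, num_workers=4)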

@mprostock torch.multiprocessing is simply Python multiprocessing, with a custom pickler. The custom pickler, whenever it encounters a torch.tensor, will automatically move it to shared memory, and hence at least for torch.tensor objects, no copy-on-write happens.

Thanks for the explanation! I have experimented with @bfreskura 's reproduction example and I think I can now pinpoint the problem:

The reproduction example by bfreskura above showed the difference between a regular Python list and a numpy array. But the problem is not (only) the Python list itself; the same happens with a numpy array of type object. Python lists store only references to the objects; the objects are kept separately in memory. Every object has a refcount, therefore every item in the list has a refcount.

Numpy arrays (of standard np types) are stored as contiguous blocks in memory and are only ONE object with one refcount.

This changes if you make the numpy array explicitly of type object, which makes it start behaving like a regular python list (only storing references to (string) objects). The same "problems" with memory consumption now appear.

This would explain why, with regular lists (or numpy arrays of type object), we see the "memory leak", which actually is the copy-on-access problem of forked Python processes due to changing refcounts, not a memory leak.

So the problem probably (often) has got nothing to do with tensors or actual torch objects, but rather with the lists of filenames and dicts of labels, that are generally used within dataloaders/datasets.

I have created a notebook gist, if someone wants to quickly try it.
Look at the memory consumption below (a quick and dirty measurement of total system memory, so there are minor influences from other processes; I tried to keep the system clean).

Memory consumption in GB with fixed-length string array:
[chart in original issue]

Memory consumption in GB with object array (only change!):
[chart in original issue]
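
As a quick illustration of the dtype distinction described above (my own snippet, not from the notebook gist):

import numpy as np

paths = ['data/cls0/img0.jpg', 'data/cls0/img1.jpg']

fixed = np.array(paths)                 # fixed-width unicode dtype: one contiguous block, a single refcounted object
boxed = np.array(paths, dtype=object)   # object dtype: stores pointers to N separate Python strings, N refcounts

print(fixed.dtype, boxed.dtype)         # e.g. <U18 object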

I am facing the same issue. It fills up my RAM very fast if num_workers > 0.
I am deleting the variables which I feel are no longer needed in my code and also calling gc.collect() on every iteration, but nothing helps.
Any workarounds?

Switching from dicts to pandas and from lists to numpy arrays helped me.

I am facing the same issue. It fills up my RAM very fast if num_workers > 0.
I am deleting the variables which I feel are no longer needed in my code and also calling gc.collect() on every iteration, but nothing helps.
Any workarounds?

Thanks for the reply. I will try that and hopefully, it works.

May I ask what the solution to this issue is? I tried @samgd's code on the latest nightly PyTorch build, and it was still leaking.

@Godricly See @mprostock and @soumith 's comments above. This is not really a leak, but an unfortunate consequence of using native Python lists. Using either a torch tensor or an np array will solve this memory problem.

@mprostock Do you mean that it is the copies created by copy-on-access that use up the memory, not something else? And aren't those copies released after use?

Someone needs to step up and write a proper augmentation op for image datasets at least. The whole reason for all of these multiprocessing shenanigans is because vision datasets _have_ to decode and crop images on multiple cores. If there was an op that took care of decoding and geometric image transforms (resize, crop flip, shear, affine), and produced batch tensors directly, there would be no need to use multiprocessing at all, and further, non-geometric augmentation steps (colors, whitening/normalization, noise) could use intra-op parallelism to rip through the entire tensor. Care needs to be taken when designing such an op to expose transform parameters for each sample in the tensor to the outside, in order to enable parallel transformation of annotations (bounding boxes, masks, keypoints, etc).
Or better yet, make this a server, so that multiple processes (as well as other DL frameworks) could use it as well.

@mprostock thank you for the great explanation!

However, no solution has been proposed yet. Storing lists of filenames in a Dataset object seems fair, so how can one use them? Has anyone figured it out?

@1e100 I believe @fmassa is working on adding more native image augmentation operations to torchvision, which should help with this problem.

Any update to this problem?

Using lots of shared memory solved the problem for me. Here's a hack to set shared memory inside the script in case you happen to be running the code inside a docker container and are unable to set the shared memory otherwise:

os.system(f"mount -o remount,size={args.shared_memory_size} /dev/shm")

The shared memory size could be, e.g., half of total RAM, say `80G` for a big machine.

I found a workaround for the `unable to open shared memory object </torch_22291_1137042840> in read-write mode` error associated with this problem by changing the number of file descriptors allowed, although memory does _still_ creep up to a certain point.

To check your allowed number of file descriptors, enter ulimit -a into bash and the value is listed under the -n flag. To raise this limit for your current shell (i.e., if you don't have permissions on the server), run the following:
BASH: ulimit -n NEW_VALUE

To change it for the whole system, see here.

So if I understand correctly, the worker processes create a copy of the lengthy list of file paths each time they access the list? But then doesn't this temporary copy go out of scope (and is consequently destroyed) as soon as the __getitem__ function for that process returns? Why does the RAM consumption increase without bounds?

It'd be nice if someone made a short guide with some best practices on how to avoid this issue. With numerical values it's easy to replace Python lists with NumPy arrays, but it's not exactly clear how to mitigate the problem with (variable sized) strings.

In my case, I have a list of custom class objects that's created/populated in the constructor. It essentially just contains sets of file paths. Then inside __getitem__ I load those images, do some preprocessing, convert to torch Tensors and then explicitly call del on the loaded images before returning. The problem is that adding some additional, seemingly harmless pre-processing steps introduces this out-of-bounds memory usage problem.

Py 3.8's mp.shared_memory may provide a nice enough workaround for sharing many non-tensor/nparray objects, e.g., shared list: https://docs.python.org/3.8/library/multiprocessing.shared_memory.html#multiprocessing.shared_memory.ShareableList. :)

Disclaimer: I didn't actually read the docs closely.
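
For what it's worth, a tiny sketch of how ShareableList could be used for a list of paths (Python 3.8+; untested by me against a real DataLoader, and item sizes are fixed at creation time):

from multiprocessing.shared_memory import ShareableList

paths = ['data/cls0/img0.jpg', 'data/cls0/img1.jpg']
shared_paths = ShareableList(paths)   # lives in a shared memory block, not in per-worker copies

print(shared_paths[0])                # workers index it like a normal list

shared_paths.shm.close()              # clean up when done
shared_paths.shm.unlink()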

Is there anything actionable we can do here? Do we have enough image transforms supported in torchvision to document moving some use cases over to that?

Just to clarify the point here: implementing what @1e100 proposed is something we have on the torchvision roadmap, but it's not at the top of our list and would probably require nested tensor support first.

That being said, this would not be a general fix for this issue: it would just bypass the need for multiprocessing in the data loading by using a different approach (e.g., transforms on the GPU).

cc @cpuhrsch as I saw someone mention nested tensor. (By the way, @cpuhrsch, can you make a module label for nested tensor and add yourself on it at https://github.com/pytorch/pytorch/issues/24422 ?)

Why has this bug not been solved in a year?

@IMLHF See the first line of this issue description or the discussion above. This is not really a leak but an unfortunate design of Python, which is out of our hands. Both PyTorch and NumPy have been trying to work around this by implementing custom serialization for tensors and ndarrays. Yet we can't really account for general data structures. This issue is open because we are implementing more utilities for users to work around it.

Adding torch.cuda.empty_cache() at the end of each iteration helps me with this problem. The memory usage fluctuates instead of accumulating after adding this.

perhaps we should add a warning.

@VitalyFedyunin do you have bandwidth to look at this? At a minimum, can we figure out if this is the same issue as https://github.com/pytorch/pytorch/issues/17499?

I think I solved this issue by using ndarrays instead of tensors in my project.

My code before was:

def df2var(x):
    return (torch.LongTensor(token2id(x['Query'], max_char = max_length_char)), 
            torch.tensor(coll2id[x['Agg_Coll']], dtype = torch.long))

class Making_Dataset(Dataset):
    def __init__(self, input_dataframe):
        self.dataset = input_dataframe.apply(lambda x : df2var(x), axis = 1)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, data_index):
        return self.dataset[data_index]

And I reworked the code as:

class Making_Dataset(Dataset):
    def __init__(self, input_dataframe):
        self.text = np.array([token2id(q, max_char = max_length_char) for q in input_dataframe.Query])
        self.labels = np.array([coll2id[coll] for coll in input_dataframe.Agg_Coll])

    def __len__(self):
        return len(self.text)

    def __getitem__(self, data_index):
        return self.text[data_index], self.labels[data_index]

After fixing the code, the memory increase in each epoch was gone in my project.
Since I don't know exactly what is causing this problem, any comments about this are welcome!

I'm seeing a similar problem with Torch 1.3.0 with CUDA 10 on Ubuntu 18.04. This was not a problem on an AWS machine with 64GB RAM, but on a local machine with 128GB RAM and 128GB swap, I can't even get through 150 epochs - the memory usage just keeps creeping from a few GB (expected) to 128+ GB.

Update

My issue was an insidious little bug - while recording training statistics, I was saving gradient information in addition to the pure values, which is both unnecessary and adds to your memory footprint on each epoch.

Py 3.8's mp.shared_memory may provide a nice enough workaround for sharing many non-tensor/nparray objects, e.g., shared list
https://github.com/pytorch/pytorch/issues/13246#issuecomment-513480017

Using lots of shared memory solved the problem for me
https://github.com/pytorch/pytorch/issues/13246#issuecomment-487042977

Hi, a bit late to this topic, but I face the same issue with dictionaries.
Has anyone had success with these hacks?

This is still a valid issue. Could someone provide a list of best practices for using multiple workers in DataLoaders without causing memory leaks?

@marrrcin I think the best bet is to regard tensors as expensive, and so you should be cautious with how much you use them, especially if there's a chance they'll have gradient information.

For instance, it's probably worth storing everything as lists or numpy.ndarrays until you need to do torch operations

@AudreyBeard thanks for the reply. In my Dataset code I have everything stored as numpy arrays/lists/strings/ints and the only part where I use tensors is __getitem__ and later in collate_fn (applying padding). Should I create tensors with requires_grad set to False there? Once my code goes to num_workers > 0, memory starts to leak.

Sorry for poor formatting, I am on mobile.

@marrrcin I usually only cast my input data (the image or signal or whatever) as a tensor in __getitem__. My labels and such are usually returned as lists. I'm not sure what kind of data you're using or if you're doing any special kind of padding, but I usually use torchvision.transforms in my __getitem__. For what it's worth, I very rarely implement a custom collate_fn.

A thought: I originally posted here because I was experiencing what I thought was a memory leak. It turned out I was hanging on to unnecessary data every epoch, and that had the symptoms a leak when it was really just very subtle variable management on my part. It took me a while to figure out exactly what was going on.

@AudreyBeard my case is not related to images / torchvision, I've used padding for tokens extracted from variable length texts, that's why I need to use collate_fn.

class PaddingCollateFn:
    def __call__(self, batch):
        sorted_batch = sorted(batch, key=lambda x: x[0].shape[0], reverse=True)
        sequences = [x[0] for x in sorted_batch]
        sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)

        attention_masks = [torch.tensor([1 for _ in x[0]]) for x in sorted_batch]

        attention_masks_padded = torch.nn.utils.rnn.pad_sequence(
            attention_masks, batch_first=True
        )
        lengths = torch.tensor([len(x) for x in sequences])
        labels = torch.tensor([x[1] for x in sorted_batch])

        return (sequences_padded, lengths, attention_masks_padded), labels

Should I drop original tensors after padding them (for example using del)? I thought that once the collate_fn finishes, they will be out of scope and removed as there will be no references to them.

I hit this in PyTorch version 1.3.1 when training on ImageNet. Does anyone have some ideas?
In my case I use num_workers=24; it normally costs about 110GB of memory in epoch 1, but when training the second epoch it consumes all of my memory and the system kills the DataLoader workers. I really don't know why.

For me the issue was that I was already converting numpy arrays to torch tensors in the dataloader __getitem__

Numpy arrays should only be converted to torch tensors in the trainer loop, just before being sent to the model. Otherwise the tensors will make the shared memory grow out of bounds.

You can monitor the shared memory by running the command watch -n .3 df -h
The shared memory corresponds to the line /dev/shm
The used amount should not increase after each epoch.

I have the same problem.

This bug is not solved in PyTorch 1.4.0.

I also have the same problem

I too face the same problem, even after:
1) deleting all unnecessary variables
2) using numpy arrays instead of lists
3) using gc.collect()

@annukkaa and others: just using np.array(list_of_paths) is not enough, as it stores the list of strings as lots of objects. Use np.array(list_of_paths).astype(np.string_) to cast the array into a square-shaped byte array (and make sure to convert from bytes to str when actually using the string). That should help. Also set the shared memory to a high value, say, 100GB.

I haven't seen it mentioned explicitly in this thread so I thought I'd share my solution.
In my case I had a custom class object and a list of strings in my dataset which was accessed every iteration and quickly depleted my CPU memory.
By wrapping the class and the list with the multiprocessing Manager object, which handles shared state, I was able to eliminate the memory leak.

To tie in with the minimal example the code would look like this.

from torch.utils.data import Dataset, DataLoader
import torch
from multiprocessing import Manager


class DataIter(Dataset):
    def __init__(self):
        manager = Manager()
        self.data = manager.list([x for x in range(24000000)])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data = self.data[idx]
        return torch.tensor(data)


train_data = DataIter()
train_loader = DataLoader(train_data, batch_size=300,
                          shuffle=True,
                          drop_last=True,
                          pin_memory=False,
                          num_workers=18)

for i, item in enumerate(train_loader):
    if i % 1000 == 0:
        print(i)

There is some overhead since the objects are pickled, but it is a good alternative to having the memory explode.

Is this thing ever going to be fixed?

It seems the issue is still open.

Using ndarrays also does not help. It bumps up the CPU RAM to approximately 4 times what it is with zero workers.
Tried del, yet no significant improvement.

Hi All,

I have just tried a solution for this and this works like an absolute beauty.

In my case I am storing ImageNet data as numpy arrays saved locally.

I have written my custom dataset as:

import torch
from torch.utils import data
import numpy as np

class DataSetBuilder(data.Dataset):
    """TinyImagenet dataset."""

    def __init__(self, rootpath, train=True, transform=None):
        """
        Args:
            rootpath (string): Directory containing the .npy data files.
            train (bool): Whether to load the training or the test split.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.path = rootpath
        self.transform = transform
        self.train = train
        # Load input data
        if self.train:
            self.X_ = np.load(self.path + 'x_train.npy')
        else:
            self.X_ = np.load(self.path + 'x_test.npy')
        # Load target data
        if self.train:
            self.y_ = np.load(self.path + 'y_train.npy')
        else:
            self.y_ = np.load(self.path + 'y_test.npy')

    def __len__(self):
        # Length from the already-loaded array (no need to reload the file).
        return self.X_.shape[0]

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        X = self.X_[idx, :, :, :]
        y = self.y_[idx]
        if self.transform is not None:
            X = self.transform(X)
        return X, torch.from_numpy(y).type(torch.LongTensor)

Instead of loading data in __getitem__, I am loading it at object-construction time, which means it does not load the same numpy arrays into memory every time, but rather loads them once when the object is created.

Hope this helps!

Leave comment if this works for you... :-)

Hi @varinder-singh,
Nice that you found a solution. I don't see how this is different from the numpy example given by @bfreskura earlier? Your __getitem__ also slices data from a numpy array.
Maybe I'm reading the code wrong; would you mind clarifying why they would affect memory consumption differently?

After having this issue in my current project, and reading through this thread, I feel it might be useful to add my own thoughts and provide a somewhat versatile solution.

First things first:

1) Considering all that I can observe here, @mprostock's diagnosis is correct. Your work saved me a lot of time digging by myself.
2) Of course @soumith's response is also correct, but it is not applicable in this case due to the reasons stated in @mprostock's later post on object arrays.

This is not a PyTorch problem. It is a Python problem and thus should be solved there. But since the problem is caused by reference counting, which is an integral part of Python's memory management, this might not happen any time soon. Some of the workarounds proposed above are interesting, but why go to such lengths? Assuming the task is to jointly access a number of variable-length sequences such as filenames, you do not need to invent anything new. Simply use numpy to pack the sequences and perform an indirect lookup. To understand what I mean, see the code below, which completely avoids the problem discussed in this thread.

@mprostock and @smolendawid Strings are essentially just sequences of integers, which is a type that can be handled easily in numpy. The below example is tailored to share any list of strings (e.g. filenames of images) among multiple data-loaders.
@marrrcin You asked for a best practice. This is robust and works for any list of variable-length sequences. In my current project I use a slightly more sophisticated variant of this for multi-dimensional data where each dimension has a variable length.
@SsnL This implicitly solves the problem you discuss with @zhiweifang in /issues/20433 without using fancy Python 3.8 constructs.

import numpy as np
import torch
from typing import Union

# --- UTILITY FUNCTIONS ---
def string_to_sequence(s: str, dtype=np.int32) -> np.ndarray:
    return np.array([ord(c) for c in s], dtype=dtype)

def sequence_to_string(seq: np.ndarray) -> str:
    return ''.join([chr(c) for c in seq])

def pack_sequences(seqs: Union[np.ndarray, list]) -> (np.ndarray, np.ndarray):
    values = np.concatenate(seqs, axis=0)
    offsets = np.cumsum([len(s) for s in seqs])
    return values, offsets

def unpack_sequence(values: np.ndarray, offsets: np.ndarray, index: int) -> np.ndarray:
    off1 = offsets[index]
    if index > 0:
        off0 = offsets[index - 1]
    elif index == 0:
        off0 = 0
    else:
        raise ValueError(index)
    return values[off0:off1]


# --- OUR DATASET CODE STARTS HERE ---
class MyDataset(torch.utils.data.Dataset):

    def __init__(self):
        strings = [
            'I like', # You can use np.int8 for ASCII strings.
            'chocolate',
            '我喜欢', # If you use anything that is not standard ASCII,
            '巧克力', # need to use np.int16, or even np.int32.
        ]

        # Convert each string to sequence of codepoints (integer),
        # and then pack them into a numpy array.
        seqs = [string_to_sequence(s) for s in strings]
        self.strings_v, self.strings_o = pack_sequences(seqs)

    def __len__(self): return 4

    def __getitem__(self, i):
        # Use indirect lookup to fetch the i-th sequence. This only uses integer numpy
        # array lookups, which avoids that the objects are subsequently replicated by
        # child processes.
        seq = unpack_sequence(self.strings_v, self.strings_o, i)
        string = sequence_to_string(seq)
        # ACTION NEEDED: You probably do not want to return the string itself ;-).
        return string


m = MyDataset()
for i in range(len(m)):
    print(i, '=', m[i])

# Output
# -------
# 0 = I like
# 1 = chocolate
# 2 = 我喜欢
# 3 = 巧克力

I was able to solve this problem and would like to share my 2 cents. I basically followed the ideas pointed out by @harpone and others that strings were the problem. I had 2 problematic arguments in my Dataset class:

  1. a numpy array of strings (casting using .astype(str) didn't help)
  2. a dictionary from strings to numpy vectors

I had to fix both 1 and 2 to stop the memory leak. For 1, my strings are actually hashes to access the numpy vectors in the dictionary, so I converted all the strings to integers since I had a fixed size dictionary.

For 2, I converted the dictionary to use integer keys; however, the memory leak still persisted. What actually worked was not passing the dictionary to the Dataset class at all, but just returning the integer key in __getitem__ and doing the dictionary indexing / promotion to PyTorch tensor / promotion to GPU in my train loop.
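
A hypothetical sketch of that approach (names such as id_to_vec are mine, not from the comment): the Dataset returns only an integer key, and the dictionary lookup plus tensor/GPU promotion happens in the training loop.

import numpy as np
import torch
from torch.utils.data import Dataset

class KeyOnlyDataset(Dataset):
    def __init__(self, keys):
        # Flat int64 array: one refcounted object instead of millions.
        self.keys = np.asarray(keys, dtype=np.int64)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        return int(self.keys[idx])

# In the training loop (main process), where id_to_vec is the big dict:
# for keys in loader:
#     batch = torch.as_tensor(np.stack([id_to_vec[int(k)] for k in keys])).cuda()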

Any way to get the dataloader processes to re-initialize themselves every epoch and clean up all of that leaked memory?

@Pozimek they already reinitialize every epoch.

So what is the best practice NOW?

@wangchust: the solution proposed by @bashimao has been working beautifully for me, even on moderately large (25M+ text sequences) datasets.

@wangchust: the solution proposed by @bashimao has been working beautifully for me, even on moderately large (25M+ text sequences) datasets.

Me too. The solution of @bashimao works very well.

Hello everyone, I'm here again. Has anyone hit "OverflowError: cannot serialize a bytes object larger than 4 GiB" when DataLoader workers fork from the main process?

Hello everyone, I'm here again. Has anyone hit "OverflowError: cannot serialize a bytes object larger than 4 GiB" when DataLoader workers fork from the main process?

@wangchust If you are serializing, you are probably doing something wrong. Each process will deserialize the 4 GiB (or however large your object is) and reconstruct the serialized objects. Hence, you will replicate memory and eventually run out of it if there are many parallel processes. The entire point of the measures proposed by myself and others in this thread is to avoid replicating memory. As said in my first sentence, I believe you are doing something wrong, probably at a pretty fundamental level.

Custom tensor-backed string array seems to help https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57:

import torch

class TensorBackedImmutableStringArray:
    def __init__(self, strings, encoding = 'utf-8'):
        encoded = [torch.ByteTensor(torch.ByteStorage.from_buffer(s.encode(encoding))) for s in strings]
        self.cumlen = torch.cat((torch.zeros(1, dtype = torch.int64), torch.as_tensor(list(map(len, encoded)), dtype = torch.int64).cumsum(dim = 0)))
        self.data = torch.cat(encoded)
        self.encoding = encoding

    def __getitem__(self, i):
        return bytes(self.data[self.cumlen[i] : self.cumlen[i + 1]]).decode(self.encoding)

    def __len__(self):
        return len(self.cumlen) - 1

    def __list__(self):
        return [self[i] for i in range(len(self))]

Maybe something like this is even worth including in core PyTorch.
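
For reference, usage would look something like this (my own example, assuming the class above; expected output in comments):

paths = TensorBackedImmutableStringArray(['data/img0.jpg', 'data/img1.jpg'])
print(len(paths))   # 2
print(paths[1])     # data/img1.jpg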

Anyone have any luck getting dictionaries to work and not leak?
I saw this post above, but I'd like access to some type of hash table within the worker instead of doing the work outside of it as that comment suggests.

I'm considering one of the following:

  • multiprocessing Manager dict
  • shared memory or mmap
  • homegrown numpy-based hash table without any pyobjects.

Shared memory looks like the most promising and Python-native option. I'm curious why you are using a dict? The common pattern here is to have a list of items (usually strings) and index them.

For me it was originally a list of dicts (a list of example metadata, every example was a dict)

Got it. Generally dicts make it even harder, as the memory access pattern is not sequential. I'm thinking about adding fork-safe data structure support (not sure whether built in or in a separate repo).

Shared memory looks like the most promising and Python-native option. I'm curious why you are using a dict? The common pattern here is to have a list of items (usually strings) and index them.

@VitalyFedyunin Thanks for the tip. I may try the shared memory first then.
The reason for a dict right now is O(1) lookup of elements for a random sampling function in the data generation step. More specifically, "triplet mining", where the dict is keyed on user_id and the values are a list of positive examples associated with that user. See here for an example.

@marrrcin I usually only cast my input data (the image or signal or whatever) as a tensor in __getitem__. My labels and such are usually returned as lists. I'm not sure what kind of data you're using or if you're doing any special kind of padding, but I usually use torchvision.transforms in my __getitem__. For what it's worth, I very rarely implement a custom collate_fn.

A thought: I originally posted here because I was experiencing what I thought was a memory leak. It turned out I was hanging on to unnecessary data every epoch, and that had the symptoms a leak when it was really just very subtle variable management on my part. It took me a while to figure out exactly what was going on.

@AudreyBeard thanks, this was helpful and resolved my problem.

Something I'm curious about is (1) why shuffle impacts the memory consumption so much and (2) why the total memory usage seems to be much more than number of processes * size of the data attribute.

In the example by @bfreskura the size of self.data is 24e7 integers, which is roughly 1.83GB. If we bring that down to 24e5 (so the script can quickly run to completion), then the size of the data object is roughly 18.92MB.

In the Python list case, with shuffle=False, I measure that the process consumes 298.17 MB. With shuffle=True, I measure that the process consumes 1.44 GB.

So, over 18 worker + 1 main parent process, even if all of the data is copied to every process, that should only be at most 359.48 additional MB of RAM. How is it the case that when shuffle=True I'm getting almost 4 times that amount? I imagine it must have to do with sequential vs random memory access and the resulting page faults, but I'm curious if anyone can more precisely describe what's going on here.

For reference, my modifications (fire CLI + memory consumption reporting) to @bfreskura's script are here:

https://gist.github.com/Erotemic/3f017de31529dc64c1a54948f37da1d5

Random access will force Python to write object refcounts back into memory, causing copy-on-write of memory pages. Sequential access can potentially be optimized by not writing unchanged counters (likely depends on GC cycles). Also, it is much safer (so far) to estimate by peak usage, which is number of workers * (total object sizes + number of objects * Python object pointer+refcount size). We are currently working on a solution to prevent the full memory copy, but it requires significant re-architecture and will take time.
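
Plugging the original 24M-int repro into that formula as a back-of-envelope check (my own figures, rough CPython 64-bit assumptions, purely illustrative):

num_workers = 18
num_objects = 24_000_000        # items in the Python list
obj_size    = 28                # approx. bytes per small CPython int object
ptr_size    = 8                 # list slot pointer on a 64-bit build

peak = num_workers * num_objects * (obj_size + ptr_size)
print(f'{peak / 1e9:.1f} GB')   # ~15.6 GB of pages that can end up copied across workers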

@VitalyFedyunin thanks for the explanation, but I'm afraid I'm not quite getting it yet :smile:

I've managed to resolve the above issue by using a numpy array instead of a list, and e.g. a np.string_ type square numpy byte array, but now I'm facing an apparently similar issue with webdataset (https://github.com/tmbdev/webdataset/issues/24#issuecomment-709101119). I'm apparently not running out of shm, but as @tmbdev pointed out earlier in the webdataset thread, the problem could be the _number_ of shared memory segments...

Do you have any tips on how to debug this issue and/or some temporary hacks around it? I've tried ipcs but that doesn't show anything useful for me (I think). lsof /dev/shm shows some info on shm objects and sizes, but I'm not quite sure what they mean...

For me, measuring proportional set size (pss in psutil) helped to gauge the size of the problem. I worked around it with custom StringArray and DictArray classes: https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 that pack strings/dicts into flat tensors (no NumPy used).
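
A minimal sketch of that kind of measurement with psutil (my own snippet; Linux only, summing PSS over the main process and its DataLoader worker children):

import os
import psutil

def total_pss_gb():
    main = psutil.Process(os.getpid())
    procs = [main] + main.children(recursive=True)   # DataLoader workers are child processes
    return sum(p.memory_full_info().pss for p in procs) / 1e9

# Call this periodically inside the training loop and watch whether it keeps growing.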

@wangchust: the solution proposed by @bashimao has been working beautifully for me, even on moderately large (25M+ text sequences) datasets.

Sorry maybe I'm missing something about github usage but I see no solution from @bashimao on this thread, just a comment. Could anyone point me to it please?

Much simpler to just cast to np.string_ (note NOT str etc). Say you have

strings = ['hello', 'world']

then do

strings_byte = np.array(strings).astype(np.string_)

The result will then be a single square byte array (note the dtype):

array([b'hello', b'world'], dtype='|S5')

You then have to decode back to a string when reading an element from that, e.g. str(strings_byte[0], encoding='utf-8').

Note that this would not work:

strings_byte = np.array(strings).astype(str)

note the dtype:

array(['hello', 'world'], dtype='<U5')

That's not a square byte array i.e. not a single object.

Given the persistence of this issue and the number of times I or my colleagues have bumped into it, it would be helpful to have some recipe for determining whether or not this is the cause. Having read this thread quite thoroughly, it seems there are good suggestions for mitigating the issue (https://github.com/pytorch/pytorch/issues/13246#issuecomment-436632186, https://github.com/pytorch/pytorch/issues/13246#issuecomment-612396143), but also some confusing behaviour (https://github.com/pytorch/pytorch/issues/13246#issuecomment-708067670).

  1. Is it sufficient to run the data loader alone in a while True loop and observe memory usage to rule out this issue (a rough check is sketched after this list)? I presume if the memory grows as the loop runs then we can conclude either that our dataset object has some pathological behaviour which accumulates objects or that we are running into this issue?
  2. The thing I really can't understand from this thread is why this is an issue if you have dataset classes that only hold a few MB of data. If I understand correctly, the issue outlined here is that any Python objects that are accessed by the dataset will eventually be copied over to the memory of a worker process. If I have a simple dataset class which loads videos from a list of paths stored as a field in the dataset class, why would that be an issue? The shared data is so small as to be negligible. Why does the use of shuffle=True in the dataloader result in such higher memory usage as outlined in https://github.com/pytorch/pytorch/issues/13246#issuecomment-708067670?
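
Regarding question 1, here is a rough diagnostic along those lines (my own sketch; the dataset and sizes are stand-ins for your own, and RSS double-counts pages shared between processes, so the signal to watch is growth over time rather than the absolute number):

import os
import psutil
import torch
from torch.utils.data import Dataset, DataLoader

class ListDataset(Dataset):                 # stand-in for your own Dataset
    def __init__(self, n=2_400_000):
        self.data = list(range(n))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return torch.tensor(self.data[idx])

loader = DataLoader(ListDataset(), batch_size=300, shuffle=True, num_workers=4)
proc = psutil.Process(os.getpid())

for epoch in range(3):
    for i, batch in enumerate(loader):      # iterate the DataLoader alone, no model involved
        if i % 500 == 0:
            procs = [proc] + proc.children(recursive=True)   # includes live worker processes
            rss = sum(p.memory_info().rss for p in procs)
            print(f'epoch {epoch} iter {i}: total RSS {rss / 1e9:.2f} GB')

# Steady growth points either at this copy-on-access behaviour or at state being
# accumulated inside the Dataset itself; constant usage suggests neither.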

A solution that works for me - https://t.me/snakers4/2577

A solution that works for me - https://t.me/snakers4/2577

This is nice! I guess the only advantage of my method in https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 is that tensor-packed objects can be shared between DDP workers with DDP primitives (i.e. we read a giant dataset object only in one thread, then scatter the tensor-packed dataset object to other DDP ranks). In the same way, DDP master worker can gather some tensor-packed string arrays from the DDP ranks.

Another real-world occurence of this bug: https://github.com/NVIDIA/NeMo/issues/1467
