Numpy: Invalid data and segfault when reading past the size of file with fromfile on Ubuntu 16.04

Created on 31 Oct 2018  ·  4 Comments  ·  Source: numpy/numpy


fromfile returns invalid data, and sometimes segfaults, when reading past the end of a file, i.e. it does not check whether the read will go past the end of the file. This leads to a segfault on Ubuntu 16.04, but does not seem to segfault on OSX.

Reproducing code example:

import numpy as np

def test_read_from_file():
    # create an empty file named `empty.bin`
    filename = 'empty.bin'
    open(filename, 'a').close()

    # read large chunk of data, past the end of the file
    dtype = [('data', '<f4', 500,)]
    count = 100000000

    with open(filename, 'rb') as fh:
        data = np.fromfile(fh, dtype, count)

    print(data.shape)

Error message:


Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Numpy/Python version information:

platform linux -- Python 3.6.6, pytest-3.8.2, py-1.6.0, pluggy-0.7.1

00 - Bug numpy.core

All 4 comments

Just to note, this is reproducible on 1.15.3. I guess we know the size, so this should just raise an error, or read the whole file. If this works silently on some systems, maybe we should put in a release note just in case (I would still say we can just fix it).

EDIT: I would tend to error, just thought whole file might be an option because of indexing, but indexing is a bit special in this regard.

Agreed, raising an error sounds like a good idea. There still is something to be said about partial reads, which could be handled in two ways:

  1. read as many data records as possible until either the end of the file or the count passed to fromfile is reached; we would then need a mechanism for explicitly returning the actual number of records that were read (implicitly this should be visible in the shape of the resulting array). An error can still be raised because this is not the normal usage scenario.
  2. don't allow partial reads, i.e. raise if the size to be read does not fit in the rest of the file.

I don't know which scenario fits better with the numpy philosophy, but the first option sounds more useful.
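
To make the two options concrete, here is a minimal user-side sketch of both behaviours. The helper names (`records_available`, `fromfile_strict`, `fromfile_partial`) are hypothetical and not part of NumPy, and the sketch assumes a regular on-disk file opened in binary mode, so that `fileno()` and `tell()` are meaningful. It is not how the fix inside NumPy's C code works; it only illustrates the semantics being discussed.

```python
import os

import numpy as np


def records_available(fh, dtype):
    """Whole records between the current position and the end of the file.

    Hypothetical helper: assumes `fh` is a regular file with a known size.
    """
    itemsize = np.dtype(dtype).itemsize
    remaining = os.fstat(fh.fileno()).st_size - fh.tell()
    return max(remaining, 0) // itemsize


def fromfile_strict(fh, dtype, count):
    """Option 2: refuse partial reads and raise if `count` does not fit."""
    available = records_available(fh, dtype)
    if count > available:
        raise ValueError(
            f"requested {count} records, but only {available} are available"
        )
    return np.fromfile(fh, dtype=dtype, count=count)


def fromfile_partial(fh, dtype, count):
    """Option 1: read as many records as possible, up to `count`.

    The number of records actually read is visible in the result's shape.
    """
    n = min(count, records_available(fh, dtype))
    if n == 0:
        # Nothing to read; avoid touching the file at all.
        return np.empty(0, dtype=dtype)
    return np.fromfile(fh, dtype=dtype, count=n)
```

With the empty file from the report, `fromfile_strict` would raise a `ValueError` and `fromfile_partial` would return an array of shape `(0,)`. In both cases the requested count is clamped or rejected before any large buffer is allocated.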

I think an error is most reasonable. What I am not sure about right now is whether fromfile supports file-like objects that do not have a known size, or what currently happens when the sep kwarg is non-empty.

@amuresan the code for fromfile is in C, but if you have a bit of time, we are always very happy about pull requests, and it seems like a reasonable difficulty to dabble a bit into the C (Python) API.

I believe the problem here is actually that on ubuntu you are getting a MemoryError that is being handled incorrectly and causing the segfault.

A PR with a fix is here: https://github.com/numpy/numpy/pull/12354
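
As a rough check of why a MemoryError is plausible with the reproducing code, the snippet below works out how much memory the requested read would need. The dtype and count are taken from the report (written here with an explicit `(500,)` shape, which should be equivalent). Whether the allocation actually fails depends on the machine and on OS memory settings, so this is only the arithmetic, not a diagnosis.

```python
import numpy as np

# Same record layout as in the report: 500 little-endian float32 values.
record = np.dtype([('data', '<f4', (500,))])
count = 100_000_000

bytes_per_record = record.itemsize       # 500 * 4 bytes
total_bytes = bytes_per_record * count   # size of the buffer to allocate

print(bytes_per_record, total_bytes / 1e9)  # prints: 2000 200.0
```

So the call asks for a buffer of about 200 GB to read from a zero-byte file, which is consistent with the allocation failing before any data is read.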
