Numpy: npyio.loadtxt is bytes-casting text file input, even with str dtype specified.

Created on 8 Nov 2012  ·  11 Comments  ·  Source: numpy/numpy

Environment

  • Python: Version 3.3 (python.org release) on OS X Mountain Lion
  • numpy: Cloned from git master

When calling numpy.loadtxt on a file containing strings as follows:

import numpy as np
datestxt = np.loadtxt("NYSE_dates.txt", dtype=str)
print(datestxt)

Where NYSE_dates.txt is simply a list of dates (could be anything really):

7/5/1962
7/6/1962
7/9/1962
...
12/29/2020
12/30/2020
12/31/2020

Output is:

["b'7/5/1962'" "b'7/6/1962'" "b'7/9/1962'" ..., "b'12/29/2020'"
 "b'12/30/2020'" "b'12/31/2020'"]

As you can see, every string has been cast to bytes and then stringified through conv: you get the same result from str(str('12/31/2020').encode('latin1')), per conv and compat.asbytes.
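To make the mechanism concrete, here is a minimal plain-Python sketch of that round trip. It does not call into NumPy at all; the variable names are only for illustration.

```python
# Plain-Python illustration of the bytes round trip; no NumPy involved.
text = "12/31/2020"

as_bytes = text.encode("latin1")      # a bytes object
round_tripped = str(as_bytes)         # str() of bytes gives its repr, prefix and quotes included
decoded = as_bytes.decode("latin1")   # decoding recovers the original text

print(round_tripped)  # b'12/31/2020'
print(decoded)        # 12/31/2020
```

Calling str() on a bytes object returns its repr rather than decoding it, which is why the b'...' wrapper ends up inside the resulting strings.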

After looking at the code, it appears that strings are cast to bytes with asbytes(...) pretty much throughout, for example in split_line(...), so other routines in the module are likely affected as well.

00 - Bug numpy.lib

All 11 comments

I also have that issue. This is very very annoying; basically you can't use loadtxt in Python 3.

Temporary solution: I removed all asbytes() calls in the loadtxt method.

Yeah, I remember thinking something was fishy in there when I looked through the code.

For the record, I am running into the same issue with datetime64 inputs, leading to a parsing error of the form: Error parsing datetime string "b'2013-01-02'". To work around this, I had to create a converter for that column:

def decoder(input_bytes):
    return input_bytes.decode("ascii")

This would be fine in production code but is highly non-pretty for training material...
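For readers who want to try this workaround, here is a rough sketch of how such a converter might be passed to loadtxt. The in-memory file, the single-column layout, and the isinstance check are assumptions added for illustration. The check is there because, depending on the NumPy version and the encoding setting, the converter may receive either bytes or str.

```python
import io

import numpy as np


def decoder(value):
    # Hypothetical variant of the converter above: decode bytes, pass str through.
    if isinstance(value, bytes):
        return value.decode("ascii")
    return value


# Assumed sample data: one ISO-formatted date per line.
f = io.StringIO("2013-01-02\n2013-01-03\n")

# Apply the converter to column 0 before the value is parsed as a datetime64.
dates = np.loadtxt(f, dtype="datetime64[D]", converters={0: decoder})
print(dates)
```

The converter only has to hand loadtxt a plain str; parsing into datetime64 is still left to NumPy.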

Pushing off to 1.11.

Workaround: run iconv on the file first.

Pushing off to 1.12.

I see that this keeps being pushed back, but I think it is a bug that should be addressed, and a fix seems easy to implement.

Pretty shocking that this hasn't been fixed for 5 years

It looks as though this is working as desired in NumPy 1.13.3 (though I'm not sure which PR fixed it). Can this issue be closed?

>>> import io
>>> import numpy as np
>>> f = io.StringIO("7/5/1962\n7/6/1962\n")
>>> np.loadtxt(f, dtype=str)
array(['7/5/1962', '7/6/1962'],
      dtype='<U8')
>>> np.__version__
'1.13.3'

Looks like this was fixed in #8349, in response to #8033.

Closing. Please reopen if needed.