Numpy: Error on Azure CI (Windows instance) with numpy 1.19.0

Created on 20 Jul 2020  ·  53 Comments  ·  Source: numpy/numpy

Hello,
I have recently started experiencing problems when running tests for my project on Azure Pipelines with a Windows instance (vmImage: 'windows-2019'). Digging a little deeper (see this conversation: https://developercommunity.visualstudio.com/content/problem/1102472/azure-pipeline-error-with-windows-vm.html?childToView=1119179#comment-1119179) we realised that the problem originated when numpy 1.19.0 was installed instead of numpy 1.18.5. I can see that numpy 1.19.0 was put on PyPI on June 20, which is around the time our tests started to fail. Forcing the environment to install numpy 1.18.5, as in previously successful builds, seems to solve the problem.

I just wanted to report this, as I assume others may have started observing it too (though it is quite hard to pinpoint that numpy is the issue... or at least it looks like it is).

Looking forward to hearing from you,
and happy to make any change to my Azure pipeline setup if that can help troubleshoot the problem.

Error message:

This build works fine with numpy 1.18.5: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=46&view=logs&j=011e1ec8-6569-5e69-4f06-baf193d1351e
A build on the same commit with numpy 1.19.0 fails: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=43&view=results

The error is very cryptic; what I explained above is more relevant, I think. Here it is anyway:

2020-07-06T13:56:01.6879900Z Windows fatal exception: access violation
2020-07-06T13:56:01.6880280Z 
2020-07-06T13:56:01.6880589Z Current thread 0x00001798 (most recent call first):
2020-07-06T13:56:01.6880973Z   File "<__array_function__ internals>", line 6 in vdot
2020-07-06T13:56:05.3412520Z ##[debug]Exit code: -1073741819

All 53 comments

Does it fail consistently or only once in a while? Do you have any windows developers who can try to build the project on a local machine?

Hi,
thanks!

It failed consistently many times. At that point I thought about asking the Azure developers (my initial guess was that perhaps something had changed in their VM setup).

This link has the discussion I had with a Microsoft developer who spotted the problem could have been numpy: https://developercommunity.visualstudio.com/content/problem/1102472/azure-pipeline-error-with-windows-vm.html?childToView=1119179#comment-1119179

Unfortunately I do not have anyone that can try building the project on a local windows machine :(

Then we will need a clear set of steps to reproduce

Would the azure-pipelines.yml work?

Here is what we use (https://github.com/equinor/pylops/blob/master/azure-pipelines.yml) commented out at the moment... you can see that it is a pretty standard setup, using Python 3.7, installing dependencies in requirements-dev.txt file (https://github.com/equinor/pylops/blob/master/requirements-dev.txt) and then running the tests.

As I mentioned already, if I comment this out and force numpy 1.18.5, everything runs; it seems like the new 1.19.0 is what breaks things.

What is the windows version major and minor version of the image running on Azure? i.e., what does systeminfo print for OS Version?

I could find the details of the Azure VMs used in Azure Pipelines here: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml and the link to installed software https://github.com/actions/virtual-environments/blob/master/images/win/Windows2019-Readme.md

I am not sure how to run systeminfo in an Azure pipeline, any suggestions?

It runs from the command line and dumps the output to terminal, so you can add it to your run as a command.

You could do this in a PR that runs on CI to see what it says. I am asking since there have been issues with the 19041 build of Windows and pip NumPy.
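Since systeminfo is Windows-only, a cross-platform alternative (a sketch of my own, not what the thread used) is to log the OS build from Python itself at the start of the CI run:

```python
import platform
import sys

def report_environment():
    """Return a one-line description of the OS build and Python version,
    suitable for printing at the start of a CI job."""
    return f"{platform.platform()} | Python {sys.version.split()[0]}"

print(report_environment())
```

On the Azure Windows agents this prints something like `Windows-10-10.0.17763-SP0`, which carries the same build number systeminfo reports.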

The answer was in the second link:

OS Version: 10.0.17763 Build 1282

So my idea bears no fruit.

You said there are some issues with the latest pip wheels for Windows; could it be connected to that?

It is actually (probably) a Windows bug introduced in 19041. But you are on a much older version so this is not the issue.

It doesn't affect Conda NumPy, only pip NumPy because it seems to be some issue with Windows and OpenBlas.

I see :) I got an email that 1.19.1 has been released. I am going to retrigger the Azure pipeline, which will now install the latest version, and see if that works. Will let you know.

A bug in OpenBlas.

Here is a reproducing example:

import numpy as np

nr = 12000
v = np.random.randn(nr) + 1j * np.random.randn(nr)
np.vdot(v, v)  # access violation
v @ v          # also an access violation

The no symbols debugging information is:

Exception thrown at 0x0000000068DBB8F0 (libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll)
in python.exe: 0xC0000005: Access violation reading location 0x0000000000000000.

Note that the array has to be pretty big (10k passes, 12k does not) to trigger the bug.

Quick check:

$env:OPENBLAS_VERBOSE=2
$env:OPENBLAS_CORETYPE=Prescott

passes, but the default kernel (Zen), as well as Haswell and Sandybridge, all have access violations.

Maybe worth checking that numpy HEAD, which uses a newer OpenBLAS 0.3.10, also fails. Or maybe you already did?

@mattip no I had not tried this yet. You mean installing numpy directly from master with pip install git+https://github.com/numpy/numpy? I can give it a try :)

And to your question @bashtage (Do the failing tests use numba at all? numba 0.50 has a bug on some versions of Windows where it incorrectly makes use of an unavailable intrinsic. This caused crashes for me in another project.), which I got via email but can't seem to see in this thread: the test that crashes uses both numpy and pyfftw operations. As it crashes with this sudden message, it is hard to tell at which line it really fails. But I don't think pyfftw uses numba at all; at least it's not one of their dependencies.

I just tried installing the HEAD of NumPy directly from the GitHub repository, and the Windows build runs to completion with no sudden crash: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=54&view=logs&j=011e1ec8-6569-5e69-4f06-baf193d1351e&t=bf6cf4cf-6432-59cf-d384-6b3bcf32ede2

Interestingly, some libraries that have NumPy as a dependency don't seem to install properly (not sure why) and some tests fail on all OSes, but at least it's not a complete crash as before...

No error using nightly:

pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy

I just tried installing the HEAD of NumPy directly from the GitHub repository

This doesn't have OpenBLAS unless you explicitly build it in. By default you get a slow, generic BLAS with a pip install git+https://github.com/numpy/numpy.git.

Looks like we may want to upgrade OpenBLAS for 1.19.2, so marking this.

I think I might be experiencing the same issue on latest --pre build (numpy-1.20.0.dev0+a0028bc) on Azure:

Current thread 0x000003d0 (most recent call first):
  File "<__array_function__ internals>", line 5 in dot
  File "D:\a\1\s\mne\minimum_norm\inverse.py", line 732 in _assemble_kernel

The line in question is just:

K = np.dot(eigen_leads, trans)

If it helps, I could try saving the arrays to disk and getting them out via an Azure artifact.

That looks like it. You are using the same pre-release build that I had working correctly.

You might want to add

$env:OPENBLAS_VERBOSE=2

or

set OPENBLAS_VERBOSE=2

to your template to know which kernel is being used.
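Note that OPENBLAS_VERBOSE has to be set before OpenBLAS is loaded, i.e. before numpy is imported. One way to guarantee that from a test script (a sketch of my own, not from the thread) is to run the import in a fresh interpreter with the variable in its environment:

```python
import os
import subprocess
import sys

def run_with_openblas_verbose(code="import numpy; print(numpy.__version__)"):
    """Run a snippet in a fresh interpreter with OPENBLAS_VERBOSE=2 so the
    'Core: ...' line OpenBLAS prints at initialisation isn't lost to
    pytest's output capture. Returns combined stdout/stderr."""
    env = dict(os.environ, OPENBLAS_VERBOSE="2")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout + result.stderr
```

With a pip-installed numpy wheel the output should include a `Core: <kernel>` line naming the selected kernel.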

If it helps, I could try saving the arrays to disk and getting them out via an Azure artifact.

It would probably be enough to know the dtypes and dimensions.

Okay, reproduced on a single run of just the failing test with just numpy+scipy+matplotlib+pytest (and deps) that writes the matrices being multiplied and then uploads the artifacts, here is the artifacts tab:

https://dev.azure.com/mne-tools/mne-python/_build/results?buildId=8330&view=artifacts&type=publishedArtifacts

The last .npz should be the failing one (27 MB). Locally on Linux it dots just fine:

>>> import numpy as np
>>> data = np.load('1595525222.9485037.npz')
>>> np.dot(data['a'], data['b']).shape
(23784, 305)
>>> data['a'].shape, data['a'].dtype, data['b'].shape, data['b'].dtype
((23784, 305), dtype('>f4'), (305, 305), dtype('float64'))
>>> data['a'].flags, data['b'].flags
(  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
,   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
)
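Operands with the same shapes, dtypes and memory layout can be reconstructed synthetically; note that `a` is big-endian float32 and Fortran-contiguous while `b` is native float64 and C-contiguous, so the product has to go through a cast/copy path before reaching BLAS. A sketch with random data standing in for the real matrices:

```python
import numpy as np

# Random stand-ins matching the layout of the arrays in the failing .npz:
# 'a' is big-endian float32, Fortran-ordered; 'b' is native float64, C-ordered.
a = np.asfortranarray(np.random.randn(23784, 305).astype(">f4"))
b = np.ascontiguousarray(np.random.randn(305, 305))

assert a.flags.f_contiguous and not a.flags.c_contiguous
assert a.dtype == np.dtype(">f4") and b.dtype == np.float64

out = np.dot(a, b)  # the call that crashed on the Azure agent
print(out.shape)
```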

I'm working on getting the OPENBLAS_VERBOSE output, but it seems like every time I use pytest -s to not capture the output, the test actually passes. This might just be happenstance, though; we'll see...

Funny, I also see it with the reproducer above now.

I don't see it if I set OPENBLAS_CORETYPE to Prescott or Nehalem. I do see it with Zen, Sandybridge and Haswell.

I can't reproduce locally using the data from your npz on Windows.

I can't reproduce locally using the data from your npz on Windows.

FWIW on Azure I can reproduce it with the save-load-round-tripped data, because it now fails on the second-to-last line in the executed code here:

    import mne, os.path as op, time
    fname = op.join(op.dirname(mne.__file__), '..', 'bad', f'{time.time()}.npz')
    np.savez_compressed(fname, a=eigen_leads, b=trans)
    print(eigen_leads.flags)
    print(trans.flags)
    data = np.load(fname)
    np.dot(data['a'], data['b'])  # <-- fails here
    K = np.dot(eigen_leads, trans)   # <-- used to fail here before I added the above lines

So at least nothing is lost at the Azure end due to the np.savez/np.load steps.
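The round trip itself is easy to sanity-check in isolation (a small sketch; the helper name is mine, not from the MNE code):

```python
import os
import tempfile

import numpy as np

def roundtrip_preserves(a, b):
    """Save two operands with np.savez_compressed, reload them, and check
    that values, dtypes and shapes survive the save/load round trip."""
    with tempfile.TemporaryDirectory() as d:
        fname = os.path.join(d, "ops.npz")
        np.savez_compressed(fname, a=a, b=b)
        with np.load(fname) as data:
            return (np.array_equal(a, data["a"]) and a.dtype == data["a"].dtype
                    and np.array_equal(b, data["b"]) and b.dtype == data["b"].dtype)
```

One caveat: np.load returns fresh C-ordered (or F-ordered, as saved) buffers, so memory flags like OWNDATA can differ even when the values are identical, which matches what the Azure run showed.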

I'm trying a run with OPENBLAS_CORETYPE: 'nehalem' to see if helps, though.

So maybe there are actually two different bugs here?

Also, setting OPENBLAS_VERBOSE: 2 doesn't seem to have any effect, not sure why

After setting verbose add a command

python -c "import numpy"

Pytest is probably eating this I would guess.


This command locally even does not give me any verbose output:

OPENBLAS_VERBOSE=2 python -c "import numpy as np, glob; data = np.load(glob.glob('bad/*.npz')[0]); a, b = data['a'], data['b']; print(np.dot(a, b).shape)"

But maybe my system OpenBLAS is too old. I'll push a commit to Azure to get it to run this by itself after it fails.

Looks like OPENBLAS_VERBOSE on Azure says "Core: Haswell". I don't know if that's correct or not, though.

I reported the error in https://github.com/xianyi/OpenBLAS/issues/2732 and they suggested it might be fixed in master, see https://github.com/xianyi/OpenBLAS/issues/2728 . No idea the best way to test this, though.

@mattip Do we know this is closed by MacPython/openblas-libs#35 ? Don't we need to wait until the next weekly is out?

@charris I think this issue is still open, and a backport will likely be needed.

Could someone with a reproducer try to build numpy with this commit to get the latest OpenBLAS binaries? Something like the following (maybe with typos):

git remote add mattip https://github.com/mattip/numpy.git
git fetch mattip issue-16913
git checkout issue-16913
python tools/openblas_support.py
# copy the output openblas.a to a local directory and make sure numpy uses it
mkdir openblas
copy /path/to/openblas.a openblas
set OPENBLAS=openblas
python -c "from tools import openblas_support; openblas_support.make_init('numpy')"
pip install --no-build-isolation --no-use-pep517 .

You should install gfortran with choco install -y mingw if you haven't already.

... this is for windows

You should install gfortran with choco install -y mingw if you haven't already

Is this only required for 32-bit?

https://github.com/numpy/numpy/blob/master/azure-steps-windows.yml#L29-L31

I'll try what you suggest above with a choco install -y mingw once I figure out what the /path/to/openblas.a is -- presumably from running tools/openblas_support.py (?).

Yes, python tools/openblas_support.py prints out where to find openblas.a

You need gfortran. The azure machines have mingw 64-bit installed. If you are 32-bits, the invocation is a bit different. You also need to set -m32 (but only for 32-bit).

I just verbatim copied most of https://github.com/numpy/numpy/blob/master/azure-steps-windows.yml using master branch of NumPy to first reproduce the error, and was successful in having it segfault.

I then switched to mattip/issue-16913 and it fails with a URL download error for:

https://anaconda.org/multibuild-wheels-staging/openblas-libs/v0.3.9-452-g349b722d/download/openblas-v0.3.9-452-g349b722d-win_amd64-gcc_7_1_0.zip

... looks like there is no 32-bit OpenBLAS for 64-bit Windows in:

https://anaconda.org/multibuild-wheels-staging/openblas-libs/files

I guess I could add the tag to get it to use 64-bit OpenBLAS?

2 are there and 1 is still being built. Should be up within the hour.

In the meantime I added:

        NPY_USE_BLAS_ILP64: '1'
        OPENBLAS_SUFFIX: '64_'

And it built just fine. No longer segfaults! I'll re-run it a few times just to be sure. Feel free to ping me when the 32-bit OpenBLAS Win64 libs are up and I can easily remove these lines and re-test.

Any chance you could run the full test suite? :-)

python -c "import numpy; numpy.test('full')"

Looks like the 32 bit ones are up, and that also works.

I'll give the full test suite a run now

You shouldn't waste any more time on this other issue - I can wait until next week and test the weekly build, which will hopefully have the new OpenBLAS.

Note that we can run the nightly builds at anytime by pushing a commit to the master branch.

Ok, I'll wait until I see a new one to see if the issue with Windows 10 2004 is fixed.

@bashtage Any update on this?

OpenBLAS is still broken on the most recent release of Windows. It is very hard to even get good debugging information because of the mixed toolchain, at least for me.
