I am not sure if this is a PyTorch bug, a scikit-learn bug or a numba bug, but this used to work in scikit-learn 0.20.3 and stopped working in the 0.21.0 series, so for now I am going to venture a guess that it is a regression in scikit-learn.
When I do the following series of imports (minimized from the original import, which was import librosa), the program fails:
import torch
import soundfile
import scipy.signal
import numba
import sklearn
with
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 44, in <module>
from ._check_build import check_build # noqa
ImportError: dlopen: cannot load any more object with static TLS
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test_torch.py", line 5, in <module>
import sklearn
File "/opt/conda/lib/python3.6/site-packages/sklearn/__init__.py", line 75, in <module>
from . import __check_build
File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 46, in <module>
raise_build_error(e)
File "/opt/conda/lib/python3.6/site-packages/sklearn/__check_build/__init__.py", line 41, in raise_build_error
%s""" % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: dlopen: cannot load any more object with static TLS
___________________________________________________________________________
Contents of /opt/conda/lib/python3.6/site-packages/sklearn/__check_build:
_check_build.cpython-36m-x86_64-linux-gnu.so __pycache__ __init__.py
setup.py
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.
If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.
If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.
Downgrading to scikit-learn 0.20.3 makes the problem go away.
jenkins@260bf77532d0:~/workspace/test$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sklearn; sklearn.show_versions()
System:
python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
executable: /opt/conda/bin/python
machine: Linux-4.15.0-29-generic-x86_64-with-debian-jessie-sid
BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /opt/conda/lib
cblas_libs: mkl_rt, pthread
Python deps:
pip: 19.1.1
setuptools: 41.0.1
sklearn: 0.21.2
numpy: 1.16.4
scipy: 1.1.0
Cython: None
pandas: None
Also, you may be interested in:
jenkins@260bf77532d0:~/workspace/test$ pip list | grep numba
numba 0.43.1
jenkins@260bf77532d0:~/workspace/test$ pip list | grep torch
torch 1.2.0a0+ab800ad
The build of torch must be done with gcc 5.5.0 to cause this problem; other versions of gcc are known not to cause this problem.
For ease of reproduction, you can use the Docker image ezyang/scikit-learn-tls-repro:1 (https://cloud.docker.com/repository/registry-1.docker.io/ezyang/scikit-learn-tls-repro). Once in, follow the reproduction instructions as described above. (EDIT: At time of writing, the Docker image is still uploading. Should be done soon.)
Thanks for the report. How did you build/install sklearn?
pip install scikit-learn
Do you have the log for that? Did it build from source or did you install a wheel?
Collecting scikit-learn
Using cached https://files.pythonhosted.org/packages/85/04/49633f490f726da6e454fddc8e938bbb5bfed2001681118d3814c219b723/scikit_learn-0.21.2-cp36-cp36m-manylinux1_x86_64.whl
@ezyang you may want to share the Dockerfile if that's possible.
If anyone is interested in reproducing this error the right docker incantation to use is something like this:
docker run -it ezyang/scikit-learn-tls-repro:1 bash
Note that you need to specify the tag (i.e. 1) explicitly, otherwise you get a cryptic error message (the 'latest' tag does not exist):
Unable to find image 'ezyang/scikit-learn-tls-repro:latest' locally
docker: Error response from daemon: manifest for ezyang/scikit-learn-tls-repro:latest not found.
I have no idea why this would happen, but I have seen numerous bug reports related to this, e.g. with pytorch and OpenCV (https://github.com/pytorch/pytorch/issues/2083) or OpenCV and Tensorflow (https://github.com/tensorflow/models/issues/523). All in all I would guess that this is not a scikit-learn bug.
The fact that it depends on the import order is fishy. For example, this works in your docker image:
python -c 'import torch; import sklearn; import soundfile; import scipy.signal; import numba'
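To check which import orders fail, each order can be tried in a fresh interpreter. Below is a minimal sketch, not something that was run in this thread: the helper names import_command and try_order are made up for illustration, and the module list is simply the one from the report. Trying all 120 orders is slow because torch is imported each time.

```python
import itertools
import subprocess
import sys

# Modules from the report; any importable names can be substituted.
MODULES = ["torch", "soundfile", "scipy.signal", "numba", "sklearn"]


def import_command(order):
    """Build a `python -c` program that imports modules in the given order."""
    return "; ".join("import {}".format(name) for name in order)


def try_order(order, python=sys.executable):
    """Run the imports in a fresh interpreter.

    Returns (ok, last_stderr_line); the last stderr line is usually the
    exception message when the import chain fails.
    """
    proc = subprocess.run(
        [python, "-c", import_command(order)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    lines = proc.stderr.strip().splitlines()
    return proc.returncode == 0, (lines[-1] if lines else "")


if __name__ == "__main__":
    # A fresh process per order matters: shared objects stay loaded for
    # the lifetime of a process, so orders cannot be compared in one session.
    for order in itertools.permutations(MODULES):
        ok, message = try_order(order)
        if not ok:
            print(" -> ".join(order), "FAILED:", message)
```

Only the failing orders are printed, which should make it easier to see whether the failure always involves the same library being loaded last.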
Note that I tried to reproduce inside a conda environment (inside your docker image for good measure) and could not (scikit-learn 0.21.2 and pytorch 1.1.0), so I guess this could be linked to some changes in the pytorch dev version.
conda create -n test -c pytorch pytorch scikit-learn scipy numba scikit-learn -y
conda activate test
pip install soundfile
python -c 'import torch; import soundfile; import scipy.signal; import numba; import sklearn'
$ conda list
# packages in environment at /opt/conda/envs/test:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
blas 1.0 mkl
ca-certificates 2019.5.15 0
certifi 2019.6.16 py37_1
cffi 1.12.3 py37h2e261b9_0
cudatoolkit 10.0.130 0
intel-openmp 2019.4 243
joblib 0.13.2 py37_0
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
llvmlite 0.29.0 py37hd408876_0
mkl 2019.4 243
mkl-service 2.0.2 py37h7b6447c_0
mkl_fft 1.0.12 py37ha843d7b_0
mkl_random 1.0.2 py37hd81dba3_0
ncurses 6.1 he6710b0_1
ninja 1.9.0 py37hfd86e86_0
numba 0.45.0 py37h962f231_0
numpy 1.16.4 py37h7e9f1db_0
numpy-base 1.16.4 py37hde5b4d6_0
openssl 1.1.1c h7b6447c_1
pip 19.1.1 py37_0
pycparser 2.19 py37_0
python 3.7.3 h0371630_0
pytorch 1.1.0 py3.7_cuda10.0.130_cudnn7.5.1_0 pytorch
readline 7.0 h7b6447c_5
scikit-learn 0.21.2 py37hd81dba3_0
scipy 1.3.0 py37h7c811a0_0
setuptools 41.0.1 py37_0
six 1.12.0 py37_0
soundfile 0.10.2 pypi_0 pypi
sqlite 3.29.0 h7b6447c_0
tk 8.6.8 hbc83047_0
wheel 0.33.4 py37_0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3
I guess it would be helpful to get a bisect on scikit-learn, if the problem reproduces on a dev build.
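For anyone attempting the bisect, a check script along these lines could be handed to git bisect run. This is only a sketch under assumptions: the file name check_tls.py is hypothetical, the import chain is the one from the report, and the checkout under test is assumed to have been rebuilt and installed before each run (that step is not shown).

```python
import subprocess
import sys

# Import chain from the original report.
IMPORTS = (
    "import torch; import soundfile; import scipy.signal; "
    "import numba; import sklearn"
)

# Exit codes understood by `git bisect run`: 0 = good, 1 = bad,
# 125 = this revision cannot be tested and should be skipped.
GOOD, BAD, SKIP = 0, 1, 125


def classify(returncode, stderr):
    """Map the outcome of the import chain to a bisect exit code."""
    if returncode == 0:
        return GOOD
    if "static TLS" in stderr:
        return BAD
    # Any other failure (missing module, broken build) is not the bug
    # being bisected, so the revision is skipped rather than marked bad.
    return SKIP


def main():
    proc = subprocess.run(
        [sys.executable, "-c", IMPORTS],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return classify(proc.returncode, proc.stderr)


if __name__ == "__main__":
    sys.exit(main())
```

Skipping on unrelated failures keeps a broken intermediate build from being blamed for the regression.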
Generally speaking, my feeling is that the expertise on this kind of problem is on the PyTorch side. Personally, I had never heard about static TLS before, and I would guess this is the case for many other core scikit-learn devs, although I could be wrong about that.
IIUC you originally saw the problem with scikit-learn 0.21.2 and a pytorch dev version. I cannot reproduce the problem with scikit-learn 0.21.2 and pytorch 1.1.0, as noted in https://github.com/scikit-learn/scikit-learn/issues/14485#issuecomment-517195977. If I were to try to understand this in more detail, I would bisect on PyTorch.
The issue @ezyang linked has a bunch of information on this TLS (thread-local storage) issue.
Here's some info I dug up before: https://github.com/pytorch/pytorch/issues/2575#issuecomment-369892859
TL;DR: Something in the chain of imports was not compiled (C/C++) with the -fPIC flag. Importing that library causes a problem that turns all imports to "static TLS". There is a maximum number of such "static TLS" slots (the names I use here are surely incorrect). The exact number of slots depends on the OS and how it was compiled.
In the linked pytorch issue 2575, there is a mention that it was OpenMP, compiled without the flag, that caused the cascade.
This scikit-learn issue might be due to some new library being introduced, or some change, eating just a few more static TLS slots.
Note: Not a real expert. There might be other sources for this error than "one/some lib missing the -fPIC flag when it was compiled". Haven't found one though.
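One way to look for candidate libraries is to list the shared objects loaded in the process and ask readelf which of them carry the STATIC_TLS flag in their dynamic section. The sketch below assumes Linux (/proc/self/maps) and that binutils' readelf is installed. The helper names are made up for illustration, and treating the flag as the deciding factor is an assumption, not something confirmed in this thread.

```python
import re
import subprocess


def shared_objects(maps_text):
    """Return the sorted set of .so paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if re.search(r"\.so(\.\d+)*$", path):
                paths.add(path)
    return sorted(paths)


def uses_static_tls(readelf_dynamic_output):
    """True if `readelf -d` output lists STATIC_TLS in the FLAGS entry."""
    for line in readelf_dynamic_output.splitlines():
        if "(FLAGS)" in line and "STATIC_TLS" in line:
            return True
    return False


if __name__ == "__main__":
    # Import the modules of interest before this point (e.g. import torch),
    # so that their shared objects show up in the process map.
    with open("/proc/self/maps") as handle:
        loaded = shared_objects(handle.read())
    for path in loaded:
        result = subprocess.run(
            ["readelf", "-d", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if uses_static_tls(result.stdout):
            print(path)
```

Running this once after importing only torch and once after importing only sklearn would show which libraries each side contributes.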
Have there been any updates on this? I'm hitting this issue as well, also when importing librosa.
check https://github.com/pytorch/pytorch/issues/2575#issuecomment-523657178
I solved it by importing sklearn first, then tensorflow. The import order causes this error.