Numpy: Tracker issue for BLIS support in NumPy

Created on 2 Mar 2016 · 97 Comments · Source: numpy/numpy

Here's a general tracker issue for discussion of BLIS support in NumPy. So far we've had some discussions of this in two nominally unrelated threads:

So this is a new issue to consolidate future discussions like this :-)

Some currently outstanding issues to highlight:

CC: @tkelman @matthew-brett @fgvanzee

Most helpful comment

Building BLIS on Windows with clang.exe targeting the MSVC ABI is certainly possible. I spent a few hours and here are the changes. The number of changes required was surprisingly low compared to OpenBLAS.
Log is here. All BLIS and BLAS tests pass.

All 97 comments

I see that BLIS has been included in site.cfg.
Can libFLAME also be supported?

The description at https://www.cs.utexas.edu/%7Eflame/web/libFLAME.html may need fixing, but from that it's not clear to me that libFLAME actually implements the LAPACK API.

@rgommers libflame does indeed provide netlib LAPACK APIs, and implementations for everything that libflame does not provide natively.

FWIW, BLIS has made significant strides since this issue was opened. For example, BLIS now implements runtime configuration, and its configure-time configuration has been reimplemented in terms of the runtime infrastructure. BLIS now offers integrated BLAS test drivers in addition to its more comprehensive testsuite. Library self-initialization is also in place, as is monolithic header generation (a single blis.h instead of 500+ development headers), which makes management of the installed product easier. It also follows a more standardized library naming convention for its static and shared library builds, and includes an soname. Finally, its build system is a lot smarter vis-a-vis checking for compiler and assembler compatibility. And that's just what I can think of off the top of my head.

@fgvanzee thanks. In that case I'm +1 to add support for it in numpy.distutils.

I'm really short on time at the moment, so cannot work on this probably till September at least. If someone else wants to tackle this, it should be relatively straightforward (along the same lines as gh-7294). Happy to help troubleshoot / review.

That sounds like major progress. How far would you say you are from being a viable alternative to OpenBLAS for numpy/scipy (performance- and stability-wise)?

(Note that we still don't have a scipy-openblas on Windows for conda due to the Fortran mess: https://github.com/conda-forge/scipy-feedstock/blob/master/recipe/meta.yaml#L14)

If you want a MSVC ABI compatible BLAS and LAPACK, there still aren't any easy, entirely open source options. Though nowadays with clang-cl and flang existing, the problem isn't compiler availability like it used to be, now it's build system flexibility and trying to use combinations that library authors have never evaluated or supported before. Ref https://github.com/flame/blis/issues/57#issuecomment-284614032

@rgommers I would say that BLIS is presently a quite viable alternative to OpenBLAS. It is viable enough that AMD has abandoned ACML and fully embraced BLIS as the foundation of their new open-source math library solution. (We have corporate sponsors, and have been sponsored by the National Science Foundation for many years in the past.)

Performance-wise, the exact characterization will depend on the operation you're looking at, the floating-point datatype, the hardware, and the problem size range you're interested in. However, generally-speaking, BLIS typically meets or exceeds OpenBLAS's level-3 performance for all but the smallest problem sizes (less than 300 or so). It also employs a more flexible level-3 parallelization strategy than OpenBLAS is capable of (due to their monolithic assembly kernel design).

Stability-wise, I would like to think that BLIS is quite stable. We try to be very responsive to bug reports, and thanks to a surge of interest from the community over the last five or so months we've been able to identify and fix a lot of issues (mostly build system related). This has smoothed out the user experience for end-users as well as package managers.

Also, keep in mind that BLIS has provided (from day zero) a superset of BLAS-like functionality, and done so via two novel APIs separate and apart from the BLAS compatibility layer:

  • an explicitly typed BLAS-like API
  • an implicitly typed object-based API

This not only supports legacy users who already have software that needs BLAS linkage, but provides a great toolbox for those interested in building custom dense linear algebra solutions from scratch--people who may not feel any particular affinity towards the BLAS interface.

Born out of a frustration with the various shortcomings in both existing BLAS implementations as well as the BLAS API itself, BLIS has been my labor of love since 2012. It's not going anywhere, and will only get better. :)

Ah thanks @tkelman. Cygwin :( :(

It would be interesting to hear some experiences from people using numpy compiled against BLIS on Linux/macOS though.

Thanks for the context @fgvanzee. Will be interesting for us to add libFLAME support and try on the SciPy benchmark suite.

@rgommers Re: libflame: Thanks for your interest. Just be aware that libflame could use some TLC; it's not in quite as good shape as BLIS. (We don't have the time/resources to support it as we would like, and almost 100% of our attention over the last six years has been focused on getting BLIS to a place where it could become a viable and competitive alternative to OpenBLAS et al.)

At some point, once BLIS matures and our research avenues have been exhausted, we will likely turn our attention back to libflame/LAPACK-level functionality (Cholesky, LU, QR factorizations, for example). This may take the form of incrementally adding those implementations to BLIS, or it may involve an entirely new project to eventually replace libflame. If it is the latter, it will be designed to take advantage of lower-level APIs in BLIS, thus avoiding some function call and memory copy overhead that is currently unavoidable via the BLAS. This is just one of many topics we look forward to investigating.

I've run the benchmark from this article with NumPy 1.15 and BLIS 0.3.2 on an Intel Skylake without multithreading (I had a hardware instruction error with HT):

Dotted two 4096x4096 matrices in 4.29 s.
Dotted two vectors of length 524288 in 0.39 ms.
SVD of a 2048x1024 matrix in 13.60 s.
Cholesky decomposition of a 2048x2048 matrix in 2.21 s.
Eigendecomposition of a 2048x2048 matrix in 67.65 s.

Intel MKL 2018.3:

Dotted two 4096x4096 matrices in 2.09 s.
Dotted two vectors of length 524288 in 0.23 ms.
SVD of a 2048x1024 matrix in 1.11 s.
Cholesky decomposition of a 2048x2048 matrix in 0.19 s.
Eigendecomposition of a 2048x2048 matrix in 7.83 s.
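
(For anyone wanting to reproduce numbers like these: the benchmark reduces to a handful of timed NumPy calls. The sketch below follows the shape of the linked article's script but is not a copy of it; the sizes and helper names here are illustrative assumptions.)

```python
import time
import numpy as np

def bench(label, fn):
    # Time a single call and report it in seconds.
    t0 = time.perf_counter()
    fn()
    print(f"{label} in {time.perf_counter() - t0:.2f} s.")

n = 4096
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))
v = rng.random(n * 128)
m = a[: n // 2, : n // 2]
spd = m @ m.T + n * np.eye(n // 2)  # symmetric positive definite input for Cholesky

bench(f"Dotted two {n}x{n} matrices", lambda: a @ b)
bench(f"Dotted two vectors of length {v.size}", lambda: v @ v)
bench(f"SVD of a {n//2}x{n//4} matrix",
      lambda: np.linalg.svd(a[: n // 2, : n // 4], full_matrices=False))
bench(f"Cholesky decomposition of a {n//2}x{n//2} matrix",
      lambda: np.linalg.cholesky(spd))
bench(f"Eigendecomposition of a {n//2}x{n//2} matrix",
      lambda: np.linalg.eig(m))
```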

Dotted two 4096x4096 matrices in 4.29 s.

@homocomputeris Sorry, I've never heard the "dot" verb used to describe an operation on two matrices before. Is that a matrix multiplication?

@fgvanzee What is the status of Windows support in BLIS these days? I remember getting it to build on Windows used to be mostly unsupported...

@fgvanzee and yeah, numpy.dot is the traditional way of calling GEMM in Python. (Sort of an odd name, but it's because it handles vector-vector, vector-matrix, matrix-matrix all in the same API.)

@njsmith The status of "native" Windows support is mostly unchanged. We still lack the expertise and interest in making such support happen, unfortunately. However, since Windows 10 was released, it seems there is an "Ubuntu for Windows" or bash compatibility environment of some sort available. That is probably a much more promising avenue to achieve Windows support. (But again, nobody in our group develops on or uses Windows, so we haven't even looked into that option, either.)

Ok one last post for now...

@homocomputeris for a benchmark like this it really helps to show some well known library too, like OpenBLAS, because otherwise we have no idea how fast your hardware is.

@fgvanzee Speaking of the native strided support, what restrictions on the strides do you have these days? Do they have to be aligned, positive, non-negative, exact multiples of the data size, ...? (As you may remember, numpy arrays allow for totally arbitrary strides measured in bytes.)

@fgvanzee "bash for Windows" is effectively equivalent to running a Linux VM on Windows – a particularly fast and seamless VM, but it's not a native environment. So the good news is that you already support bash for Windows :-), but the bad news is that it's not a substitute for native Windows support.

@njsmith My results are more or less the same as in the article.
Latest MKL, for example:

Dotted two 4096x4096 matrices in 2.09 s.
Dotted two vectors of length 524288 in 0.23 ms.
SVD of a 2048x1024 matrix in 1.11 s.
Cholesky decomposition of a 2048x2048 matrix in 0.19 s.
Eigendecomposition of a 2048x2048 matrix in 7.83 s.

I want to note that I have no idea how to compile BLIS so that it uses everything my CPU offers for optimization and multithreading, while MKL has this more or less out of the box.

@njsmith Thanks for that update. I agree that nothing beats native OS support. I also agree that we need to see the benchmark run with other libraries for us to properly interpret @homocomputeris's timings.

Speaking of the native strided support, what restrictions on the strides do you have these days? Do they have to be aligned, positive, non-negative, exact multiples of the data size, ...? (As you may remember, numpy arrays allow for totally arbitrary strides measured in bytes.)

@njsmith Aligned? No. Positive? I think we lifted that constraint, but it's not been thoroughly tested. Exact multiples of the datatype? Yes, still.

I'm bringing @devinamatthews into the discussion. Months ago I told him about your request for byte strides, and he had some good points/questions at the time that I can't quite remember. Devin, can you recall your concerns about this, and if so, articulate them to Nathaniel? Thanks.

@homocomputeris Would you mind rerunning the benchmark with a different value for size? I wonder if the value the author used (4096) being a power of two is a particularly bad use case for BLIS, and not particularly realistic for most applications anyway. I suggest trying 4000 (or 3000 or 2000) instead.

@homocomputeris And did you say that the BLIS results are single-threaded, while the MKL results are multi-threaded?

FWIW, I looked at building BLIS on Windows some time ago. The primary pain point at the moment is the build system. It might be possible to get mingw's make to use clang to produce an MSVC-compatible binary. I never got that running with the time I was able to spend on it, but it seems possible.

Within the actual source code, the situation isn't too bad. Recently they even transitioned to using macros for their assembly kernels, so that's one more barrier to Windows support eliminated. See https://github.com/flame/blis/issues/220#issuecomment-397842370 and https://github.com/flame/blis/pull/224. It seems like the source files themselves are a few more macros/ifdefs away from building on MSVC, but that's my perspective as an outsider. I also have no idea how to get the existing BLIS makefiles to work with MSVC.

@insertinterestingnamehere Thanks for chiming in, Ian. You're right that the re-macroized assembly kernels are one step closer to being MSVC-friendly. However, as you point out, our build system was definitely not designed with Windows support in mind. Furthermore, does MSVC support C99 yet? If not, that's another hurdle. (BLIS requires C99.)

Well, I gave the example above only to show that BLIS is comparable to the others; that's why I haven't included anything more specific.

But as you ask :smiley:

  • Intel Core i5-6260U Processor with latest BIOS and whatever patches for Spectre/Meltdown
  • Linux 4.17.3-1-ARCH
  • everything is compiled with gcc 8.1.1 20180531
  • NumPy 1.15.0rc1
  • I've chosen a prime for matrix dimensions

Intel MKL 2018.3 limited to 2 threads (that is, my physical CPU cores):

Dotted two 3851x3851 matrices in 1.62 s.
Dotted two vectors of length 492928 in 0.18 ms.
SVD of a 1925x962 matrix in 0.54 s.
Cholesky decomposition of a 1925x1925 matrix in 0.10 s.
Eigendecomposition of a 1925x1925 matrix in 4.38 s.

BLIS 0.3.2 compiled with
CFLAGS+=" -fPIC" ./configure --enable-cblas --enable-threading=openmp --enable-shared x86_64

Dotted two 3851x3851 matrices in 3.82 s.
Dotted two vectors of length 492928 in 0.39 ms.
SVD of a 1925x962 matrix in 12.82 s.
Cholesky decomposition of a 1925x1925 matrix in 2.02 s.
Eigendecomposition of a 1925x1925 matrix in 67.80 s.

So, it seems that BLIS definitely should be supported by NumPy, at least on Unix/POSIX/whatever-like systems, as I imagine the Windows use case is 'don't touch it if it works'.
The only thing I don't know is the connection between MKL/BLIS and LAPACK/libFLAME. Intel claims they have many things optimized besides BLAS, like LAPACK, FFT, etc.

@fgvanzee Why is a power of 2 bad for BLIS? It's quite common for collocations methods if one wants the fastest FFT.

For numpy et al., it would be sufficient to manage the building in mingw/MSYS2 --- that's what we do currently with openblas on Windows (although this is sort of a hack in itself). It will limit the use to "traditional" APIs that don't involve passing CRT resources across, but that's fine for BLAS/LAPACK.

@fgvanzee good point about C99. WRT the C99 preprocessor, surprisingly, even the MSVC 2017 preprocessor isn't fully caught up there. Supposedly they are currently fixing that (https://docs.microsoft.com/en-us/cpp/visual-cpp-language-conformance#note_D).

@fgvanzee @njsmith Here is what we would need to do to support arbitrary byte strides:

1) Modify the interface. The most expedient thing that occurs to me is to add something like a stride_units flag in the object interface.
2) Refactor all of the internals to use only byte strides. This may not be a bad idea in any case.
3) When packing, check for data type alignment and if not, use the generic packing kernel.
a) The generic packing kernel would also have to be updated to use memcpy. If we can finagle it to use a literal size parameter then it shouldn't suck horribly.
4) When C is unaligned, also use a virtual microkernel that accesses C using memcpy.

This is just for the input and output matrices. If alpha and beta can be arbitrary pointers then there are more issues. Note that on x86 you can read/write unaligned data just fine, but other architectures (esp. ARM) would be a problem. The compiler can also introduce additional alignment problems when auto-vectorizing.
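
(To make concrete what "totally arbitrary strides measured in bytes" means on the NumPy side, here is a small illustration; nothing below is BLIS-specific, it's just plain NumPy showing the kinds of views such an interface would have to accept.)

```python
import numpy as np

a = np.arange(24, dtype=np.float64).reshape(4, 6)

# Contiguous case: byte strides are (row_bytes, item_bytes).
print(a.strides)            # (48, 8)

# Ordinary views produce strides that are not "nice":
print(a[:, ::2].strides)    # (48, 16): skips every other column
print(a[::-1].strides)      # (-48, 8): negative row stride

# A field view into a packed structured array can even have a stride
# that is not a multiple of the element size being viewed:
rec = np.zeros(4, dtype=[("x", np.float64), ("flag", np.int8)])
print(rec["x"].strides)     # (9,)
```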

@homocomputeris:

  1. I didn't mean to imply that powers of two never arise "out in the wild," only that they are waaaay overrepresented in benchmarks, likely because we computer-oriented humans like to count in powers of two. :)
  2. Those benchmark results are really similar. I would love it if the answer to this next question is "no", but is it possible that you accidentally ran both benchmarks with MKL (or BLIS) linked?
  3. I completely agree that powers of two arise in FFT-related applications. I used to work in signal processing, so I understand. :)
  4. My concern with BLIS not doing well with powers of two is actually a concern not unique to BLIS. However, it may be that the phenomenon we're observing is more pronounced with BLIS, and therefore a net "penalty" for BLIS relative to a ridiculously optimized solution such as MKL. The concern is as follows: when matrices are of dimension that is a power of two, it is likely that their "leading dimension" is also a power of two. (The leading dimension corresponds to the column stride when the matrix is column-stored, or the row stride when row-stored.) Let's assume for a moment row storage. When the leading dimension is a power of two, the cache line in which element (i,j) resides lives in the same associativity set as the cache line in which elements (i+1,j), (i+2,j), (i+3,j) etc live--that is, the same elements of subsequent rows. This means that when the gemm operation updates, say, a double-precision real 6x8 microtile of C, those 6 rows all map to the same associativity set in the L1 cache, and inevitably some of these get evicted before being reused. These so-called conflict misses will show up in our performance graphs as occasional spikes down in performance. As far as I know, there is no easy way around this performance hit. We already pack/copy matrices A and B, so this doesn't affect them as much, but we can't pack matrix C to some more favorable leading dimension without taking a huge memory copy hit. (The cure would be worse than the ailment.) Now, maybe MKL has a way of mitigating this, maybe switching to a differently-shaped microkernel that minimizes the number of conflict misses. Or maybe they don't, but I know that BLIS doesn't try to do anything to mitigate this. Hopefully that answers your question.
  5. You're right that MKL is more than just BLAS+LAPACK functionality. However, keep in mind that MKL is a commercially held, closed-source solution. While it is available for non-commercial purposes "for free," there's no guarantee that Intel won't make MKL unavailable to the public in the future, or start charging for it again. Plus, it's not really that great for us computer scientists who want to understand the implementation, or tweak or modify the implementation, or to build our research upon well-understood building blocks. That said, if all you want to do is solve your problem and move on with your day, and you're okay expressing it via BLAS, it's great. :)

@fgvanzee Updated. My bad, for some reason it compiled with MKL, although I had put BLIS in config.
Are SVD and EVD implemented in LAPACK?
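
(For what it's worth, a quick way to catch that kind of accidental linking is to ask NumPy what it was built against, and on Linux to check which BLAS shared object the process actually loaded. The snippet below is just a diagnostic sketch.)

```python
import numpy as np

# Report the BLAS/LAPACK configuration NumPy was built with.
np.show_config()

# On Linux, list the BLAS-looking shared objects mapped into this process.
with open("/proc/self/maps") as f:
    loaded = {line.rsplit("/", 1)[-1].strip() for line in f if ".so" in line}
print(sorted(name for name in loaded
             if any(s in name for s in ("blis", "openblas", "mkl"))))
```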

@homocomputeris Yes, LAPACK implements SVD and EVD. And I expect MKL's implementations to be extremely fast.

However, EVD is a bit ambiguous: there is the generalized eigenvalue problem and a Hermitian (or symmetric) EVD. There is also a tridiagonal EVD, but usually very few people are interested in that, except for the people who are implementing Hermitian EVD.

NumPy will be dropping Python 2.7 and 3.4 support at the end of 2018, which will allow use of more recent MSVC compilers that are C99 compliant; that is one of the reasons we are making that move before official 2.7 support runs out. See http://tinyurl.com/yazmczel for more info on C99 support in recent MSVC compilers. The support isn't complete (is anyone complete?) but may be sufficient. We will be moving NumPy itself to C99 at some point, as some (Intel) have requested that for the improved numerical support.

@charris Thanks for this info, Charles. It's probably the case that whatever will be implemented will be sufficient, but we won't know for sure until we try it.

In any case, BLIS doesn't need to beat MKL to start getting widespread adoption; its immediate competitor is OpenBLAS.

BLIS doesn't need to beat MKL to start getting widespread adoption; its immediate competitor is OpenBLAS.

Couldn't agree more. MKL is in a league of its own.

@njsmith Just curious: How interested would the numpy community be in new APIs/functionality that allowed one to perform, say, gemm in mixed precision and/or mixed domain? That is, each operand could be of a different datatype, with the computation being performed in (potentially) a different precision than one or both of A and B?

@fgvanzee Interested, yes, although the potential number of combinations boggles. But at the moment we do type conversions, which takes time and eats memory. I've also played with the idea of having optimized routines for integer types, although boolean might be the only type that wouldn't suffer irreparably from overflow.

And there is always float16. The ML folks might be more interested in those functionalities. @GaelVaroquaux Thoughts?
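
(To make the cost of those conversions concrete: a mixed-precision product in NumPy today silently upcasts one operand, so an extra temporary copy of that operand is made before a uniform-type gemm is called. A sketch:)

```python
import numpy as np

a32 = np.ones((2000, 2000), dtype=np.float32)
b64 = np.ones((2000, 2000), dtype=np.float64)

# a32 is first converted (copied) to float64, then a standard dgemm runs
# on the converted operand; the result is float64.
c = a32 @ b64
print(c.dtype)  # float64
```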

@charris Yes, the number of cases can be daunting. There are 128 type combinations for gemm, assuming the four traditional floating-point datatypes, and that number excludes any combinatoric inflation from the transposition/conjugation parameters as well as matrix storage possibilities (in which case it would go up to 55,296).

Fully implementing the 128 cases even via a low-performance reference implementation would be noteworthy, and doing so in a higher-performing manner that minimized overhead would be quite special. (And, thanks to BLIS's object-based foundation, if we were to implement this, it would cost very little in terms of additional object code.)

If this interests you, please watch our project in the coming days/weeks. :)

Is BLIS fork safe?

In any case, BLIS doesn't need to beat MKL to start getting widespread adoption; its immediate competitor is OpenBLAS.

Indeed. To be honest, I couldn't make NumPy+BLAS work on my system, but they seem to have very similar performance, judging by the article I cited before.
Probably, libFLAME can speed up LAPACK operations if linked to NumPy.

Another interesting question is to benchmark AMD's BLIS/libFLAME fork on their latest Zen CPUs to see if they got any improvement. It becomes even more interesting in light of the problematic Intel processors.

On Linux and Windows currently numpy's official wheels include a pre-built copy of OpenBLAS, so on those platforms the quickest way to get a copy is pip install numpy. (Of course this isn't entirely fair because e.g. that version is built with a different compiler than the one used to build BLIS locally, but it can give one some idea.)

Is BLIS fork safe?

@jakirkham I spoke to @devinamatthews about this, and he seems to think that the answer is yes. BLIS spawns threads on demand, (for example, when gemm is invoked) rather than maintaining a thread pool as OpenBLAS does.

We are curious though: What kind of threading model does numpy expect/prefer/depend on, if any? Do you need a thread pool to reduce overhead in the use case of calling many very small problems in succession? (BLIS's on-demand threading is not well-suited for that kind of usage.)

EDIT: In a similar vein, do you need pthreads, or are you amenable to OpenMP?

Thanks for the info.

So is it using pthreads, OpenMP, both? FWIW there are known issues with GCC's OpenMP and forking that AFAIK are unresolved (cc @ogrisel ). Since forking via the multiprocessing module in Python is one of the most common parallelism strategies, being able to use threads (preferably pthreads, for the reason above) in a fork-safe way is pretty important in Python generally. Though there are more options these days with things like loky, for example, that use fork-exec.

Admittedly Nathaniel would know way more about this than I, but since I already started writing this comment, IIUC NumPy doesn't use threading itself except through external libraries (e.g. BLAS). Though NumPy does release the GIL often enough, which allows threading in Python to be somewhat effective. Things like Joblib, Dask, etc. take advantage of this strategy.
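
(A minimal illustration of that point: because the underlying BLAS call releases the GIL, a plain Python thread pool can keep several gemm calls in flight at once without any explicit threading inside NumPy. The flip side, of course, is oversubscription if the BLAS itself is also multithreaded.)

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
mats = [rng.random((1500, 1500)) for _ in range(8)]

# Each a @ a releases the GIL while inside the BLAS gemm, so these calls
# genuinely overlap even though they run on ordinary Python threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda a: a @ a, mats))
```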

As to thread pools, it is interesting that you ask about this as I was just profiling an ML technique to maximize performance the other day that does a series of BLAS routines in Python with NumPy/SciPy. Using OpenBLAS and a reasonable number of cores, this routine is able to saturate the cores for the length of the routine even though there is no explicit threading used in Python (including NumPy and SciPy) simply because OpenBLAS used a threadpool for the length of the routine. So yes thread pools are incredibly valuable.

On a different point, does BLIS handle dynamic architecture detection? This is pretty important for building artifacts that can be built once and deployed on a variety of different systems.

@jakirkham Thanks for your comments. I suppose the observed cost of not having a thread pool will depend on the size of the matrix operands you are passing in, and which operation you are performing. I assume gemm is a typical operation you target (please correct if I am wrong), but what problem size do you consider to be typical? I imagine the cost of BLIS's create/join model would manifest if you were repeatedly doing matrix multiplication where A, B, and C were 40x40, but maybe not as much for 400x400 or larger. (TBH, you probably shouldn't expect much speedup from parallelizing 40x40 problems to begin with.)

On a different point, does BLIS handle dynamic architecture detection? This is pretty important for building artifacts that can be built once and deployed on a variety of different systems.

Yes. This feature was implemented not too long ago, in the latter half of 2017. BLIS can now target so-called "configuration families," with the specific sub-configuration to be used chosen at runtime via some heuristic (e.g. cpuid instruction). Examples of supported families are intel64, amd64, x86_64.

@fgvanzee I actually think of performance on small matrices as being one of the weak points of OpenBLAS, so it's a bit worrisome if BLIS is slower again... People definitely do use numpy in all kinds of situations, so we really value libraries that "just work" without needing tuning for specific cases. (I guess this is a bit different from the classic dense linear algebra setting, where the two most important cases are the algorithm author running benchmarks, and experts running week-long supercomputer jobs with specialized sysadmins.) For example, people often use numpy with 3x3 matrices.

For some cases handling this appropriately may just be a matter of noticing that you have a too-small problem and skipping threading entirely. For a 3x3 gemm, the optimal thing is probably to be as stupid as possible.

Though, this kind of tweaking (or even the thread-pool-versus-no-thread-pool thing) might be something that the community would jump in and do if BLIS ever starts getting widespread deployment.

Actually that reminds me: I was talking to someone last week who knew that in his library he wanted to invoke gemm single-threaded, because he was managing threading at a higher level, and he was frustrated that with standard blas libraries the only way to control this is via global settings that are pretty rude to call from inside a random library. Does BLIS's native API allow the user to control thread configuration on a call-by-call basis? I'm guessing yes, because IIRC you don't have any global configuration at all, outside of the BLAS compatibility layer?

I actually think of performance on small matrices as being one of the weak points of OpenBLAS, so it's a bit worrisome if BLIS is slower again... People definitely do use numpy in all kinds of situations, so we really value libraries that "just work" without needing tuning for specific cases.

I understand wanting it to "just work," but I'm trying to be realistic and honest with you. If you have a 3x3 problem, and performance is really important to you, you probably shouldn't even be using BLIS. I'm not saying that BLIS will run 10x slower than a naive triple loop for 3x3's, just that that is not the problem size that BLIS's implementations excel at.

I'm surprised to hear that you are somewhat unsatisfied with OpenBLAS for small problems. What is your reference point? Is it just that performance is lower than it is for large problems? Attaining the highest performance possible for small problems requires a different strategy than for large problems, which is why most projects target one or the other, and then just handle the non-targeted cases suboptimally.

For some cases handling this appropriately may just be a matter of noticing that you have a too-small problem and skipping threading entirely. For a 3x3 gemm, the optimal thing is probably to be as stupid as possible.

I agree that a cutoff is ideal. But determining where that cutoff should be is non-trivial and most certainly varies by architecture (and across level-3 operations). So it's very possible, but it hasn't been implemented (or well thought-out) yet.

Though, this kind of tweaking (or even the thread-pool-versus-no-thread-pool thing) might be something that the community would jump in and do if BLIS ever starts getting widespread deployment.

Yes.

Actually that reminds me: I was talking to someone last week who knew that in his library he wanted to invoke gemm single-threaded, because he was managing threading at a higher level, and he was frustrated that with standard blas libraries the only way to control this is via global settings that are pretty rude to call from inside a random library.

Sequential gemm from a multithreaded application is one of our favorite use cases to think about. (Reminder: if sequential gemm is his only use case, he can simply configure BLIS with multithreading disabled. But let's assume he wants to build once and then decide threading later.)

Does BLIS's native API allow the user to control thread configuration on a call-by-call basis? I'm guessing yes, because IIRC you don't have any global configuration at all, outside of the BLAS compatibility layer?

Even if you configure BLIS with multithreading enabled, parallelism is disabled (one thread) by default. So that's a second way he can solve his problem.

But let's assume he wants to change the degree of parallelism at runtime. Parallelism in BLIS can be set at runtime. However, it still involves BLIS internally setting and reading environment variables via setenv() and getenv(). (See the Multithreading wiki for more details.) So I'm not sure if that is a deal-breaker for your friend. We want to implement an API where threading can be specified in a more programmatic way (that does not involve environment variables), but we're not quite there yet. Mostly it's about the interface; the infrastructure is all in place. Part of the problem is that we've all been trained over the years (e.g. OMP_NUM_THREADS, etc.) into specifying one number when parallelizing, and that's a gross oversimplification of the information that BLIS prefers; there are five loops in our gemm algorithm, four of which can be parallelized (five in the future). We can guess from one number, but it's usually not ideal since it depends on the hardware topology. So that is part of what is hampering progress on this front.
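
(For reference, the environment-variable route described above looks roughly like the sketch below. The variable names are the ones documented in the BLIS Multithreading wiki; since BLIS currently reads them through getenv(), they can in principle also be set from inside a Python process, but treat this as a sketch rather than a guaranteed interface.)

```python
import os

# Option 1: give BLIS a single number and let it guess how to factor it
# across its loops.
os.environ["BLIS_NUM_THREADS"] = "8"

# Option 2: specify the parallelism per loop explicitly (the JC, IC, JR
# and IR loops mentioned above).
for var, val in {"BLIS_JC_NT": "2", "BLIS_IC_NT": "4",
                 "BLIS_JR_NT": "1", "BLIS_IR_NT": "1"}.items():
    os.environ[var] = val

import numpy as np  # a BLIS-linked NumPy picks these up when BLIS consults the environment
```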

@devinamatthews Any chance for adding TBLIS along the same vein?

If you have a 3x3 problem, and performance is really important to you, you probably shouldn't even be using BLIS. I'm not saying that BLIS will run 10x slower than a naive triple loop for 3x3's, just that that is not the problem size that BLIS's implementations excel at.

I'm not that worried about getting optimal performance for 3x3 problems (though obviously it's nice if we can get it!). But to give an extreme example: if numpy is compiled with BLIS as its underlying linear algebra library, and some user writes a @ b in their code using numpy, I would definitely hope that it doesn't end up running 10x slower than a naive implementation. Expecting users to rebuild numpy before multiplying 3x3 matrices is too much to ask :-). Especially since the same program that multiplies 3x3 matrices in one place may also multiply 1000x1000 matrices in another place.

Reminder: if sequential gemm is his only use case, he can simply configure BLIS with multithreading disabled. But let's assume he wants to build once and then decide threading later.

He's shipping a Python library. He expects that his users will take his library, and combine it with other libraries and their own code, all running together in the same process. His library uses GEMM, and it's likely that some of that other code he doesn't control – or even know about – will also want to use GEMM. He wants to be able to control the threading for his calls to GEMM, without accidentally affecting other unrelated calls to GEMM that might be happening in the same process. And ideally, he'd be able to do this without having to ship his own copy of GEMM, since that's very annoying to do properly, and also it's kind of conceptually offensive that a program would have to include two copies of a large library just so you can get two copies of a single integer variable number_of_threads. Does that make more sense?

I would definitely hope that it doesn't end up running 10x slower than a naive implementation.

Now that I think about it, I wouldn't be surprised if it was several times slower than naive for extremely small (3x3) problems, but the crossover point would be low, maybe as small as 16x16. And this is a problem that is straightforward to put a band-aid on. (I would want to do it for all level-3 operations.)

He's shipping a Python library. He expects that his users will take his library, and combine it with other libraries and their own code, all running together in the same process. His library uses GEMM, and it's likely that some of that other code he doesn't control – or even know about – will also want to use GEMM. He wants to be able to control the threading for his calls to GEMM, without accidentally affecting other unrelated calls to GEMM that might be happening in the same process. And ideally, he'd be able to do this without having to ship his own copy of GEMM, since that's very annoying to do properly, and also it's kind of conceptually offensive that a program would have to include two copies of a large library just so you can get two copies of a single integer variable number_of_threads. Does that make more sense?

Yes, that was helpful--thanks for the details. Sounds like he would be a perfect candidate for our as-yet-nonexistent threading API. (Personally, I think the environment variable convention is borderline madness and have expressed my feelings to my collaborators on more than one occasion. That said, it is rather convenient to a wide swath of HPC users/benchmarkers, so we'll have to come up with a proper runtime API that doesn't preclude the environment variable usage so that everyone stays happy.)

Would you mind conveying to him that this is definitely on our radar? I'll huddle with Robert. He may be interested enough to approve my spending time on this sooner rather than later. (The threading API has been on his wishlist for a while.)

Building BLIS on Windows with clang.exe targeting the MSVC ABI is certainly possible. I spent a few hours and here are the changes. The number of changes required was surprisingly low compared to OpenBLAS.
Log is here. All BLIS and BLAS tests pass.

@isuruf Thank you for taking the time to look into this. I don't consider myself qualified to assess whether this will work for the numpy folks (for more than one reason), so I'll defer to them in terms of it satisfying their needs.

As for the pull request, I have a few comments/requests (all quite minor), but I'll start a conversation on the pull request itself.

PS: Glad you were able to figure out the f2c errors. I imported just enough parts of libf2c into the BLAS test driver source (compiled as libf2c.a) so that things would link, though it did require some hackery in the header files. (I don't like maintaining Fortran code--to the point where I would rather look at f2c'ed C than Fortran--in case you couldn't tell.)

Intel MKL 2018.3:

Dotted two 4096x4096 matrices in 2.09 s.
Dotted two vectors of length 524288 in 0.23 ms.
SVD of a 2048x1024 matrix in 1.11 s.
Cholesky decomposition of a 2048x2048 matrix in 0.19 s.
Eigendecomposition of a 2048x2048 matrix in 7.83 s.

A word on the benchmark used above. Robert reminded me that unless we know how this benchmark (or numpy?) implements Cholesky, EVD, and SVD, the results for those operations will not be all that meaningful. For example, if netlib LAPACK code is used for Cholesky factorization, the algorithmic blocksize will be wrong (far from ideal). Furthermore, depending on how numpy links to MKL, it may further distort things because MKL has its own implementation of Cholesky factorization in addition to fast gemm, so it may not even be apples-to-apples when comparing to MKL.

I know you may have been using it simply to get a general idea of performance, and that's fine. Just understand that there may be devils hiding in the details that do not allow using it for anything more than very approximate comparisons.

Generally numpy defers operations like those to whatever LAPACK or LAPACK-alike library is available. The MKL builds are almost certainly using the tuned fork of LAPACK that Intel ships inside MKL. The BLIS builds are probably using the reference LAPACK or something like it.

In a sense this is fair: if you're trying to choose a numpy configuration to do fast SVD, then it's relevant that MKL has a tuned version available and BLIS does not. (I believe OpenBLAS is in between: they ship a modified version of LAPACK with the library, but it's much closer to the reference implementation than MKL's version is.) But yeah, for sure, if you're trying to understand what BLIS is and why the results look like that, it's important to remember that there's a lot more that goes into SVD/etc than just pure GEMM efficiency.

In a sense this is fair:

I understand and agree. It's complicated by the fact that some people use benchmarks like the one proposed previously to measure absolute performance on a single piece of hardware (that is, to judge how good an implementation is relative to peak performance), and some use it to compare implementations. I think Robert's comments about Cholesky et al. are more targeted at the former than the latter.

@njsmith @insertinterestingnamehere @rgommers @charris Thanks to Isuru's quick work, we were able to refine and merge his appveyor-based Windows support into BLIS's master branch. Please take a look and let us know whether and to what degree this addresses the issue of Windows support for numpy.

@fgvanzee Can you link to the PR?

Can you link to the PR?

Sure, here it is.

That initial build setup is great! At least in theory it should be enough to allow building numpy with BLIS on Windows. Things like building a dll or supporting MinGW can be added later.

One thing to note about the dependencies used in this case: pthreads-win32 is LGPL. IDK what, if anything, needs to be done about that though.

Licensing is a difficult issue. I do know a few large companies whose legal policies make it difficult to use "copyleft" licences like LGPL while "permissive" licences are not a problem. Adding LGPL code to the NumPy/SciPy code base or wheels would definitely be a cause for concern, justifiably or not.

We're not adding it to the code base here. Shipping an LGPL component in wheels is not a problem. We're now shipping a gfortran dll that's GPL with runtime exception. See gh-8689

Indeed. Besides, I wouldn't be surprised if such companies also prefer MKL.

We just need to make sure the LGPL component is not statically linked, probably.

@jakirkham Thanks for your comments. I suppose the observed cost of not having a thread pool will depend on the size of the matrix operands you are passing in, and which operation you are performing. I assume gemm is a typical operation you target (please correct if I am wrong), but what problem size do you consider to be typical? I imagine the cost of BLIS's create/join model would manifest if you were repeatedly doing matrix multiplication where A, B, and C were 40x40, but maybe not as much for 400x400 or larger. (TBH, you probably shouldn't expect much speedup from parallelizing 40x40 problems to begin with.)

Mainly GEMM and SYRK are typical. GEMV comes up sometimes, but often that is rolled into a large GEMM operation if possible.

It's not atypical for us to at least have 1 dimension that is on the order of 10^6 elements. The other could be anywhere from 10^3 to 10^6. So it varies quite a bit. Is there anything we would need to do to make this use the threadpool or does that happen automatically? Also how does the threadpool behave if the process is forked?

On a different point, does BLIS handle dynamic architecture detection? This is pretty important for building artifacts that can be built once and deployed on a variety of different systems.

Yes. This feature was implemented not too long ago, in the latter half of 2017. BLIS can now target so-called "configuration families," with the specific sub-configuration to be used chosen at runtime via some heuristic (e.g. cpuid instruction). Examples of supported families are intel64, amd64, x86_64.

To preface this a bit, we have old computers using Nehalem all the way up through Broadwell. Maybe a few newer personal machines have Kaby Lake. So being able to make the most out of the architecture we are provided is important, while simultaneously not crashing by using unsupported intrinsics if we are running on an ancient machine. Do Flame's sub-configurations support this range, and are there extra code paths built in to dispatch the proper kernel on different architectures? How granular does this get? Is there anything we need to be doing during the build to ensure these kernels are present?

@jakirkham building using the x86_64 configuration gets you:

  • Penryn (incl. Nehalem and any other 64-bit SSSE3 Intel chip)
  • Sandy Bridge (incl. Ivy Bridge)
  • Haswell (+ Broadwell, Skylake, Kaby Lake, and Coffee Lake, although it may not be completely optimal for the last three)
  • Xeon Phi (1st and 2nd generation, we could do 3rd generation too if needed)
  • Skylake SP/X/W (probably will be very close to optimal on Cannon Lake too)
  • Bulldozer/Piledriver/Steamroller/Excavator
  • Ryzen/EPYC

Note that some of these do require a new enough compiler and/or binutils version, but only on the build machine.

AFAIK libflame doesn't have specialized kernels for any architecture, this is all up to BLIS.

@jakirkham we would have to add a threadpool implementation. For now it is a basic fork-join implementation which should be fork-safe.

AFAIK libflame doesn't have specialized kernels for any architecture, this is all up to BLIS.

This is mostly correct. While libflame does have a few intrinsics (currently only SSE) kernels for applying Givens rotations, those kernels don't belong in libflame and are only there because BLIS did not exist at the time they were written, and thus there was no other place to house them. Eventually, those kernels will be rewritten and updated and relocated into BLIS.

Do Flame's sub-configurations support this range, and are there extra code paths built in to dispatch the proper kernel on different architectures? How granular does this get? Is there anything we need to be doing during the build to ensure these kernels are present?

@jakirkham I'm not sure what you mean, by "Flame" in this case. We have two products: libflame and BLIS. While libflame needs attention and likely a rewrite, virtually all of my attention over the last several years has been on BLIS.

If you textually replace "Flame" with "BLIS", the answer is "yes, mostly."

How granular does it get?

Not sure what you mean, but kernel support in BLIS is not as extensive as it is in OpenBLAS. For example, we oftentimes do not optimize complex domain trsm. Kernels are utilized, in some combination, by sub-configurations. Sub-configurations are chosen via cpuid. You get exactly the kernels "registered" by the sub-configuration. Please see the Configuration wiki for details on the nuts and bolts.

Is there anything we need to be doing during the build to ensure these kernels are present?

If you need runtime hardware detection, you target a configuration family (e.g. intel64, x86_64) at configure-time instead of a specific sub-configuration (or auto, which selects a specific sub-configuration). That's it.

Is there anything we would need to do to make this use the threadpool or does that happen automatically? Also how does the threadpool behave if the process is forked?

As I've said previously in this thread, BLIS does not use a thread pool. And I defer to Devin on fork safety (and he seems to think forking will be no problem).

FYI, BLIS conda packages are available for linux, osx, windows on conda-forge. (Currently building dev branch, waiting for a release). Packages were built with pthreads enabled and x86_64 configuration.

He's shipping a Python library. He expects that his users will take his library, and combine it with other libraries and their own code, all running together in the same process. His library uses GEMM, and it's likely that some of that other code he doesn't control – or even know about – will also want to use GEMM. He wants to be able to control the threading for his calls to GEMM, without accidentally affecting other unrelated calls to GEMM that might be happening in the same process.

@njsmith I've spoken to Robert, and he is fine with raising the priority of working on this. (Also, in looking into it briefly, I discovered a race condition in BLIS that would manifest any time two or more application threads attempt to use different degrees of parallelism. Fixing that race condition will simultaneously accommodate the API-related needs of people such as your friend.)

@jakirkham One of our contacts at Intel, @jeffhammond, informs us that OpenMP implementations routinely employ a thread pool model internally. Thus, he discourages us from implementing thread pools redundantly within BLIS.

Now, it may be the case, as you suggest, that numpy needs/prefers pthreads, in which case maybe we're back to an actual fork/join happening underneath the fork/join-style pthreads API.

So is it using pthreads, OpenMP, both?

Also, I realized I forgot to answer this question. BLIS's multithreaded parallelism is configurable: it can use pthreads or OpenMP. (However, this is not to be confused with BLIS's unconditional runtime dependency on pthreads due to our reliance on pthread_once(), which is used for library initialization.)

Unfortunately unless OpenMP implementations (e.g. GOMP) generally become more robust to forking, it's not really a safe option in Python. Particularly not in something as low in the stack as NumPy.

Unfortunately unless OpenMP implementations (e.g. GOMP) generally become more robust to forking, it's not really a safe option in Python. Particularly not in something as low in the stack as NumPy.

Fair enough. So it sounds like numpy would rely on the --enable-threading=pthreads configuration option for multithreading via BLIS.

When compiled with pthreads is there some public API to get some programmatic control to avoid oversubscription issues when doing nested parallelism with Python processes / thread pools that call into numpy?

More specifically here are the kinds of public symbols I would look for:

https://github.com/tomMoral/loky/pull/135/files#diff-e49a2eb30dd7db1ee9023b8f7306b9deR111

Similar to what is done in https://github.com/IntelPython/smp for MKL & OpenMP.

When compiled with pthreads is there some public API to get some programmatic control to avoid oversubscription issues when doing nested parallelism with Python processes / thread pools that call into numpy?

@ogrisel Great question, Olivier. There is a way to set threading parameters at runtime, but it's currently done with global semantics. It's suboptimal because (aside from being global instead of on a per call basis to gemm or whatever) it feeds through environment variables.

I am currently working on a minor redesign of the underlying code that will allow someone to use one of the so-called "expert" sub-APIs of BLIS to pass in an extra data structure to gemm that will allow the caller to specify on a per-call basis what the parallelization strategy should be. The aforementioned data structure will allow people to specify a single number of threads, and let BLIS automatically do its best to figure out where to get parallelism, or a level of parallelism for each loop in the matrix multiplication algorithm (as BLIS prefers).

Either way, I think this new approach will meet your needs. I am halfway done with the changes already, so I think I'm only a week or so from getting the new feature pushed to github. Please let me know if this plan sounds like something that will work for you, and/or if you have any other concerns or requests on this topic.

PS: I took a look at the loky link you provided. Those symbols are all set up for a global setting of threads. My proposed solution would not preclude setting things globally, but it's designed so that that is not the only option. So someone who wants to use a maximum of 8 cores/threads could have an application spawn 2 threads, and each of those invoke an instance of gemm that gets 4-way parallelism. Or 3 application threads where two of them call gemm with 2-way parallelism with the third using 4-way parallelism. (You get the idea.)

it feeds through environment variables.

What does it mean? Is it possible to use ctypes from Python to reconfigure the size of the default/global BLIS thread pool once it has already been initialized in a specific Python process?

it feeds through environment variables.

What does it mean?

@ogrisel What I mean by this is that we have a small API in BLIS to set or get the number of threads, similar to omp_get_num_threads() and omp_set_num_threads() in OpenMP. However, these BLIS API functions are implemented as calls to getenv() and setenv(). This is merely a historical artifact of the fact that one of the original use cases for BLIS was to set one or more environment variables in the shell (e.g. bash) and then execute a BLIS-linked application, so at the time it was convenient to simply build on that environment variable approach; it was never meant to be the final, perfect way of specifying the number of threads for parallelism.

Is it possible to use ctypes from Python to reconfigure the size of the BLIS thread pool once it has already been initialized in a specific Python process?

BLIS does not explicitly use a thread pool. Rather, we use a create/join model for extracting parallelism via pthreads. But yes, after BLIS has been initialized, the application (or calling library) can change the number of threads at runtime via a function call, at least in principle. (I don't know if this would work in practice, however, because I don't know how Python would handle BLIS attempting to set/get environment variables.) But as I mentioned previously, once I complete some modifications, the application/library will be able to specify a custom parallelization scheme on a per-call basis at the time that the level-3 operation is called. This would be done by first initializing a small struct datatype and then passing that struct into an extended "expert" version of the BLIS API for gemm, for example. This would make the environment variables unnecessary for those who do not need/want them. Hopefully this answers your question.

Thank you very much for the clarifications. The per-call expert API is interesting but would require numpy to maintain a specific API to expose that to its Python-level callers. I am not sure we want that. I think for numpy users having a way to change the current value of the global (process-level) parallelism level is enough. I personally don't mind if it's achieved via changing the current value of the env variable, as long as this change is taken into account for subsequent BLAS-3 calls.

@charris float16 might indeed be interesting for some machine learning workloads, at least at prediction time although I don't have personal experience with this.

I think for numpy users having a way to change the current value of the global (process-level) parallelism level is enough. I personally don't mind if it's achieved via changing the current value of the env variable, as long as this change is taken into account for subsequent BLAS-3 calls.

Good to know. If that is the case, then BLIS is that much closer to being ready to use by numpy. (I would just like to make these in-progress modifications first, which also happen to fix a previously unnoticed race condition.)

Probably better not to depend on the environment variables being rechecked on every call, because getenv isn't super fast, so it might make sense to remove it later. (Also, I don't think it's even guaranteed to be thread-safe?) But bli_thread_set_num_threads API calls should be fine, since even if BLIS stops calling getenv all the time then the API calls can be adjusted to keep working regardless.

In the longer run, I think it would make sense to start exposing some beyond-bare-BLAS APIs in numpy. One of the things that makes BLIS attractive in the first place is exactly that it provides features that other BLAS libraries don't, like the ability to multiply strided matrices, and there's work afoot to extend the BLAS APIs in a number of ways.

We wouldn't want to hard-code library-specific details in numpy's API (e.g., we wouldn't want np.matmul to start taking arguments corresponding to BLIS's JC, IC, JR, and IR parameters), but it might well make sense to provide a generic "how many threads for this call" argument that only works on backends that provide that functionality.
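
(As a rough illustration of that kind of per-process control, something like the following ctypes sketch should work against a shared BLIS build: it calls bli_thread_set_num_threads() directly instead of going through environment variables. The library path and the exact integer width of BLIS's dim_t are assumptions here; check blis.h for your build.)

```python
import ctypes
import ctypes.util

# Load the BLIS shared library; adjust the name/path to your install.
path = ctypes.util.find_library("blis") or "libblis.so"
blis = ctypes.CDLL(path)

# Process-wide setting, analogous to omp_set_num_threads().
blis.bli_thread_set_num_threads.argtypes = [ctypes.c_int64]  # dim_t assumed 64-bit
blis.bli_thread_get_num_threads.restype = ctypes.c_int64

blis.bli_thread_set_num_threads(4)
print(blis.bli_thread_get_num_threads())  # -> 4
```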

One thing I haven't seen mentioned is the index precision. Most system supplied libraries seem to use 32 bit integers, which is a limitation for some applications these days. At some point is would be good if all the indexes were 64 bits, which probably requires that we supply the library. I don't know what we are currently doing in regard to index size. @matthew-brett Are we still compiling with 32 bit integers?

@charris The integer size in BLIS is configurable at configure-time: 32 or 64 bits. Furthermore, you can configure the integer size used in the BLAS API independently from the internal integer size.

Actually that reminds me: I was talking to someone last week who knew that in his library he wanted to invoke gemm single-threaded, because he was managing threading at a higher level, and he was frustrated that with standard blas libraries the only way to control this is via global settings that are pretty rude to call from inside a random library.

@njsmith I've fixed the race condition I mentioned previously and also implemented the thread-safe, per-call multithreading API. Please point your friend towards fa08e5e (or any descendant of that commit). The multithreading documentation has been updated as well and walks the reader through his choices, with basic examples given. The commit is on the dev branch for now, but I expect to merge it to master soon. (I've already put the code through most of its paces.)

EDIT: Links updated to reflect minor fix commit.

As a possible addition to supported types, what about long double? I note some architectures are starting to support quad precision (still in software) and I expect that at some point extended precision will be replaced with quad precision on intel. I don't think this is immediately pressing, but I think that after all these years things are beginning to go that way.

@charris We are in the early stages of considering support for bfloat16 and/or float16 in particular because of their machine learning / AI applications, but we are also aware of demand for double double and quad-precision. We would need to lay some groundwork for it to be feasible for the entire framework, but it is definitely on our medium- to long-term radar.

@charris According to https://en.wikipedia.org/wiki/Long_double, long double can mean a variety of things:

  • 80-bit type implemented in x87, using 12B or 16B storage
  • double precision with MSVC
  • double double precision
  • quadruple precision

Because the meaning is ambiguous and depends not just on hardware but the compiler used, it's an utter disaster for libraries, because the ABI isn't well-defined.

From a performance perspective, I don't see any upside to float80 (i.e. x87 long double) because there isn't a SIMD version. If one can write a SIMD version of double double in BLIS, that should perform better.

The float128 implementation in software is at least an order-of-magnitude slower than float64 in hardware. It would be prudent to write a new implementation of float128 that skips all the FPE handling and is amenable to SIMD. The implementation in libquadmath, while correct, isn't worth the attention of a high-performance BLAS implementation like BLIS.

Yep, it's a problem. I don't think extended precision is worth the effort, and the need for quad precision is spotty; double is good for most things, but when you need it, you need it. I'm not worried about performance, the need isn't for speed but for precision. Note that we just extended support to an ARM64 with quad precision long double, a software implementation of course, but I expect hardware to follow at some point and it might be nice to have something tested and ready to go.

The BLAS G2 proposal has some consideration of double-double and "reproducible" computations in BLAS. (Reproducible here means deterministic across implementations, but IIUC also involves using higher-precision intermediate values.)

Excited to see this moving forward!

For the record, I'm the one @njsmith was referring to who's been interested in controlling the threading from software. My workloads are embarrassingly parallel at prediction time, and my matrix multiplications are relatively small. So I'd rather parallelise larger units of work.

I did some work about a year ago on packaging Blis for PyPi, and adding Cython bindings: https://github.com/explosion/cython-blis

I found Blis quite easy to package as a C extension like this. The main stumbling block for me was Windows support. From memory, it was C99 issues, but I might be remembering wrongly.

The Cython interface I've added might be of interest. In particular, I'm using Cython's fused types so that there's a single nogil function that can be called with either a memory-view or a raw pointer, for both the float and double types. Adding more branches for more types is no problem either. Fused types are basically templates: they allow compile-time conditional execution, for zero overhead.

I would be very happy to maintain a stand-alone Blis package, keep the wheels built, maintain a nice Cython interface, etc. I think it would be very nice to have it as a separate package, rather than something integrated within numpy. We could then expose more of Blis's API, without being limited by what other BLAS libraries support.

@honnibal Sorry for the delay in responding on this thread, Matthew.

Thanks for your message. We're always happy to see others get excited about BLIS. Of course, we would be happy to advise whenever needed if you decide to integrate it further into the Python ecosystem (as an application, a library, a module, etc.).

As for Windows support, please check out the clang/appveyor support for the Windows ABI that @isuruf recently added. Last I heard from him, it was working as expected, but we don't do any development on Windows here at UT so I can't keep tabs on this myself. (Though Isuru pointed out to me once that I could sign up for appveyor in a manner similar to Travis CI.)

Also, please let me know if you have any questions about the per-call threading usage. (I've updated our Multithreading documentation to cover this topic.)

As of BLIS 0.5.2, we have a Performance document that showcases single-threaded and multithreaded performance of BLIS and other implementations of BLAS for a representative set of datatypes and level-3 operations on a variety of many-core architectures, including Marvell ThunderX2, Intel Skylake-X, Intel Haswell, and AMD Epyc.

So if the numpy community is wondering how BLIS stacks up against the other leading BLAS solutions, I invite you to take a quick peek!

Very interesting, thanks @fgvanzee.

I had to look up Epyc - it seems that it's a brand name based on the Zen (possibly updated to Zen+ at some point?) architecture. Perhaps better to rename to Zen? For our user base Ryzen/Threadripper are the more interesting brands; they may recognize Zen but probably not Epyc.

Epyc is the name of the AMD server line. It is the successor to the AMD Opteron products of the past.

There is, unfortunately, no unique way for BLIS to label its architectural targets, because the code depends on the vector ISA (e.g. AVX2), the CPU core microarchitecture (e.g. Ice Lake), and the SoC/platform integration (e.g. an Intel Xeon Platinum processor). BLIS uses microarchitecture code names in some cases (e.g. Dunnington), but that isn't better for everyone.

@fgvanzee You might consider adding the aliases that correspond to the GCC march/mtune/mcpu names...

@rgommers The subconfiguration within BLIS that covers Ryzen and Epyc is actually already named zen, as it captures both products.

As for whether Ryzen/Threadripper or Epyc are more interesting brands (even to numpy users), I'll say this: if I could only benchmark one AMD Zen system, it would be the highest-end Epyc, because (a) it uses a similar microarchitecture to that of Ryzen, and (b) it gives me the maximum 64 physical cores (and, as a bonus, those cores are arranged in a somewhat novel, NUMA-like configuration), which (c) places maximal stress on BLIS and the other implementations. And that is basically what we did here.

Now, thankfully, there is no rule saying I can only benchmark one Zen system. :) However, there are other hurdles, particularly with regards to gaining access in the first place. I don't have access to any Ryzen/Threadripper systems at the moment. If/when I do gain access, I'll be happy to repeat the experiments and publish the results accordingly.

Jeff points out some of the naming pitfalls we face. Generally, we name our subconfigurations and kernel sets in terms of microarchitecture, but there is more nuance yet. For example, we use our haswell subconfiguration on Haswell, Broadwell, Skylake, Kaby Lake, and Coffee Lake. That's because they all basically share the same vector ISA, which is pretty much all the BLIS kernel code cares about. But that is an implementation detail that almost no users need to be concerned with. If you use ./configure auto, you will almost always get the best subconfiguration and kernel set for your system, whether they are named zen or haswell or whatever. For now, you still need to take a more hands-on approach when it comes to optimally choosing your threading scheme, and that's where the SoC/platform integration that Jeff mentions comes in.

@jeffhammond Thanks for your suggestion. I've considered adding those aliases in the past. However, I'm not convinced it's worth it. It will add significant clutter to the configuration registry, and the people who will be looking at it in the first place likely already know about our naming scheme for subconfigurations and kernel sets, and thus won't be confused by the absence of certain microarchitectural revision names in that file (or in the config directory). Now, if BLIS required manual identification of subconfiguration, via ./configure haswell for example, then I think the scales definitely tip in favor of your proposal. But ./configure auto works quite well, so I don't see the need at this time. (If you like, you can open an issue on this topic so we can start a wider discussion among community members. I'm always open to changing my mind if there is sufficient demand.)

yes, naming is always complicated:) thanks for the answers @fgvanzee and @jeffhammond

#13132 and #13158 are related

The discussion got a bit carried away; what are the remaining issues that need to be resolved to officially support BLIS in numpy?

Naively, I tried to run numpy tests with BLIS from conda-forge (cf https://github.com/numpy/numpy/issues/14180#issuecomment-525292558 ) and for me, on Linux, all tests passed (but maybe I missed something).

I also tried to run the scipy test suite in the same env, and there are a number of failures in scipy.linalg (cf https://github.com/scipy/scipy/issues/10744), in case someone has comments on that.

Note that BLIS from conda-forge uses reference (netlib) LAPACK as the LAPACK implementation, which in turn uses BLIS as the BLAS implementation; it does not use libflame.

About the BLIS option on conda-forge, am I right that it's single-threaded out of the box (unlike the OpenBLAS option)?

Probably better to move conda-forge discussions to conda-forge 🙂
