Lapack: Processor requirements for LAPACK

Created on 4 Jun 2021  ·  16 Comments  ·  Source: Reference-LAPACK/lapack

We are working on a translation of LAPACK to .NET. We wrote a FORTRAN compiler which successfully translates all of LAPACK, including all tests. On real data types, (almost) all tests pass. On complex data, we are still seeing a few precision issues.

Example:
XEIGTSTZ < zec.in - fails due to underflow in ZLAHQR.
Steps to reproduce: ZGET37 -> knt == 31, ZHSEQR -> ZLAHQR -> at the end of the second QR step (ITS == 2) the following code causes underflow (on certain registers, see below)

    TEMP = H( I, I-1 )
    IF( DIMAG( TEMP ).NE.RZERO ) THEN
        RTEMP = ABS( TEMP)    ! <-- underflow on TEMP = (~1e-0173, ~1e-0173)
        IF (RTEMP .EQ. RZERO) RTEMP = CABS1(TEMP)
        H( I, I-1 ) = RTEMP
        TEMP = TEMP / RTEMP
        IF( I2.GT.I )

Our compiler targets the .NET CLR. Its JIT decides to use SSE registers for ABS(TEMP), which leads to the underflow in the intermediate calculation of the magnitude. Ifort (as another example) uses x87 floating-point registers in this situation and hence does not underflow (because of their larger width: 80 bits). I am trying to get a clearer picture of what LAPACK requires from the compiler / processor at runtime in terms of precision and number range.
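The failure mode described above can be reproduced in any IEEE binary64 environment that has no wider intermediates. The following is a minimal sketch in Python, not the .NET or Fortran code from this report. It assumes a CPython build whose float is a plain 64-bit double. The helper naive_abs is hypothetical and only mimics a straightforward sqrt(re*re + im*im) implementation of ABS, while math.hypot stands in for an implementation that scales internally.

```python
import math
import sys

def naive_abs(z: complex) -> float:
    # Straightforward magnitude: the squares are formed in binary64,
    # so they flush to zero when both components are far below 1.5e-154.
    return math.sqrt(z.real * z.real + z.imag * z.imag)

temp = complex(1e-173, 1e-173)

# Smallest positive subnormal double (2**-1074), for comparison with 1e-346.
smallest_subnormal = sys.float_info.min * sys.float_info.epsilon

naive = naive_abs(temp)                      # both squares underflow to 0.0
robust = math.hypot(temp.real, temp.imag)    # scaled internally, no underflow

print("smallest subnormal:", smallest_subnormal)
print("naive ABS :", naive)
print("robust ABS:", robust)
```

The square of 1e-173 is about 1e-346, which lies below the smallest subnormal double, so the naive sum is zero even though the true magnitude is perfectly representable.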

Are all tests for double precision designed to require at least 64-bit registers? Or are they designed to succeed for the set of popular FORTRAN compilers available today? (In the first case, the above issue (and similar others) may require attention. Should I file an issue for them?)

I looked for a specification but couldn't find one yet. Any link would also be appreciated. Thanks in advance!

Question

All 16 comments

The underflow itself is not the true problem. After underflow, the algorithm switches to CABS1, which is less prone to underflow. The problem this creates is that TEMP will not have exactly unit modulus, leading to roundoff in Z.

A possible solution is to prescale using CABS1 and then correct using ABS (because of the first scaling, ABS should no longer underflow). (I don't get the underflow on my machine, so I can't test it for you.)

IF (RTEMP .EQ. RZERO) THEN
    RTEMP = CABS1(TEMP)
    H( I, I-1 ) = RTEMP
    TEMP = TEMP / RTEMP
    RTEMP = ABS( TEMP)
    H( I, I-1 ) = H( I, I-1 )*RTEMP
    TEMP = TEMP / RTEMP
END IF
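To see why the second correction step matters, here is a rough Python sketch of both variants. It is an illustration under assumptions, not the LAPACK code: naive_abs, cabs1, normalize_with_fallback and normalize_two_stage are hypothetical names, naive_abs deliberately mimics an ABS that underflows, and Python floats are assumed to be IEEE binary64.

```python
import math

def naive_abs(z: complex) -> float:
    # Mimics an ABS that squares the components without scaling.
    return math.sqrt(z.real * z.real + z.imag * z.imag)

def cabs1(z: complex) -> float:
    # Same definition as the CABS1 statement function: |re| + |im|.
    return abs(z.real) + abs(z.imag)

def scale(z: complex, r: float) -> complex:
    # COMPLEX / REAL done component-wise.
    return complex(z.real / r, z.imag / r)

def normalize_with_fallback(temp: complex):
    # Original logic: use CABS1 as the divisor if ABS underflowed.
    rtemp = naive_abs(temp)
    if rtemp == 0.0:
        rtemp = cabs1(temp)
    return rtemp, scale(temp, rtemp)

def normalize_two_stage(temp: complex):
    # Suggested logic: prescale by CABS1, then correct with ABS.
    rtemp = naive_abs(temp)
    if rtemp == 0.0:
        rtemp = cabs1(temp)
        h = rtemp
        temp = scale(temp, rtemp)
        rtemp = naive_abs(temp)      # components are now of order 1
        h = h * rtemp
    else:
        h = rtemp
    return h, scale(temp, rtemp)

z = complex(1e-173, 1e-173)
for f in (normalize_with_fallback, normalize_two_stage):
    h, t = f(z)
    print(f.__name__, "H =", h, "|TEMP| =", math.hypot(t.real, t.imag))
```

With the plain fallback the scaling factor TEMP ends up with a modulus well below 1, whereas the two-stage version restores a modulus of 1 up to rounding.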

I think the tests are definitely designed to succeed for the set of popular FORTRAN compilers, because that is simply how they are run. Predicting under/overflow is incredibly hard. At least in my case, these subroutines are designed by simply testing them (using the popular compilers) thoroughly and fixing any over/underflow we find.

Thank you! This is very helpful.
We had tried to recover from underflow using CABS1, but our attempt did not go far enough. Your suggestion seems to do much better. Using ...

*
*        Ensure that H(I,I-1) is real.
*
         TEMP = H( I, I-1 )
         IF( DIMAG( TEMP ).NE.RZERO ) THEN
            RTEMP = ABS( TEMP)
            IF (RTEMP .EQ. RZERO) THEN 
                RTEMP = CABS1(TEMP)
                H( I, I-1 ) = RTEMP
                TEMP = TEMP / RTEMP
                RTEMP = ABS( TEMP)
                H( I, I-1 ) = H( I, I-1 )*RTEMP
            ELSE 
                H( I, I-1 ) = RTEMP
            END IF
            TEMP = TEMP / RTEMP
            IF( I2.GT.I )
     $         CALL ZSCAL( I2-I, DCONJG( TEMP ), H( I, I+1 ), LDH )
            CALL ZSCAL( I-I1, TEMP, H( I1, I ), 1 )
            IF( WANTZ ) THEN
               CALL ZSCAL( NZ, TEMP, Z( ILOZ, I ), 1 )
            END IF
         END IF
*
  130 CONTINUE

... this iteration completes successfully (even when using SSE registers for ABS()).

> I think the tests are definitely designed to succeed for the set of popular FORTRAN compilers, because that is simply how they are run. Predicting under/overflow is incredibly hard. At least in my case, these subroutines are designed by simply testing them (using the popular compilers) thoroughly and fixing any over/underflow we find.

The test suite is of tremendous help! My rough estimate is that far less than 1% of the tests are affected by this or similar overflow issues (when using our compiler). Making the tests even more robust against under-/overflow could help to bring LAPACK to more platforms. Our (failed) attempt above is just one example, which clearly shows that we would hardly be able to come up with a fix on our side, though. Before opening multiple related issues, I would like to start a discussion on whether there is interest in such a journey and what a good approach would be.

Thanks for the improvement @hokb and @thijssteel! Should I write a PR with the modifications or are you willing to do that, @hokb?

Given my limited experience with the project, I would appreciate your effort and the chance to take your PR as a guideline for potential future PRs from us... (if that's ok?)

Hi @hokb,

> I looked for a specification but couldn't find one yet. Any link would also be appreciated. Thanks in advance!

I am not sure anything is specified anywhere.

> Our compiler targets the .NET CLR. Its JIT decides to use SSE registers for ABS(TEMP), which leads to the underflow in the intermediate calculation of the magnitude. Ifort (as another example) uses x87 floating-point registers in this situation and hence does not underflow (because of their larger width: 80 bits). I am trying to get a clearer picture of what LAPACK requires from the compiler / processor at runtime in terms of precision and number range.

Bold statement: If all computations are done using IEEE 64-bit arithmetic, then LAPACK should work.

LAPACK does not expect 80-bit registers to help its computation at any time. The algorithms are designed with 64-bit arithmetic in mind. Now, as mentioned by @thijssteel, LAPACK is tested with various compilers/architectures, and these compilers/architectures use 80-bit registers at times. So we might think our algorithms only need 64-bit arithmetic all along when, in effect, they do require 80-bit registers.

We have not done anything systematic in our journey to go after these issues. In general, we are happy enough when the algorithms pass the test suite, and, if there is some help from 80-bit registers, so be it.

> Are all tests for double precision designed to require at least 64-bit registers? Or are they designed to succeed for the set of popular FORTRAN compilers available today? (In the first case, the above issue (and similar others) may require attention. Should I file an issue for them?)

> My rough estimate is that far less than 1% of the tests are affected by this or similar overflow issues (when using our compiler).

Oh my. 1%? That is a scary large number.

The tests exercise the underflow and overflow regions heavily, so it is to be expected that they trigger this kind of issue much more often than users' code does, but still.

> Making the tests even more robust against under-/overflow could help to bring LAPACK to more platforms.

Portability to more platforms is one interest indeed. Another interest is extended precision with packages such as GMP where, as I understand, the precision is fixed throughout the computation. (So, for example, if you work with 256 bits throughout, there is no 300-bit register to come help you.)

> Our (failed) attempt above is just one example, which clearly shows that we would hardly be able to come up with a fix on our side, though. Before opening multiple related issues, I would like to start a discussion on whether there is interest in such a journey and what a good approach would be.

Yes. We are interested. We can only do so much though. And we have a lot on our plates. So maybe we take this one issue at a time, and we see how far we go.

In any case, posting issues on GitHub is always a good idea. It raises awareness of the problem, and it helps gather help and ideas to fix the issues.

I am happy we go down this path, but I would recommend to take it easy.

Maybe, for gfortran, we should compile with the flags -mfpmath=sse -msse2 for testing purposes. I think this will force all computation to be done with 64-bit arithmetic. I am not sure though.

> Given my limited experience with the project, I would appreciate your effort and the chance to take your PR as a guideline for potential future PRs from us... (if that's ok?)

Sure! Please, see #577.

@weslleyspereira Awesome! I am still checking whether this applies to CLAHQR in the same way. Will post my result ASAP (tomorrow).

Hello @langou !

> Bold statement: If all computations are done using IEEE 64-bit arithmetic, then LAPACK should work.

Nice! I suppose, by 'work' we mean: when fed with data 'in a certain range' it will not overflow due to a given register size?

> LAPACK does not expect 80-bit registers to help its computation at any time. The algorithms are designed with 64-bit arithmetic in mind. Now, as mentioned by @thijssteel, LAPACK is tested with various compilers/architectures, and these compilers/architectures use 80-bit registers at times. So we might think our algorithms only need 64-bit arithmetic all along when, in effect, they do require 80-bit registers.
>
> We have not done anything systematic in our journey to go after these issues. In general, we are happy enough when the algorithms pass the test suite, and, if there is some help from 80-bit registers, so be it.

Sounds very reasonable!

> My rough estimate is that far less than 1% of the tests are affected by this or similar overflow issues (when using our compiler).

> Oh my. 1%? That is a scary large number.

Well, likely it is 'far less' than that ;)

> Portability to more platforms is one interest indeed. Another interest is extended precision with packages such as GMP where, as I understand, the precision is fixed throughout the computation. (So, for example, if you work with 256 bits throughout, there is no 300-bit register to come help you.)

Sounds interesting, but I cannot comment on this, since I lack experience with such fixed-precision approaches.

> Yes. We are interested. We can only do so much though. And we have a lot on our plates. So maybe we take this one issue at a time, and we see how far we go.

I am still unsure what a good general approach would be. Bear with me if my understanding is too naive. But doesn't over-/underflow always depend on both the input data and the algorithm? So, instead of flooding the code with new conditionals that test for under-/overflow and new code that recovers from it, we might instead decrease the 'allowed range' for the input data? I don't have the necessary insight into the effort required for either approach, though. So I cannot judge which would be more feasible.
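One way to make the 'allowed range' idea concrete, at least for an unscaled ABS, is to ask for which component magnitudes the squares stay inside the normal double range. The sketch below is in Python and assumes IEEE binary64 floats. The names lower, upper and squares_are_safe are hypothetical, and the check is deliberately conservative: a tiny component next to a large one would underflow harmlessly but is still rejected here.

```python
import math
import sys

# Bounds within which x*x + y*y neither underflows nor overflows in binary64.
lower = math.sqrt(sys.float_info.min)          # roughly 1.49e-154
upper = math.sqrt(sys.float_info.max / 2.0)    # roughly 9.48e+153

def squares_are_safe(x: float, y: float) -> bool:
    """Conservative check: every non-zero component lies in [lower, upper]."""
    return all(c == 0.0 or lower <= abs(c) <= upper for c in (x, y))

print("lower bound:", lower)
print("upper bound:", upper)
print("(1e-173, 1e-173) safe?", squares_are_safe(1e-173, 1e-173))
print("(1e-100, 1e-100) safe?", squares_are_safe(1e-100, 1e-100))
```

The value from this report, about 1e-173, lies almost twenty orders of magnitude below the lower bound, which illustrates how much of the double range such a restriction would give up.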

> In any case, posting issues on GitHub is always a good idea. It raises awareness of the problem, and it helps gather help and ideas to fix the issues.

Good. We will file issues as we go. I understand that it will be a challenge to come up with a fix without being able to reproduce an underflow. So, what information can we provide to make the issue more clear? Does the path down to the concrete underflow help? I.e.: providing iteration counts, current values of locals together with file names etc.?

> I am happy we go down this path, but I would recommend to take it easy.

Same thing here! :)

One outcome of #577 is that LAPACK relies on the FORTRAN compiler to implement reasonably robust (under- / overflow) complex division and ABS(). I wonder if we should start maintaining a document, collecting such and similar requirements? They will be equally important and useful for anyone wanting to use LAPACK with other / new compilers, for compiler builders and in order to transfer parts or all of LAPACK algorithms to other languages?
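As an illustration of what "reasonably robust" complex division can mean, the sketch below compares the textbook formula with Smith's algorithm, a well-known scaled variant. This is written in Python under the assumption of IEEE binary64 floats. It is not taken from LAPACK or from any particular Fortran compiler, and naive_div and smith_div are hypothetical names. Division by zero is not handled.

```python
import math

def naive_div(a: complex, b: complex) -> complex:
    # Textbook formula; c*c + d*d can overflow or underflow.
    c, d = b.real, b.imag
    den = c * c + d * d
    return complex((a.real * c + a.imag * d) / den,
                   (a.imag * c - a.real * d) / den)

def smith_div(a: complex, b: complex) -> complex:
    # Smith's algorithm: divide through by the larger component of b first,
    # so no square of c or d is ever formed.
    c, d = b.real, b.imag
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return complex((a.real + a.imag * r) / den,
                       (a.imag - a.real * r) / den)
    r = c / d
    den = c * r + d
    return complex((a.real * r + a.imag) / den,
                   (a.imag * r - a.real) / den)

big = complex(1e200, 1e200)
print("naive:", naive_div(big, big))   # squares overflow, result is not finite
print("smith:", smith_div(big, big))
```

Dividing a number by itself should give 1, but the naive formula overflows in its intermediate products, while the scaled variant stays within range for the same input.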

Sure! It will be good to have this information well documented.

To begin with, I spent some time tracking (maybe) all divisions in the files LAPACK/SRC/z*.f (COMPLEX*16 algorithms) of the form

 REAL / COMPLEX   or   COMPLEX / COMPLEX

I found a total of 53 files. See the attached file: complexDivisionFound.code-search

  • For that, I used the following regular expression in Visual Studio Code:

    \n .*/^0-9(?!DBLE)(?!REAL)(?!MIN)(?!MAX)[^0-9]

> Maybe, for gfortran, we should compile with the flags -mfpmath=sse -msse2 for testing purposes. I think this will force all computation to be done with 64-bit arithmetic. I am not sure though.

Yes, it should when using GCC but this flag should also be set by default on x86-64. The documentation excerpt below is for GCC 11 but much older GCC versions should exhibit the same behavior. Using the GNU Compiler Collection (GCC): 3.19.59 x86 Options

> sse
>
> Use scalar floating-point instructions present in the SSE instruction set. This instruction set is supported by Pentium III and newer chips, and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips. The earlier version of the SSE instruction set supports only single-precision arithmetic, thus the double and extended-precision arithmetic are still done using 387. A later version, present only in Pentium 4 and AMD x86-64 chips, supports double-precision arithmetic too.
>
> For the x86-32 compiler, you must use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For the x86-64 compiler, these extensions are enabled by default.
>
> The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.
>
> This is the default choice for the x86-64 compiler, Darwin x86-32 targets, and the default choice for x86-32 targets with the SSE2 instruction set when -ffast-math is enabled.

> Example:
> XEIGTSTZ < zec.in - fails due to underflow in ZLAHQR.
> Steps to reproduce: ZGET37 -> knt == 31, ZHSEQR -> ZLAHQR -> at the end of the second QR step (ITS == 2) the following code causes underflow (on certain registers, see below)
>
>     TEMP = H( I, I-1 )
>     IF( DIMAG( TEMP ).NE.RZERO ) THEN
>         RTEMP = ABS( TEMP)    ! <-- underflow on TEMP = (~1e-0173, ~1e-0173)
>         IF (RTEMP .EQ. RZERO) RTEMP = CABS1(TEMP)
>         H( I, I-1 ) = RTEMP
>         TEMP = TEMP / RTEMP
>         IF( I2.GT.I )
>
> Our compiler targets the .NET CLR. Its JIT decides to use SSE registers for ABS(TEMP), which leads to the underflow in the intermediate calculation of the magnitude. Ifort (as another example) uses x87 floating-point registers in this situation and hence does not underflow (because of their larger width: 80 bits). I am trying to get a clearer picture of what LAPACK requires from the compiler / processor at runtime in terms of precision and number range.

To recap:

  • You transpiled half a million lines of Fortran77 into C#.
  • Testing of the transpiled code fails when using the .NET just-in-time compiler.
  • Testing of the vanilla LAPACK code succeeds when using the Intel Fortran compiler (ifort).
  • The observed difference between the two cases is the use of 80-bit intermediates by ifort which avoids an underflow.

Correct?

By default GCC generates only code for 64-bit float registers on x86-64 and on my machines usually all LAPACK tests pass except for one or two.

Does the Netlib LAPACK test suite pass when compiling with GCC?

edit: resolved https://github.com/Reference-LAPACK/lapack/pull/577#issuecomment-859496175

> Maybe, for gfortran, we should compile with the flags -mfpmath=sse -msse2 for testing purposes. I think this will force all computation to be done with 64-bit arithmetic. I am not sure though.

I tried -mfpmath=sse -msse2 with GCC 11 on both macOS and Linux: https://github.com/weslleyspereira/lapack/actions/runs/966071530. There were no additional errors compared to the workflow without those flags: https://github.com/Reference-LAPACK/lapack/actions/runs/945060330. See #591

@hokb, could you reproduce the overflow issues you mentioned in https://github.com/Reference-LAPACK/lapack/issues/575#issuecomment-855880000 with GCC using SSE flags? Can you help me with that?

@weslleyspereira I haven't even tried GCC. All I have access to / a running setup for is ifort on Windows. It would take me some days to get GCC up and running via Cygwin to test (especially from my current holiday hotel room ... :| ). Let me know if you need me to take on this challenge, though!
At least according to https://godbolt.org/z/YYv5oPxe9, using the flags does not affect the code generated by gfortran. But of course only a test run will tell for sure ...

I don't use Windows, but I do have it here. I will start by testing LAPACK with ifort on my Ubuntu machine and see what happens. Enjoy the holiday!
