Numpy: The PCG implementation provided by Numpy has significant, dangerous self-correlation

Created on 20 May 2020  ·  104 Comments  ·  Source: numpy/numpy

The PCG generator used by Numpy has a significant amount of self-correlation. That is, for each sequence generated from a seed there is a large number of correlated, nonoverlapping sequences starting from other seeds. By "correlated" I mean that if you interleave two such sequences and test the result, you obtain failures that did not appear in either sequence individually.

The probability that two generators out of a large set of terminals get two of those sequences is nonnegligible. Why this happens is, from a mathematical viewpoint, well known, but it is explained here in detail: http://prng.di.unimi.it/pcg.pgp (see "Subsequences within the same generator").

To show this problem directly, I wrote this simple C program reusing the Numpy code: http://prng.di.unimi.it/intpcgnumpy.c . The program takes the two 128-bit states of two generators (with the same LCG constant, or "stream") in the form of high and low bits, interleaves their output, and writes it in binary form. Once we send that output through PractRand, we should see no statistical failure, as the two streams should be independent. But if you start from two states with the same 64 lower bits, you get:

./intpcgnumpy 0x596d84dfefec2fc7 0x6b79f81ab9f3e37b 0x8d7deae980a64ab0 0x6b79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded
RNG_test using PractRand version 0.94
RNG = RNG_stdin, seed = unknown
test set = expanded, folding = extra

rng=RNG_stdin, seed=unknown
length= 128 megabytes (2^27 bytes), time= 2.2 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(0+0,13-2,T)                  R= +27.6  p =  1.0e-13    FAIL
  BCFN(0+1,13-2,T)                  R= +68.0  p =  2.3e-34    FAIL !!!
  BCFN(0+2,13-3,T)                  R= +90.8  p =  8.8e-43    FAIL !!!
  BCFN(0+3,13-3,T)                  R=+120.6  p =  6.9e-57    FAIL !!!!
  DC6-6x2Bytes-1                    R=  +8.9  p =  4.0e-5   mildly suspicious
  DC6-5x4Bytes-1                    R= +15.7  p =  4.3e-9   very suspicious
  [Low1/8]BCFN(0+0,13-4,T)          R= +11.6  p =  4.9e-5   unusual
  ...and 1074 test result(s) without anomalies

You can even go lower—you just need the same 58 lower bits:

./intpcgnumpy 0x596d84dfefec2fc7 0x0579f81ab9f3e37b 0x8d7deae980a64ab0 0x6b79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded

[...]
rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 453 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/16]FPF-14+6/32:cross        R= +11.6  p =  4.0e-10   VERY SUSPICIOUS
  [Low1/32]FPF-14+6/32:cross        R= +16.5  p =  3.2e-14    FAIL
  [Low1/32]FPF-14+6/16:cross        R= +12.8  p =  3.8e-11   VERY SUSPICIOUS
  [Low1/64]FPF-14+6/64:cross        R=  +6.8  p =  4.8e-6   mildly suspicious
  [Low1/64]FPF-14+6/32:cross        R=  +6.0  p =  1.9e-5   unusual
  [Low1/64]FPF-14+6/16:cross        R=  +5.5  p =  5.8e-5   unusual
  [Low4/32]FPF-14+6/64:all          R=  +5.8  p =  5.9e-5   unusual
  [Low4/32]FPF-14+6/32:(0,14-0)     R=  +7.7  p =  1.0e-6   unusual
  [Low4/32]FPF-14+6/32:(1,14-0)     R=  +7.7  p =  9.1e-7   unusual
  [Low4/32]FPF-14+6/32:all          R=  +6.5  p =  1.3e-5   unusual
  [Low4/64]FPF-14+6/64:all          R=  +5.9  p =  5.1e-5   unusual
  [Low4/64]FPF-14+6/64:cross        R=  +8.2  p =  3.0e-7   suspicious
  [Low4/64]FPF-14+6/32:(0,14-0)     R=  +7.6  p =  1.0e-6   unusual
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +17.0  p =  2.2e-15    FAIL
  [Low8/64]FPF-14+6/64:(1,14-0)     R=  +9.1  p =  5.1e-8   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R= +12.7  p =  2.1e-11   VERY SUSPICIOUS
  [Low8/64]FPF-14+6/32:(0,14-0)     R= +12.8  p =  1.7e-11   VERY SUSPICIOUS
  [Low8/64]FPF-14+6/32:all          R= +11.0  p =  9.3e-10   VERY SUSPICIOUS
  ...and 1696 test result(s) without anomalies
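For readers who want to reproduce this without the C harness, below is a rough Python sketch of the same experiment (the linked intpcgnumpy.c is the authoritative version; this NumPy-based variant is illustrative and unoptimized). It forces two PCG64 instances onto 128-bit states sharing their low 64 bits and the same increment, then interleaves their raw 64-bit outputs to stdout for PractRand:

import sys
import numpy as np
from numpy.random import Generator, PCG64

def pcg64_from_raw(state128, inc128):
    # Overwrite the full 128-bit LCG state and increment (inc should be odd).
    bg = PCG64()
    st = bg.state
    st["state"]["state"] = state128
    st["state"]["inc"] = inc128
    bg.state = st
    return bg

INC = 0x5851F42D4C957F2D14057B7EF767814F  # arbitrary odd increment, shared
LO = 0x6B79F81AB9F3E37B                   # identical lower 64 bits of state
g0 = Generator(pcg64_from_raw((0x596D84DFEFEC2FC7 << 64) | LO, INC))
g1 = Generator(pcg64_from_raw((0x8D7DEAE980A64AB0 << 64) | LO, INC))

buf = np.empty(2 * 4096, dtype=np.uint64)
while True:
    # Interleave the two outputs and stream the raw bytes to PractRand.
    buf[0::2] = g0.integers(0, 2**64, size=4096, dtype=np.uint64)
    buf[1::2] = g1.integers(0, 2**64, size=4096, dtype=np.uint64)
    sys.stdout.buffer.write(buf.tobytes())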

Note that to get a more than 50% probability that two generators start from two correlated seeds (chosen at random), you need just about half a million generators starting at random (birthday paradox). And if you consider the probability that they do not start from exactly the same state, but have significantly overlapping correlated sequences, you need far fewer.

No sensible generator from the literature behaves like that. You can choose adversarially any two starting states of MRG32k3a, SFC64, CMWC, xoshiro256++, etc., and as long as you generate nonoverlapping sequences you will not see the failures above. This is a major drawback that can pop up when a number of devices use the generator and one assumes (as one should be able to) that pairwise those sequences show no correlation. The correlation can induce unwanted behavior that is hard to detect.

Please at least document somewhere that the generator should not be used on multiple terminals or in a highly parallel environment.

The same can happen with different "streams", as the sequences generated by an LCG by changing the additive constant are all the same modulo a change of sign and an additive constant. You can see some discussion here: https://github.com/rust-random/rand/issues/907 and a full mathematical discussion of the problem here: https://arxiv.org/abs/2001.05304 .

numpy.random

All 104 comments

@imneme, @bashtage, @rkern would be the authorities here, but I think we have gone over this and it is why we preferred the SeedSequence.spawn interface over the jumped one. For instance there was this discussion when we were discussing the API. Please check the advice here https://numpy.org/devdocs/reference/random/parallel.html and suggest improvements as needed.
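For reference, the spawning pattern recommended on that page looks roughly like this (standard NumPy API; the seed value is arbitrary):

from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)                     # or SeedSequence() for OS entropy
child_seeds = ss.spawn(10)                   # ten independent child sequences
streams = [default_rng(s) for s in child_seeds]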

@mattip This has nothing to do with jumping.

I think in practice it is difficult to make wholesale changes, although improved documentation is always a good idea.

I would probably recommend AESCounter for anyone with AES-NI or SPECK128 for anyone without in highly parallel settings.

The same can happen with different "streams", as the sequences generated by an LCG by changing the additive constant are all the same modulo a change of sign and an additive constant.

Can you quantify this? I can replicate the failures using the same increment, but we seed the increment as well as the state, and I have not yet observed a failure with two different random increments. If the increments also have to be carefully constructed, then that would affect the practical birthday collision frequency.

https://gist.github.com/rkern/f46552e030e59b5f1ebbd3b3ec045759

❯ ./pcg64_correlations.py --same-increment | stdbuf -oL ./RNG_test stdin64 -tf 2 -te 1 -tlmaxonly -multithreaded
0x56b35656ede2b560587e4251568a8fed
0x93526034ed105e9e587e4251568a8fed
[
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 115244779949650410574112983538102603757,
            "inc": 137507567477557873606783385380908979143
        },
        "has_uint32": 0,
        "uinteger": 0
    },
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 195824235027336627448689568147458133997,
            "inc": 137507567477557873606783385380908979143
        },
        "has_uint32": 0,
        "uinteger": 0
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x4bf19f7b
test set = expanded, folding = extra

rng=RNG_stdin64, seed=0x4bf19f7b
length= 128 megabytes (2^27 bytes), time= 3.0 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN_FF(2+0,13-3,T)               R= +59.9  p =  3.8e-28    FAIL !!!       
  BCFN_FF(2+1):freq                 R= +89.0  p~=   6e-18     FAIL !         
  BCFN_FF(2+2):freq                 R= +39.6  p~=   6e-18     FAIL !         
  BCFN_FF(2+3):freq                 R= +14.6  p~=   6e-18     FAIL !         
  BCFN_FF(2+4):freq                 R= +10.3  p~=   5e-11   very suspicious  
  DC6-9x1Bytes-1                    R=  +7.1  p =  5.6e-4   unusual          
  DC6-6x2Bytes-1                    R= +18.9  p =  1.0e-10   VERY SUSPICIOUS 
  DC6-5x4Bytes-1                    R= +11.2  p =  1.4e-6   suspicious       
  [Low4/16]BCFN_FF(2+0):freq        R= +19.5  p~=   6e-18     FAIL !         
  [Low4/16]FPF-14+6/16:all          R=  +5.6  p =  1.0e-4   unusual          
  [Low4/16]FPF-14+6/4:all           R=  +5.9  p =  4.6e-5   unusual          
  [Low4/32]BCFN_FF(2+0):freq        R=  +6.5  p~=   2e-5    unusual          
  [Low8/32]BCFN_FF(2+0):freq        R= +15.1  p~=   6e-18     FAIL !         
  [Low8/32]FPF-14+6/32:all          R=  +8.4  p =  2.5e-7   very suspicious  
  [Low8/32]FPF-14+6/32:all2         R=  +9.0  p =  7.8e-5   unusual          
  [Low8/32]FPF-14+6/16:(0,14-0)     R= +12.4  p =  4.5e-11   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/16:all          R= +15.5  p =  5.2e-14    FAIL           
  [Low8/32]FPF-14+6/16:all2         R= +41.4  p =  2.6e-16    FAIL !         
  [Low8/32]FPF-14+6/4:(0,14-0)      R=  +6.9  p =  5.9e-6   unusual          
  [Low8/32]FPF-14+6/4:all           R=  +7.9  p =  6.6e-7   suspicious       
  ...and 871 test result(s) without anomalies

OK, I'll try again.

There are no multiple streams in an LCG with a power-of-2 modulus. Many believed this in the early days, and there are even long old papers claiming to do interesting stuff with those "streams", but it has been known for decades that the orbits you obtain by changing the constants are _all the same modulo an additive constant and possibly a sign change_. The earliest reference I can trace is

Mark J. Durst, Using linear congruential generators for parallel random number generation,
1989 Winter Simulation Conference Proceedings, IEEE Press, 1989, pp. 462–466.

So, I wrote another program http://prng.di.unimi.it/corrpcgnumpy.c in which you can set:

  • An initial state for a PRNG.
  • An initial state for another PRNG.
  • An arbitrary "stream constant" for the first PRNG.
  • An arbitrary "stream constant" for the second PRNG (they should be both even or both odd; this restriction can be removed with some additional fiddling).
  • A fixed number of lower bits that we will set adversarially in the second PRNG, essentially in such a way that it starts with the same bits of the first PRNG. The rest of the bits will be taken from the initial state for the second PRNG you have provided.

So this is _exactly_ the setting of the first program, but you can also choose the constants.

./corrpcgnumpy 0x596d84dfefec2fc7 0x6b79f81ab9f3e37b 0xac9c8abfcb89f65f 0xe42e8dff1c46de8b 0x8d7deae9efec2fc7 0x6b79f81ab9f3e37b 0x06e13e5e8c92c843 0xf92e8346feee7a21 56 | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 113 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]BCFN(0+0,13-1,T)          R= +27.2  p =  4.0e-14    FAIL
  [Low1/8]DC6-6x2Bytes-1            R= +10.9  p =  4.4e-6   suspicious
  [Low1/64]DC6-5x4Bytes-1           R=  -6.4  p =1-1.4e-4   unusual
  [Low8/64]FPF-14+6/64:(0,14-0)     R=  +8.4  p =  2.2e-7   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R=  +8.7  p =  1.2e-7   suspicious
  [Low8/64]FPF-14+6/32:(0,14-0)     R= +10.2  p =  5.1e-9   suspicious
  [Low8/64]FPF-14+6/32:all          R=  +9.4  p =  2.7e-8   very suspicious
  [Low8/64]FPF-14+6/16:all          R=  +5.8  p =  6.4e-5   unusual
  ...and 1439 test result(s) without anomalies

So there are _at least_ 2^72 correlated subsequences, no matter how you choose the "stream constants", exactly as in the same-constant case.

And we're giving a ridiculous amount of slack to the generator: even if, instead of the exact starting point I'm calculating, you used a state a little bit before or after it, correlation would show up anyway. You can easily modify the program with an additional parameter to do that.

I repeat, once again: no existing modern generator from the scientific literature has this misbehavior (of course, a power-of-2 LCG has this behavior, but, for God's sake, that's _not_ a modern generator).

Sebastiano's critiques of PCG are addressed in this blog post from 2018.

The short version is that if you're allowed to contrive specific seeds, you can show “bad looking” behavior out of pretty much any PRNG. Notwithstanding his claim that PCG is somehow unique, actually PCG is pretty conventional — PCG's streams are no worse than, say, SplitMix's, which is another widely used PRNG.

That is entirely false. To prove me wrong, show two correlated nonoverlapping sequences from MRG32k3a or xoshiro256++.

I never said non-overlapping. Show me a currently available test for xoshiro256++ that verifies two seeds avoid overlap.

In contrast, I do have a test for PCG that shows that the “correlations” you showed are essentially a form of overlap.

I can't fight FUD like "essentially" and "a form", but I modified http://prng.di.unimi.it/intpcgnumpy.c so that it initially iterates each PRNG 10 billion times, and exits with an error message if the generated sequence traverses the initial state of the other PRNG. This guarantees that the first 160GB of data fed into PractRand come from non-overlapping sequences:

./intpcgnumpy 0x596d84dfefec2fc7 0x0579f81ab9f3e37b 0x8d7deae980a64ab0 0x6c79f81ab9f3e37b | stdbuf -oL ~/svn/c/xorshift/practrand/RNG_test stdin -tf 2 -te 1 -tlmaxonly -multithreaded
[...]
rng=RNG_stdin, seed=unknown
length= 64 gigabytes (2^36 bytes), time= 926 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]FPF-14+6/64:(0,14-0)      R=  +8.8  p =  8.7e-8   mildly suspicious
  [Low1/8]FPF-14+6/64:all           R=  +6.3  p =  2.1e-5   unusual          
  [Low1/16]FPF-14+6/64:(0,14-0)     R=  +7.6  p =  1.1e-6   unusual          
  [Low1/16]FPF-14+6/64:(1,14-0)     R=  +8.3  p =  2.9e-7   mildly suspicious
  [Low1/16]FPF-14+6/64:all          R=  +8.0  p =  5.8e-7   suspicious       
  [Low1/16]FPF-14+6/32:all          R=  +7.1  p =  3.9e-6   mildly suspicious
  [Low1/64]FPF-14+6/32:cross        R=  +7.1  p =  2.6e-6   mildly suspicious
  [Low4/32]FPF-14+6/64:(0,14-0)     R= +13.5  p =  4.3e-12   VERY SUSPICIOUS 
  [Low4/32]FPF-14+6/64:all          R=  +9.0  p =  5.9e-8   very suspicious  
  [Low4/64]FPF-14+6/64:(0,14-0)     R= +11.4  p =  3.8e-10  very suspicious  
  [Low4/64]FPF-14+6/64:all          R=  +8.0  p =  5.3e-7   suspicious       
  [Low4/64]FPF-14+6/32:(0,14-0)     R= +10.3  p =  3.6e-9   suspicious       
  [Low4/64]FPF-14+6/32:all          R=  +6.1  p =  3.2e-5   unusual          
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +18.6  p =  8.4e-17    FAIL           
  [Low8/64]FPF-14+6/64:(1,14-0)     R= +11.4  p =  3.9e-10  very suspicious  
  [Low8/64]FPF-14+6/64:(2,14-0)     R=  +8.3  p =  2.8e-7   mildly suspicious
  [Low8/64]FPF-14+6/64:all          R= +15.3  p =  6.9e-14    FAIL           
  [Low8/64]FPF-14+6/32:(0,14-0)     R=  +7.8  p =  7.1e-7   unusual          
  [Low8/64]FPF-14+6/32:(1,14-0)     R=  +7.2  p =  2.7e-6   unusual          
  [Low8/64]FPF-14+6/32:all          R=  +5.8  p =  6.9e-5   unusual          
  ...and 1786 test result(s) without anomalies

This particular initialization data has just 56 fixed lower bits, so one can generate 2^72 correlated sequences by flipping the higher bits. The statistical failures happen after just 64GB of data, showing that overlaps are not responsible for the correlation. It is possible that with other specific targeted choices overlap happens before 64GB, of course; this is a specific example. But it is now easy to check that overlap is not the problem: the generator simply has a lot of undesirable internal correlation.

Please respect the code of conduct. Try to keep your comments in tone with the directives to be "empathetic, welcoming, friendly, and patient" and "be careful in the words that we choose. We are careful and respectful in our communication".

I never said non-overlapping. Show me a currently available test for xoshiro256++ that verifies two seeds avoid overlap.

Well, it's trivial: decide the length of the stream, iterate, and check that neither stream crosses the other's initial state. It's the same code I used to show that the correlated PCG streams in the program http://prng.di.unimi.it/intpcgnumpy.c do not overlap.
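A minimal sketch of that check, with the LCG modeled in pure Python (the parameters a, c, mask are placeholders, not NumPy's code):

def crosses(start, other_start, n, a, c, mask):
    # Advance `start` for n steps and report whether the trajectory ever
    # lands exactly on the other generator's initial state.
    s = start
    for _ in range(n):
        s = (a * s + c) & mask
        if s == other_start:
            return True
    return False

def non_overlapping(s0, s1, n, a, c, mask):
    # Two length-n streams are non-overlapping iff neither crosses the
    # other's starting state.
    return not crosses(s0, s1, n, a, c, mask) and not crosses(s1, s0, n, a, c, mask)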

Notwithstanding his claim that PCG is somehow unique, actually PCG is pretty conventional — PCG's streams are no worse than, say, SplitMix's, which is another widely used PRNG.

IMHO, the self-correlation within PCG is much worse. There is no result for the additive generator underlying a SplitMix instance analogous to Durst's dramatic 1989 results about LCGs.

But the very mild problems of SplitMix are known, and JEP 356 will provide a new class of splittable generators, LXM, trying to address those problems. It would be time to move on and replace PCG, too, with something less flawed.

The underlying issue is known for both generators, and it is due to the lack of state mixing: if you change bit _k_ of the state of one of those generators, the change will never propagate below bit _k_. This does not happen in LCGs with prime modulus, in F₂-linear generators, in CMWC generators, etc. All other generators try to mix their state as quickly and as much as possible.
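A quick way to see this for the power-of-2 LCG case (illustrative Python; the multiplier is the 128-bit constant used by PCG's default LCG, quoted here as an assumption, and the increment is arbitrary):

MASK = (1 << 128) - 1
A = 0x2360ED051FC65DA44385DF649FCCF645  # PCG's 128-bit LCG multiplier (assumed)
C = 0x14057B7EF767814F                  # arbitrary odd increment

def lcg_step(s):
    return (A * s + C) & MASK

s = 0x0123456789ABCDEF0123456789ABCDEF
for k in (0, 17, 64, 100):
    d = lcg_step(s) ^ lcg_step(s ^ (1 << k))
    # Carries in + and * only propagate upward, so no bit below k changes;
    # since A is odd, the lowest differing bit is exactly k.
    assert d & ((1 << k) - 1) == 0
    print(k, (d & -d).bit_length() - 1)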

Equating the problems of PCG and SplitMix is a red herring. SplitMix has a very simple underlying generator, just additive, but on top of that there is a scrambling function that is very powerful: it is Appleby's 64-bit finalizer for the MurmurHash3 hash function, which has been widely used in a number of contexts and was further improved by Stafford (http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html). The constants of the function have been trained to have specific, measurable avalanching properties. Even changes in a small number of bits tend to spread over all the output. In other words, SplitMix stands on the shoulders of giants.
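For concreteness, here is a sketch of SplitMix64 (the constants are those of Stafford's "mix13" variant used in Java's SplittableRandom; illustrative, not NumPy's code):

MASK64 = (1 << 64) - 1

def mix64(z):
    # Stafford's improved variant of the MurmurHash3 64-bit finalizer.
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def splitmix64(seed):
    # A weak underlying generator (a Weyl sequence) under a strong finalizer.
    state = seed & MASK64
    while True:
        state = (state + 0x9E3779B97F4A7C15) & MASK64
        yield mix64(state)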

On the contrary, the LCG underlying the PCG generators has the same lack-of-mixing problem, but the scrambling functions are just a simple sequence of arithmetic and logical operations assembled by the author without any theoretical or statistical guarantee. Had they been devised taking into account the fact that all sequences of the underlying LCG are the same modulo an additive constant and possibly a sign change, it would have been possible to address the problem.

But the author had no idea that the sequences were so easily derivable one from another. This can be easily seen from this statement in Section 4.2.3 of the PCG technical report (https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf):

"Every choice of _c_ results in a different sequence of numbers that has none of its pairs of successive outputs in common with another sequence."

This is taken as proof that the sequences are different, that is, that the underlying LCG provides multiple streams. Durst's 1989 negative results on this topic do not appear anywhere in the paper. As I remarked earlier, by those results all such sequences are the same, modulo an additive constant and possibly a sign change (for LCGs with power-of-2 modulus of maximum potency, as is the case in PCG).

I'm sure not quoting Durst's results is a _bona fide_ mistake, but the problem is that once you are convinced the underlying LCG you are using provides "streams" that are "different" in some sense, when they are not, you end up with a generator like PCG in which for each subsequence there are 2^72 non-overlapping, correlated subsequences, even if you change the "stream".

Thank you all for your input. For the moment, I am not interested in binary judgments like "PCG is good/bad". Please use your own forums for such discussions. What is on-topic here is what numpy will do, and that final judgment belongs to the numpy developers. We do appreciate the expertise that you all bring to this discussion, but I want to focus it on the underlying facts rather than the final judgment. I especially appreciate quantitative statements that give me an idea of the amount of headroom that we have. If my earlier judgments were wrong, it was because I jumped to judgment too soon, so I would appreciate your assistance in avoiding that again. Thank you.

Note that to get a more than 50% probability that two generators start from two correlated seeds (chosen at random), you need just about half a million generators starting at random (birthday paradox).

@vigna Can you walk me through this calculation? The birthday collision calculation that I am familiar with gives a 50% chance of an n-bit collision at 2**(n/2) items (give or take a factor of 2). Half a million is about 2**19, so you seem to be claiming that the dangerous correlations start at around a 40-bit collision in the lower bits, but I have not seen evidence that this is practically observable. I have tested a pair sharing the lower 40 bits and got to 16 TiB in PractRand before cancelling the test. If you have observed a failure with a 40-bit collision, how many TiB did you have to test to see it?

I am convinced that changing the increment doesn't affect the probability of collision. Further discussion of the merits of "PCG streams" is off-topic. Using that discussion as an excuse to repeatedly hammer on "the author" is especially unwelcome and treads on our code of conduct. Persisting will mean that we will have to proceed without your input. Thank you.

@imneme This seems to be related to the issues with jumping by a multiple of a large power of 2. When I construct a pair of PCG64 instances with the same increment and sharing the lower n bits of state, the distance that I calculate between the two is a multiple of 1 << n. Your stronger DXSM output function appears to resolve this manifestation as well. I've tested a pair of PCG64DXSM instances that share an increment and the lower 64 bits of state out to 2 TiB without issue.
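(As background for the distance and jump computations mentioned here: advancing a power-of-2 LCG by N steps takes O(log N) work via a square-and-multiply recurrence, the same idea behind PCG's pcg_advance; computing a distance inverts the process. A sketch with generic parameters:)

def lcg_advance(state, delta, mult, inc, mask):
    # Jump the LCG s -> mult*s + inc (mod mask+1) ahead by delta steps.
    acc_mult, acc_inc = 1, 0
    while delta > 0:
        if delta & 1:
            acc_mult = (acc_mult * mult) & mask
            acc_inc = (acc_inc * mult + inc) & mask
        inc = ((mult + 1) * inc) & mask  # compose the current step with itself
        mult = (mult * mult) & mask
        delta >>= 1
    return (acc_mult * state + acc_inc) & mask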

OK, this is embarrassing: it was half a _billion_, not half a _million_. A single letter can make a big difference. I apologize for the slip.

But, as I said earlier, this is the probability of hitting exactly the same starting state, not the probability of a significant overlap between correlated subsequences. Personally, I prefer to use PRNGs without correlated subsequences, since there are plenty of them, but, as you rightly say, the decision is only yours.

Fixing the scrambling function so that it has better mixing properties sounds like a perfectly reasonable solution.

My post was just meant to be a clarification of the structural differences between PCG and SplitMix, since a previous post claimed they had similar problems, and I don't think that is a correct statement. You cannot write a program like http://prng.di.unimi.it/corrpcgnumpy.c for SplitMix.

@rkern, you asked:

@imneme This seems to be related to the issues with jumping by a multiple of a large power of 2. When I construct a pair of PCG64 instances with the same increment and sharing the lower n bits of state, the distance that I calculate between the two is a multiple of 1 << n. Your stronger DXSM output function appears to resolve this manifestation as well. I've tested a pair of PCG64DXSM instances that share an increment and the lower 64 bits of state out to 2 TiB without issue.

Thanks for finding and linking back to the discussion thread from last year. Yes, as Sebastiano notes in his response,

Fixing the scrambling function so that it has better mixing properties sounds like a perfectly reasonable solution.

XSL-RR is at the weaker end of things. In contrast, both the original RXS-M output function from the PCG paper, and the new DXSM output function do more in the way of scrambling, and so don't show these kinds of issues. DXSM (added to the PCG source in this commit last year) was specifically designed to be stronger than XSL-RR but have similar time performance (c.f., RXS-M, which is slower). I tested DXSM fairly hard last year, but 67 days into the run we had an extended power outage that took down the server (the UPS battery drained) and ended the test run, but at that point it had proven itself pretty well in both normal testing (128 TB of output tested) and jumps of 2^48 (64 TB of output tested, since that runs slower).
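For concreteness, a sketch of the DXSM output step in Python (a "double xorshift-multiply"; the cheap-multiplier constant is the one in the PCG source, quoted here as an assumption of this sketch, and the 128-bit LCG state update is not shown):

MASK64 = (1 << 64) - 1
CHEAP_MULT = 0xDA942042E4DD58B5  # PCG's 64-bit "cheap multiplier" (assumed)

def dxsm(state128):
    hi = (state128 >> 64) & MASK64
    lo = (state128 | 1) & MASK64  # force odd so the final multiply is invertible
    hi ^= hi >> 32
    hi = (hi * CHEAP_MULT) & MASK64
    hi ^= hi >> 48
    return (hi * lo) & MASK64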

If, even without designing DXSM, RXS-M would have taken care of the issue, one question is why I ever used the weaker XSL-RR permutation instead — why not always use very strong bit-scrambling in the output function? The answer is that it basically comes down to cycles. Speed matters to people, so you try to avoid doing a lot more scrambling than you need to.

This is an issue Sebastiano is familiar with, because his approach and mine have much in common. We each take a long established approach that would fail modern statistical tests (LCGs in my case, and Marsaglia's XorShift LFSRs in his) and add a scrambling output function to redeem it. We both strive to make that output function cheap, and we've both been caught out a little bit where deficiencies in the underlying generator that we're trying to mask with our output function nevertheless show through. In his case, it's linearity issues which have shown through.

But in somewhat recent work that pleased me a lot, he's also shown how many LFSR-based designs have hamming-weight issues (reflecting a longstanding concern of my own) that are inadequately masked by their output functions. His own generators pass his test, so that's something, but back in 2018 when I was looking at his new xoshiro PRNG it seemed to me that hamming-weight issues from the underlying generator did make it through his output function. He's since revised xoshiro with a new output function, and I hope that does the trick (he made some other changes too, so perhaps that one also fixes the repeats issue, highlighted by this test program?).

As for his correlation programs, back in 2018 when he put a critique of PCG on his website that included programs he'd written with various issues (like contrived seeding, etc.), I wrote a response that contained a bunch of similar programs for other long-established PRNGs, including corrsplitmix2.c, which creates correlated streams in SplitMix. I'm not quite sure what Sebastiano means when he says it can't be done, but I will admit I haven't had a chance to look closely at his test program to see if his new one is substantively different from the ones he wrote a couple of years ago.

Please forgive my lack of understanding, but could I ask someone for a summary conclusion at this stage? Are there realistic circumstances in which the default PCG is unsafe? What are the arguments for switching the default?

The easy path is to add some documentation about high- (ultra-high-?) dimensional applications and the probability of correlated sequences.

A harder path would be to replace the output function, which would break the stream. I'm not sure how strong a promise default_rng makes. The docs don't seem to warn that this may change, so it would probably need a deprecation cycle to change. This would require adding the new output function as a standalone bit generator (or making it configurable from PCG64, which would be more sensible) and then warning users that it will be changing after year XXXX/release Y.ZZ.

The hardest path would be to find a new default rng. It wasn't easy the first time and I don't think anything has changed to move the needle in any particular direction over the past 18 months.

I opened gh-16493 about the default_rng() modification, I do not think it is related to this issue, not even sure we have to discuss it, we probably already had set down the rules long ago and I just don't remember.


I do not claim to understand this discussion fully, but it seems there are two things to figure out:

  1. Being certain that we have enough headroom, I have to trust Robert on this and right now it sounds to me like it should be fine to the best of our knowledge? (I.e. the probability of an actual collision is probably embarrassingly low even in environments magnitudes larger than anything NumPy may be used in? Whether or not it may warrant changing the default in the future is a different issue.)
  2. We state:

    and supports [...] as well as :math:`2^{127}` streams

    which, not knowing exactly where the number comes from, sounds like it may be a slight overstatement on our part, and we could consider adjusting it slightly to be perfectly correct? Or link to some external resource giving additional details?

The easiest thing to do right now would be to add a PCG64DXSM BitGenerator, the variant of PCG64 with the "cheap multiplier" and stronger DXSM output function. I think everyone agrees that that's a step up from the XSL-RR output function we have now in our PCG64 implementation, performing better statistically without damage to runtime performance. It's a straightforward upgrade in the niche that PCG64 serves for the BitGenerators that we provide. I think we should add it alongside PCG64.

Incidentally, I prefer that it be a separate named BitGenerator rather than an option to the PCG64 constructor. Such options are great in randomgen, whose purpose is to provide a lot of different algorithms and variants, but for numpy, I think we want our selections to be "grab-and-go" as much as we can.

I don't think we really settled the policy for making changes to what default_rng() provides. When I proposed it, I brought up the notion that one of the reasons I preferred it to just putting its functionality into the Generator() constructor was that we could deprecate and move to a differently-named function if we needed to. However, at that time we were considering that default_rng() might need to expose a lot of details of the underlying BitGenerator, which we subsequently avoided. Because PCG64DXSM exposes the same API (.jumped() in particular) as PCG64, the only consideration we would have is that using it as the new default would change the bitstream. I think it would be reasonable for us to follow the same timeline as any other modification to the stream coming from Generator methods per NEP 19 (i.e. on X.Y.0 feature releases). We can choose to be a little more cautious, if we want, and first expose PCG64DXSM as an available BitGenerator in 1.20.0 and document (but not warn(), which would be too noisy and to no effect) that default_rng() will change to using it in 1.21.0.

As part of adding the new BG it would be good to update the notes to PCG64 to start serving as a guide and to provide a rationale for preferring the newer variant.

  1. Being certain that we have enough headroom, I have to trust Robert on this and right now it sounds to me like it should be fine to the best of our knowledge? (I.e. the probability of an actual collision is probably embarrassingly low even in environments magnitudes larger than anything NumPy may be used in? Whether or not it may warrant changing the default in the future is a different issue.)

That's probably a little too glib. It depends on how many streams, how much data you pull from each stream, and what your risk tolerance for a birthday collision is. I haven't gotten around to crunching that math into a comprehensible paragraph with easy-to-follow recommendations yet, which is why I haven't revisited this Github issue in a while. I don't think it's a hair-on-fire issue that needs to be fixed _right now_, though.

I'll write something longer later, but as I see it in this thread we've been retreading the ground we went over last year. Nothing has changed other than Sebastiano discovered that NumPy had shipped PCG. The analysis from the NumPy team last year was more in-depth, and considered more plausible scenarios.

My preference would be to upgrade the default as quickly as reasonably possible, just to reduce confusion. I mean, not wait for a deprecation cycle.

@imneme - thanks much - I'd find a longer thing very useful.

Probably the best blast from the past post about it is this one. I think it's certainly worth a read. You can scroll back up in the thread from there to see us talking about these issues. It was about PCG 32.

I had in my head what I wanted to say, but I find looking at the posts from a year ago, I've said it all already, both here and elsewhere (my blog (2017), reddit, my blog again (2018), and in NumPy discussions, etc.)

About streams and their self-similarity (for which @rkern wrote a stream-dependence tester), I wrote last year:

As noted in my blog post that was mentioned earlier, PCG's streams have a lot in common with SplitMix's.

Regarding @mdickinson's graph, for _every_ PRNG that allows you to seed its entire state, including counter-based cryptographic ones, we can contrive seedings where we'd have PRNGs whose outputs were correlated in some way (the easiest way to do so is to make PRNG states that are a short distance apart, but often we can do other things based on an understanding of how they work). And although PRNGs that don't allow full-state seeding can avoid this issue, doing so just introduces a new one, only providing practical access to a tiny fraction of their possible states.

The right way to think of streams is just as more random state that needs to be seeded. Using small values like 1, 2, 3 is generally a bad idea for any seeding purpose for _any_ PRNG (because if everyone favors these seeds, their corresponding initial sequences will be overrepresented).

We can choose not to call it a stream at all and just call it state. That's what Marsaglia did in XorWow. If you look at the code, the Weyl-sequence counter doesn't interact with the rest of the state at all and, as with LCGs, variations in its initial value really just amount to an added constant.

SplitMix's, PCG's and XorWow's streams are what we might call “stupid” streams. They constitute a trivial reparameterization of the generator. There is value in this, however. Suppose that without streams, our PRNG would have an interesting close repeat of 42, where 42 crops up several times in quick succession and only does this for 42 and no other number. With stupid “just an increment” or “just an xor” streams, we'll actually avoid hardwiring the weird repeat to 42; all numbers have a stream in which they are the weird repeat. (For this reason, the fix I'd apply to repair the close-repeat problems in Xoshiro 256 is to mix in a Weyl sequence.)

(I then went into more depth in this comment and in this one.)

I'd say that the key take-away is that for almost _any_ PRNG, with a bit of time and energy you can contrive pathological seedings. PCG is in a somewhat unusual position in that it has someone who enjoys working out plausible-looking seedings for PCG specifically that have a hand-crafted pathology (i.e., Sebastiano). As a result of that work, I've turned around and done the same for both his PRNGs and for other longstanding ones.

In general, you want to initialize the PRNG state with something that looks “kinda random”. That's pretty universal, even if people want to do otherwise. For example, for any LFSR PRNG (XorShift-style, Mersenne Twister, etc.), you must not initialize it with an all-zeros state because it'll just stay stuck there. But even states that are mostly zero are often problematic for LFSRs (a phenomenon known as zeroland, and why the C++ folks made seed_seq). If you want to play the “let's do some contrived seeding” game, it's not hard to create a collection of initializations for 100 LFSRs and have them all be 1K away from hitting zeroland. The contrived initializations will all look innocent enough, but they'll all hit this weird drop in hamming weight at the same time.

If you're using a PCG generator that was initialized with reasonable entropy, it's fine. If you want to initialize it with junk like 1, 2, 3, that's actually problematic with any PRNG. And with PCG, 99.9% of the time, even using something kinda junky would be fine. It doesn't have anything like the kind of issues that LFSRs have.

But DXSM is certainly stronger. I think it's better or I wouldn't have made it, but it's worth having some perspective and realizing that in practice users aren't going to run into problems with the classic PCG64 implementation.

I'd like to separate out the criticism/defense of PCG64's streams (via the LCG increment) from the present discussion. While there's a certain duality involved due to the mathematics of the LCG, it's not the core issue that was originally brought up here. Also, there's more detail here to be considered than was present in Sebastiano's original critique, your response, or the old mega-thread. Perhaps the connection is more obvious to the experts who have spent more time on the math, but the practical consequences are now clearer to me, at least.

I'd say that the key take-away is that for almost _any_ PRNG, with a bit of time and energy you can contrive pathological seedings.

Granted, but it's not the binary can/can't that drives the decision in front of me. If I draw too many numbers from a finite PRNG, eventually PractRand will suss it out. That binary fact doesn't invalidate that PRNG algorithm. Moving away from that binary and establishing the concept of headroom was one of the things that I really appreciated about the original PCG paper. Given an adversarially-generated pathology, we can take a look at how often that pathology could arise randomly from good entropy-seeding. I want to quantify that, and turn it into practical advice for users.

Given two states that share the lower 58 bits and the same increment (we'll put a pin in that), interleaving PCG64 XSL-RR instances from those states demonstrates practically observable failures in PractRand at around 32 GiB. I think it's reasonable to want to avoid that. So let's take that as our benchmark and look at how often that arises with good entropy-seeding. Fortunately, this adversarial scheme is amenable to probabilistic analysis (not all are so friendly). For n instances, the probability of any 2 sharing the same lower 58 bits is n**2 / 2**58, give or take a factor of 2 for double-counting. So at half a billion instances, odds are good that there is one such pairing that would fail PractRand if interleaved. Half a billion is a lot! In my judgment, we'll probably never see a numpy program that tries to create that many PCG64 instances. numpy would likely be the wrong tool, then.
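That estimate is easy to sanity-check numerically (illustrative arithmetic only, using the standard pairwise union bound):

def p_collision(n, bits=58):
    # Union bound over n*(n-1)/2 pairs on sharing the same lower `bits` bits.
    return min(1.0, n * (n - 1) / 2 / 2.0**bits)

for n in (10**6, 10**7, 10**8, 5 * 10**8):
    print(f"{n:>11,d} instances: P ~= {p_collision(n):.2g}")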

I think it's also reasonable to want to avoid initial states whose subsequent draws will _cross_ any of the lower-58-bit collision states of the other initial states. I'm still trying to think through the logic on that, but I think the length affects the probability linearly rather than quadratically. If I'm right, and I want to draw 16 GiB from each instance (2**28 draws), which is how much we drew from each of the pair that showed PractRand failures, then I can only work with about 2**15 instances, or about 32k, before it becomes quite likely to observe a crossing. That's still quite a lot of instances for Python! And the total amount of data generated is about half a petabyte, which is a lot! But it's on the horizon of practicality, and if I want to keep the probability low, not just below half, I have to go lower on one of those. I'm not particularly concerned by these numbers; I don't think any real numpy programs are likely to run into problems using PCG64 with the XSL-RR output function. But some applications may start to get close (large distributed reinforcement learning runs, for example).

Let's take that increment pin out and address it. I think it's fair to say that with the XSL-RR output function, also entropy-seeding the increment in addition to the state does not change this particular analysis. It seems that for any given pair of entropy-seeded increments, there's the same number of practically-colliding states. The concrete procedure for deliberately constructing those states looks more complicated than bashing in the same lower 58 bits, but it seems like the number of colliding states is the same, so the probability calculations remain the same. This isn't intrinsic to the PCG scheme in general. The DXSM output function appears to be strong enough that changing the increment (even with a simple +2) seems to be sufficient to resist even the worst-case state for the underlying LCG (when the distance metric gives 0), at least as far as I've bothered to test with PractRand.

I want to end by reiterating what we all seem to be in perfect agreement about: PCG64DXSM is a good idea. If nothing else, its improved statistical properties simplify the mental models that I feel compelled to document, and anything that means I have to write less documentation is good in my book.

Streams are still somewhat relevant because the issue only shows up if we have the generators on the same stream.

But under what circumstances would they have the same lower 58 bits and be on the same stream? Is there a use case where this would happen?

The one somewhat realistic case I know of is the one we talked about last year (when we talked about jumped), and I talked about in this post which I linked to earlier.

Streams are still somewhat relevant because the issue only shows up if we have the generators on the same stream.

Unfortunately, that's not the case for XSL-RR. Let's consider two PCG64 XSL-RR instances. We entropy-seed the increments arbitrarily and entropy-seed one of the states. We can construct 2**70 bad states for the other PCG64 instance that fail PractRand in the same way as the same-lower-58-bits-of-state, same-increment failure. It's just more complicated to do than in the same-increment case. Instead of sharing the lower 58 bits of the first state, the bad state shares the lower 58 bits of the state that is at distance 0 (according to your LCG distance measure) from the first instance, accounting for the increments. I have a constructive proof (Python code), but I have to go to bed now and will clean it up tomorrow.

@rkern, good point. I'll admit, I haven't tested that scenario to see how it fares.

I'd say that the key take-away is that for almost _any_ PRNG, with a bit of time and energy you can contrive pathological seedings. PCG is in a somewhat unusual position in that it has someone who enjoys working out plausible-looking seedings for PCG specifically that have a hand-crafted pathology (i.e., Sebastiano). As a result of that work, I've turned around and done the same for both his PRNGs and for other longstanding ones.

As I have already remarked, this is false. I know of no example of a pair of correlated, non-overlapping sequences, say, from xoshiro256++ as you can find easily within PCG.

PRNGs which quickly mix their entire state do not have this problem. If you can provide a program generating two non-overlapping sequences from xoshiro256++ that are correlated, like the examples I posted here, please do so.

As for his correlation programs, back in 2018 when he put a critique of PCG on his website that included programs he'd written with various issues (like contrived seeding, etc.), I wrote a response that contained a bunch of similar programs for other long-established PRNGs, including corrsplitmix2.c, which creates correlated streams in SplitMix. I'm not quite sure what Sebastiano means when he says it can't be done, but I will admit I haven't had a chance to look closely at his test program to see if his new one is substantively different from the ones he wrote a couple of years ago.

The program quoted above _chooses the streams_. It is obviously easy to write such a program.

But that has nothing to do with PCG's problems. The program I provided lets the _user_ choose the streams, and then shows correlation.

I invite @imneme, again, to provide a program for SplitMix in which the user can select two different streams arbitrarily and an arbitrary initial state for the first generator, and _then_ the program finds a correlated sequence in the other generator, as http://prng.di.unimi.it/corrpcgnumpy.c does.

Letting the user choose the stream arbitrarily shows a much stronger form of correlation.

As I have already remarked, this is false. I know of no example of a pair of correlated, non-overlapping sequences, say, from xoshiro256++ as you can find easily within PCG.

It seems like we're talking past each other here. I didn't say I could come up with correlated, non-overlapping sequences for any PRNG; I said I could come up with pathological seedings in general, as shown with the various correlation programs I'd written previously, and others such as the bad-repeats demonstration program for Xoshiro**.

Also, PRNGs that don't mix their entire state have a long history, including XorWow, the generators in numerical recipes, etc. Sebastiano's argument represents point of view but his argument would say that somehow Marsaglia made XorShift _worse_ in XorWow by adding a Weyl sequence, since it creates a vast number of similar generators.

It seems like we're talking past each other here. I didn't say I could come up with correlated, non-overlapping sequences for any PRNG; I said I could come up with pathological seedings in general, as shown with the various correlation programs I'd written previously, and others such as the bad-repeats demonstration program for Xoshiro**.

Please try to keep the discussion at a technical level. "Pathological" has no mathematical meaning.

The technically correct way to check for self-correlation is to find two seeds yielding two non-overlapping sequences (non-overlapping for the duration of the test—they will unavoidably overlap if you go far enough), interleave them, and pass the result to a battery of tests.

If you consider two sequences that overlap, they will be correlated for every generator, even a crypto one, simply because the same outputs will appear twice after the sequences overlap, and any reasonable test will pick that up.

"Pathological seeding" using overlapping sequences is a trivial task for every generator (whatever "pathological" means).

Once again, since you claim to have found correlation similar to PCG (in which the sequences are not overlapping, as the test shows) in other generators, can you provide a pair of correlated, non-overlapping sequences, say, from xoshiro256++ or SFC64?

The technically correct way to check for self-correlation is to find two seeds yielding two non-overlapping sequences (non-overlapping for the duration of the test—they will unavoidably overlap if you go far enough), interleave them, and pass the result to a battery of tests.

Can you point to the literature for this definition of correlation so I can make sure I stay “technically correct” about that, too?

Sebastiano, you keep wanting me to answer a challenge set on your terms. What you're pointing out relates to intrinsic properties of LCGs, where there is self-similarity. You won't find the same issue on a chaotic PRNG or an LFSR-based one.

But for other PRNGs there will be other weak spots.

LFSRs have zeroland, bad states, hamming-weight issues, linearity, and as we learned with your attempts with xoshiro, sometimes other weirdness like strange problems with repeats.

Chaotic PRNGs have the risk of short cycles (although ones with a counter avoid that — Weyl sequences FTW!) and their intrinsic bias.

If the sequences overlap, as I wrote, the test will _always_ fail. You do not need literature to understand that a test that _always fails_ is not a test.

Once again, since you claim to have found correlation similar to PCG (in which the sequences are not overlapping, as the test shows) in other generators, can you provide a pair of correlated, non-overlapping sequences, say, from xoshiro256++ or SFC64?

You really seem to be dodging the question. It would be very easy for you, following your claims, to provide such evidence, if you had any.

The easiest thing to do right now would be to add a PCG64DXSM BitGenerator, the variant of PCG64 with the "cheap multiplier" and stronger DXSM output function. I think everyone agrees that that's a step up from the XSL-RR output function we have now in our PCG64 implementation, performing better statistically without damage to runtime performance. It's a straightforward upgrade in the niche that PCG64 serves for the BitGenerators that we provide. I think we should add it alongside PCG64.

Note that 64-bit "cheap multipliers" have provable defects. This has been known for a long time:

W. Hörmann and G. Derflinger, A portable random number generator well suited for the
rejection method, ACM Trans. Math. Softw. 19 (1993), no. 4, 489–495.

In general, multipliers smaller than the square root of the modulus have inherent limits to their spectral score f₂.

The limit can be easily overcome by using a 65-bit multiplier, which the compiler will turn into just an additional "add" operation, probably without even changing the speed of the generator.

Guy Steele and I worked on the issue a bit and published tables of spectral scores for cheap multipliers of various sizes: https://arxiv.org/pdf/2001.05304.pdf . The larger the better, but there is a provable gap from 64 to 65 bits (for an LCG with 128 bits of state).

For example, from Table 7 of the paper you get 0x1d605bbb58c8abbfd, which has f₂ score 0.9919. No 64-bit multiplier can go beyond 0.9306 (Theorem 4.1 in the paper).

After the mix and everything the improvement in the f₂ score might go entirely unnoticed from a statistical viewpoint. But considering the large improvement you get for the most relevant dimension with just an additional add operation, I think (well, we think, or we wouldn't have written the paper) it is worth the effort.
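To make the trick concrete: with a 65-bit multiplier a = 2^64 + b, the product a·s reduces to one 64×128 multiply plus a shifted add (a sketch using the Table 7 constant quoted above):

MASK128 = (1 << 128) - 1
A65 = 0x1D605BBB58C8ABBFD  # 65-bit multiplier from Table 7, f2 score 0.9919
B = A65 - (1 << 64)        # its low 64 bits

def lcg_step_65(s, c):
    # (2**64 + B)*s + c == (s << 64) + B*s + c  (mod 2**128)
    return ((s << 64) + B * s + c) & MASK128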

Sebastiano, you keep wanting me to answer a challenge set on your terms. What you're pointing out relates to intrinsic properties of LCGs, where there is self-similarity. You won't find the same issue on a chaotic PRNG or an LFSR-based one.

Wow, it took a while to get there!

LFSRs have zeroland, bad states, hamming-weight issues, linearity, and as we learned with your attempts with xoshiro, sometimes other weirdness like strange problems with repeats.

I totally agree—that's why you have to scramble them. Note that LFSRs and F₂-linear generators are different things; related, but different.

"Strange problems with repeats" is, as usual, a non-technical term that I cannot comment upon.

Chaotic PRNGs have the risk of short cycles (although ones with a counter avoid that — Weyl sequences FTW!) and their intrinsic bias.

[Update: I missed the parenthesized counter observation, so I'm updating my comment.]

Yes, SFC64 does not have such issues (it uses a counter), so I would not generalize to the entire category. There are carefully designed chaotic generators that have provable, large shortest cycle length.

"Strange problems with repeats" is, as usual, a non-technical term that I cannot comment upon.

It seems odd to not be able to comment because I didn't use the right jargon — run this program and then feel free to enlighten me with the best way to describe the issue in the proper jargon and then provide whatever commentary seems appropriate. I would have imagined that you and David Blackman would have discussed the issue when it first came to light because I corresponded with him about it, but I've never seen you comment on it.

Discussion of PRNGs not in numpy is off-topic. Please use your own forums to continue that discussion. Thank you.

@rkern - that seems a bit strict as a criterion. If there's a deficiency in the Numpy implementation that is not shared by other implementations, that seems reasonable to discuss.

I can confirm that the exchange I'm referring to is not helping me make the decisions that are in front of us in this issue. Until we make progress on that, I need the conversation to stay focused.

I think it is helpful to understand the broader context. Sebastiano has a bit of a thing about PCG and has been railing against it for years. I think these days some people might look at us both and roll their eyes and say “you're both as bad as each other” because I've also critiqued his PRNGs, but actually I only did so after he went around making claims that I was trying to hide something by never talking about his stuff (when in reality, I just didn't have the time/inclination—in fact, I'd just assumed they were fine).

His critiques are useful though, and I'm delighted he's chosen to spend so much of his life thinking about my work, but it's important to realize his tests are adversarial in nature. He uses knowledge of the structure of PCG to contrive arrangements that can be fed into RNG testers and fail tests.

Given how scary that can look, it seems reasonable to show that a similar adversarial approach would also trip up numerous other generators and that many of the concerns he raises about PCG would apply to other generation schemes too, as I've observed about generation schemes like XorWow, and I often use SplitMix as an example, as I'll do below. (None of us are particularly invested in SplitMix one way or the other, I'd imagine.)

We can be super scary about SplitMix, for example, and show that with the default stream constant, if we look at every 35185-th output, it fails in a PRNG test suite. Oh noes! This is because internally it's incrementing a counter (Weyl sequence!) by 0x9e3779b97f4a7c15 (based on φ, the golden ratio), but 35185 * 0x9e3779b97f4a7c15 = 0x86a100000c480245, which only has 14 bits set and a large swath of nothing in the middle. Or if we looked at every 360998717-th output, we get down to being equivalent to an addition to the internal state of 0x48620000800401, which is only 8 bits being added and again something hard for its output function to fully mask.

We could continue scare-mongering about SplitMix and say look, what if I have two streams, one with the additive constant 0x9e3779b97f4a7c15 and other with 0xdaa66d2c7ddf743f, we'd see flaws if we fed this into a PRNG test suite!!! But that's because the second one is contrived to be just 3x the other one.

And finally, if someone said “I'm going to give you both streams, do something scary with that!”, and let's say theirs were based on π (0x243f6a8885a308d3) and _e_ (0xb7e151628aed2a6b), we can say, sure, let's have some more scare-mongering and take every 6561221343-th item from the Pi stream and intermix it with every 6663276199-th item from the E stream and, lo and behold, they produce two identical sequences. And _worse_, I then go on to show that for every jump on stream a, there is a matching jump on stream b that gives the same output, so there are actually 2^64 ways in which they correlate!!! (And we can do this for any two streams; there was nothing special about π and _e_.)
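(The arithmetic in the single-stream SplitMix examples above is easy to verify programmatically; constants as quoted in the comment:)

MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15
for lag in (35185, 360998717):
    step = (lag * GOLDEN) & MASK64
    # Expect 0x86a100000c480245 (14 bits set) and 0x48620000800401
    # (8 bits set), matching the claims above.
    print(lag, hex(step), bin(step).count("1"))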

Returning to PCG, Sebastiano's test relies on the two PCG64 XSH RR generators being aligned precisely so that matching outputs are interleaved. If we just advance one of the PRNGs by a small amount, breaking the perfect alignment just a tad, it becomes vastly harder to detect anything suspect.

A similar adversarial test in the other direction (putting a burden on Sebastiano) would be to provide two outputs from PCG64 XSH RR that meet his claim that they are correlated, but without telling him exactly how they are aligned (they're just in the right general neighborhood). His job would be to find the alignment and show that they are correlated.

Overall, I don't think it's an issue in practice with urgent fires to be put out, but on the other hand, the DXSM version is better as it was written last year to mitigate precisely these kinds of issues, and I'd be delighted to have you switch to it.

P.S. You can make magic Weyl additive constants from your favorite real number using this code:

WeylConst[r_,bits_] = BitOr[Floor[(r-Floor[r])*2^bits],1]

That's Mathematica; I'll leave the Python version as an exercise.
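(A direct Python translation, for convenience; equivalent to the Mathematica one-liner above:)

from math import floor

def weyl_const(r, bits):
    # Fractional part of r, scaled to `bits` bits, forced odd.
    return floor((r - floor(r)) * 2**bits) | 1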

Here's how I construct the lower-bit collisions for different increments.

Results with PCG64 XSL-RR and a lower-58-bit collision

❯ ./pcg64_correlations.py -m 58 | stdbuf -oL ./RNG_test stdin64 -tf 2 -te 1 -tlmaxonly -multithreaded
s0 = 0b01110010100110011101000110010010101111111001100011001011001011111001001110101010011101111101001101011000011100001111111111100001
s1 = 0b10110001011001100111100010000110101110011010101010011011010100011001011111001100010001101001001011010010110101001011101111111100
dist = 0x2eb6ec432b0ea0f4fc00000000000000
[
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 152330663589051481538402839025803132897,
            "inc": 228410650821285501905570422998802152525
        },
        "has_uint32": 0,
        "uinteger": 0
    },
    {
        "bit_generator": "PCG64",
        "state": {
            "state": 235805414096687854712168706130903874556,
            "inc": 70910205337619270663569052684874994465
        },
        "has_uint32": 0,
        "uinteger": 0
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x12d551b8
test set = expanded, folding = extra

rng=RNG_stdin64, seed=0x12d551b8
length= 128 megabytes (2^27 bytes), time= 2.8 seconds
  no anomalies in 891 test result(s)

rng=RNG_stdin64, seed=0x12d551b8
length= 256 megabytes (2^28 bytes), time= 9.4 seconds
  no anomalies in 938 test result(s)

rng=RNG_stdin64, seed=0x12d551b8
length= 512 megabytes (2^29 bytes), time= 18.1 seconds
  no anomalies in 985 test result(s)

rng=RNG_stdin64, seed=0x12d551b8
length= 1 gigabyte (2^30 bytes), time= 31.2 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]FPF-14+6/16:cross         R=  +4.9  p =  1.7e-4   unusual          
  [Low4/16]FPF-14+6/16:all          R=  +8.4  p =  2.3e-7   very suspicious  
  [Low4/16]FPF-14+6/16:all2         R=  +8.3  p =  8.1e-5   unusual          
  [Low8/32]FPF-14+6/32:all          R=  +6.3  p =  2.1e-5   mildly suspicious
  [Low8/32]FPF-14+6/16:all          R=  +5.7  p =  8.0e-5   unusual          
  ...and 1034 test result(s) without anomalies

rng=RNG_stdin64, seed=0x12d551b8
length= 2 gigabytes (2^31 bytes), time= 52.7 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]FPF-14+6/32:all          R=  +7.4  p =  2.0e-6   suspicious       
  [Low4/16]FPF-14+6/16:(0,14-0)     R=  +7.7  p =  9.4e-7   unusual          
  [Low4/16]FPF-14+6/16:all          R=  +8.0  p =  5.9e-7   suspicious       
  [Low4/16]FPF-14+6/16:all2         R= +12.2  p =  2.1e-6   mildly suspicious
  [Low4/16]FPF-14+6/4:(0,14-0)      R=  +7.9  p =  6.3e-7   mildly suspicious
  [Low4/16]FPF-14+6/4:all           R=  +5.8  p =  6.7e-5   unusual          
  [Low4/16]FPF-14+6/4:all2          R= +11.5  p =  3.1e-6   mildly suspicious
  [Low8/32]FPF-14+6/32:(0,14-0)     R=  +7.8  p =  8.4e-7   unusual          
  [Low8/32]FPF-14+6/32:all          R=  +7.3  p =  2.3e-6   suspicious       
  [Low8/32]FPF-14+6/32:all2         R= +14.3  p =  3.8e-7   suspicious       
  [Low8/32]FPF-14+6/16:(0,14-0)     R=  +7.7  p =  8.8e-7   unusual          
  [Low8/32]FPF-14+6/16:(1,14-0)     R=  +7.7  p =  9.3e-7   unusual          
  [Low8/32]FPF-14+6/16:all          R=  +6.9  p =  5.3e-6   mildly suspicious
  [Low8/32]FPF-14+6/16:all2         R= +18.3  p =  8.0e-9   very suspicious  
  ...and 1078 test result(s) without anomalies

rng=RNG_stdin64, seed=0x12d551b8
length= 4 gigabytes (2^32 bytes), time= 90.2 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]BCFN_FF(2+0):freq         R= +14.8  p~=   6e-18     FAIL !         
  [Low1/8]BCFN_FF(2+1):freq         R=  +7.4  p~=   1e-6    mildly suspicious
  [Low1/8]FPF-14+6/16:cross         R=  +8.4  p =  2.1e-7   very suspicious  
  [Low4/16]FPF-14+6/32:(0,14-0)     R=  +8.9  p =  8.1e-8   mildly suspicious
  [Low4/16]FPF-14+6/32:(1,14-0)     R=  +8.5  p =  1.9e-7   mildly suspicious
  [Low4/16]FPF-14+6/32:all          R=  +9.4  p =  2.4e-8   very suspicious  
  [Low4/16]FPF-14+6/32:all2         R= +23.9  p =  5.2e-11   VERY SUSPICIOUS 
  [Low4/16]FPF-14+6/16:(0,14-0)     R= +13.8  p =  2.2e-12   VERY SUSPICIOUS 
  [Low4/16]FPF-14+6/16:(1,14-0)     R= +10.0  p =  7.3e-9   suspicious       
  [Low4/16]FPF-14+6/16:all          R= +12.1  p =  8.0e-11   VERY SUSPICIOUS 
  [Low4/16]FPF-14+6/16:all2         R= +52.5  p =  1.3e-22    FAIL !!        
  [Low4/16]FPF-14+6/4:(0,14-0)      R= +12.2  p =  7.0e-11   VERY SUSPICIOUS 
  [Low4/16]FPF-14+6/4:all           R=  +7.1  p =  3.7e-6   mildly suspicious
  [Low4/16]FPF-14+6/4:all2          R= +29.8  p =  7.1e-14    FAIL           
  [Low4/16]FPF-14+6/4:cross         R=  +5.3  p =  7.8e-5   unusual          
  [Low4/32]FPF-14+6/32:(0,14-0)     R=  +7.6  p =  1.3e-6   unusual          
  [Low4/32]FPF-14+6/32:all          R=  +6.0  p =  4.4e-5   unusual          
  [Low4/32]FPF-14+6/32:all2         R=  +9.4  p =  2.9e-5   unusual          
  [Low4/32]FPF-14+6/16:(0,14-0)     R=  +7.3  p =  2.5e-6   unusual          
  [Low4/32]FPF-14+6/16:all          R=  +6.5  p =  1.4e-5   mildly suspicious
  [Low4/32]FPF-14+6/16:all2         R=  +8.2  p =  8.0e-5   unusual          
  [Low8/32]FPF-14+6/32:(0,14-0)     R= +17.2  p =  1.7e-15    FAIL           
  [Low8/32]FPF-14+6/32:(1,14-0)     R= +12.7  p =  2.3e-11   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/32:all          R= +15.3  p =  7.9e-14    FAIL           
  [Low8/32]FPF-14+6/32:all2         R= +86.1  p =  1.2e-35    FAIL !!!       
  [Low8/32]FPF-14+6/16:(0,14-0)     R= +16.8  p =  3.5e-15    FAIL           
  [Low8/32]FPF-14+6/16:(1,14-0)     R= +12.2  p =  6.6e-11   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/16:all          R= +13.1  p =  8.9e-12   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/16:all2         R= +82.1  p =  1.7e-34    FAIL !!!       
  [Low8/32]FPF-14+6/4:(0,14-0)      R= +12.8  p =  2.0e-11   VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/4:(1,14-0)      R=  +9.4  p =  2.5e-8   suspicious       
  [Low8/32]FPF-14+6/4:all           R= +10.5  p =  2.2e-9    VERY SUSPICIOUS 
  [Low8/32]FPF-14+6/4:all2          R= +42.0  p =  5.8e-19    FAIL !         
  ...and 1118 test result(s) without anomalies

Results with PCG64 DXSM and a lower-64-bit collision (to provoke problems faster, though I see none)

❯ ./pcg64_correlations.py -m 64 --dxsm | stdbuf -oL ./RNG_test stdin64 -tf 2 -te 1 -tlmaxonly -multithreaded
s0 = 0b10001000010110111101010101010101111100100011011111011111011111001011110101111100101101101100110101110001101101111111010101111111
s1 = 0b11000101110100011001011000001110100001001111001001100101010000101100011001010111011001100000010010011100101110001110101000011100
dist = 0x3a26b19c91e6da1d0000000000000000
[
    {
        "bit_generator": "PCG64DXSM",
        "state": {
            "state": 181251833403477538233003277050491434367,
            "inc": 46073632738916603716779705377640239269
        },
        "has_uint32": 0,
        "uinteger": 0
    },
    {
        "bit_generator": "PCG64DXSM",
        "state": {
            "state": 262946148724842088422233355148768897564,
            "inc": 125105549038853892415237434774494719583
        },
        "has_uint32": 0,
        "uinteger": 0
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin64, seed = 0x85cea9
test set = expanded, folding = extra

rng=RNG_stdin64, seed=0x85cea9
length= 128 megabytes (2^27 bytes), time= 2.6 seconds
  no anomalies in 891 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 256 megabytes (2^28 bytes), time= 9.4 seconds
  no anomalies in 938 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 512 megabytes (2^29 bytes), time= 18.5 seconds
  no anomalies in 985 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 1 gigabyte (2^30 bytes), time= 32.3 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/32]BCFN_FF(2+3,13-3,T)      R=  -8.3  p =1-9.5e-5   unusual          
  ...and 1035 test result(s) without anomalies

rng=RNG_stdin64, seed=0x85cea9
length= 2 gigabytes (2^31 bytes), time= 55.8 seconds
  no anomalies in 1092 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 4 gigabytes (2^32 bytes), time= 93.1 seconds
  no anomalies in 1154 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 8 gigabytes (2^33 bytes), time= 175 seconds
  no anomalies in 1222 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 16 gigabytes (2^34 bytes), time= 326 seconds
  no anomalies in 1302 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 32 gigabytes (2^35 bytes), time= 594 seconds
  no anomalies in 1359 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 64 gigabytes (2^36 bytes), time= 1194 seconds
  no anomalies in 1434 test result(s)

rng=RNG_stdin64, seed=0x85cea9
length= 128 gigabytes (2^37 bytes), time= 2334 seconds
  no anomalies in 1506 test result(s)
...

@rkern, thanks for sharing your code. It'd be instructive to add an option that applies a skew, so that the interleaved output fed into the tester is not perfectly aligned. That's something I explored a little.

Yup, I've done that informally by inserting bg0.advance(N) for various N before the final return. I used the lower-64-bit collision to be sure I'd see something. Tiny shifts like 16 don't change the failure much, but even modest shifts like 128 push the failure out to 32 GiB.
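For anyone who wants to replicate that informally, here is a sketch of the harness (a hypothetical helper, not @rkern's actual script; the state dicts are the ones printed above):

    import sys
    import numpy as np

    def stream_interleaved(state0, state1, skew, chunk=4096):
        # interleave two PCG64 streams, the first shifted by `skew` steps
        bg0, bg1 = np.random.PCG64(), np.random.PCG64()
        bg0.state, bg1.state = state0, state1   # the colliding state dicts from above
        bg0.advance(skew)                       # break the perfect alignment
        buf = np.empty(2 * chunk, dtype=np.uint64)
        while True:
            buf[0::2] = bg0.random_raw(chunk)
            buf[1::2] = bg1.random_raw(chunk)
            sys.stdout.buffer.write(buf.tobytes())   # pipe into RNG_test stdin64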

It looks like we should add PCG64 DXSM as an optional BitGenerator and eventually make it the default. Do we have an implementation?

I think it is helpful to understand the broader context. Sebastiano has a bit of a thing about PCG and has been railing against it for years. I think these days some people might look at us both and roll their eyes and say “you're both as bad as each other” because I've also critiqued his PRNGs, but actually I only did so after he went around making claims that I was trying to hide something by never talking about his stuff (when in reality, I just didn't have the time/inclination—in fact, I'd just assumed they were fine).

I think these considerations are entirely inappropriate.

Insisting on attacking other people's work (e.g., SplitMix) without any merit and without any evidence is not going to make the PCG mess, or Numpy's generator, better. A better multiplier, or a better-designed scrambler, might help, instead.

I am still waiting for a test showing correlation in SplitMix when the user can choose the streams. Just to be clear, for Numpy's generator I proved a statement of the form

∀c ∀d ∀x ∃y correlated

where c, d are increments ("streams"), and x, y are initial states. In fact, there are 2^72 y's. That is, no matter how you choose c, d, and x, there are 2^72 y's showing correlation.

The alleged corresponding code you provided for SplitMix shows that

∃c ∃d ∃x ∃y correlated

That is, choosing adversarially c, d, x and y you can show correlation.

The difference in strength in the two statements is quite staggering. Trying to conflate the two statements is incorrect.

@vigna you have been warned twice now about our code of conduct, by @mattip and @rkern. Using language like "thrashing other people's work" and "Trying to conflate the two statements is pure FUD" is not okay. Consider this your last warning. Please change your tone or we _will_ ban you. Technical arguments are still welcome, anything else is not at this point.

I modified the message replacing those expressions with neutral ones. I still think that personally attacking another participant in the discussion ("Sebastiano has a bit of a thing about PCG and has been railing against it for years") is entirely inappropriate. I'm very surprised it is not for you.

For the third and last time, the discussion about SplitMix, in either direction, is not helping me in the slightest. I can understand why you think it provides needed context, or that you feel compelled to respond to the other, but please trust that I am telling you the truth that it is not providing me any information that helps me make a decision here. You both have your own websites. Use them.

I modified the message replacing those expressions with neutral ones.

Thank you.

I still think that personally attacking another participant in the discussion ("Sebastiano has a bit of a thing about PCG and has been railing against it for years") is entirely inappropriate. I'm very surprised it is not for you.

I'd prefer to not see that either indeed. However the tone of that message isn't nearly as bad.

I'd much appreciate it if you both could stick to constructive factual statements @vigna and @imneme.

OK. Let's start from scratch: you want a generator with some kind of streams, based on an LCG with power-of-two modulus for convenience and speed. The literature suggests that basing streams on LCG additive constants might lead you into problems (as it happens now), but let's assume that's what you want.

Why not take an LCG with 128 bits of state and a good multiplier (at least 65 bits) and perturb the upper bits using the mix function from SplitMix, which has been heavily tested in different applications (hashing, PRNGs, etc.), giving excellent results?

I'm pretty sure the difference in speed will be marginal. And you have some (statistical) guarantee that the result will depend on all bits, which is the issue here.

This seems to me more of a "standing on the shoulders of giants" approach than handcrafting mixing functions on a generator that has self-correlation problems.

@imneme What I could use is a blog post about DXSM that's easier to link to than this announcement comment in the old mega-issue. It doesn't have to be much more than what's in that comment, but including the current status of the testing that you mentioned here would be good. If you wanted to summarize some of the discussion from the mega-issue that led to that development, that would be useful, for sure, but not entirely necessary.

@vigna

Why not take an LCG with 128 bits of state and a good multiplier (at least 65 bits) and perturb the upper bits using the mix function from SplitMix, which has been heavily tested in different applications (hashing, PRNGs, etc.), giving excellent results?

I apologize if this sounds snarky (though it certainly will be pointed), but it is also sincere: I look forward to seeing the implementation, analysis, benchmarks, and PractRand results on your website or on arXiv. We are (reasonably informed) practitioners here, not PRNG researchers, and are not particularly well-equipped to carry out this suggestion. I can see the sense of it, but given the other constraints on my personal time, I don't have an inclination to spend the effort to take this from the suggestion to an implementation and analysis. If you are addressing this suggestion to numpy, we need PRNG researchers to do that work. If you are really addressing this suggestion to someone else, use your website.

The random Generator -> BitGenerator -> SeedSequence architecture in NumPy is meant to be pluggable. I think we are to the point in the discussion where we need someone to open a PR for a BitGenerator, so we can compare its practical attributes with the ones currently in NumPy. Once it becomes part of the project, we can continue to test it and may decide to make it the default. That decision would, I hope, be based on

  • lack of bias (and other criteria? I cede to the experts)
  • performance
  • probability of various types of stream collision via the normative interfaces we promote: using BitGenerator.spawn and SeedSequence.

Personally, this discussion lost me when it avoided discussing the merits of BitGenerators via code that uses the spawn interface. There is a reason we promote it as best practice, and I would hope the discussion of the future PR would focus on best practices for NumPy users.
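For concreteness, the promoted pattern is only a few lines; a minimal sketch:

    from numpy.random import Generator, PCG64, SeedSequence

    ss = SeedSequence(12345)        # one root seed for the whole job
    children = ss.spawn(8)          # independent child seeds, one per worker
    streams = [Generator(PCG64(child)) for child in children]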

Perhaps one of the conclusions here might be that we should only allow spawn as a method, since using jumped or advance may run afoul of good practices. A new issue or NEP focused on this might also be productive.

@mattip The lower-bits birthday collision that @vigna noted affects our SeedSequence.spawn() interface as well. Please be assured that any part of the discussion I have engaged in is relevant to proper usage of our APIs.

It only requires adding about 8 lines to pcg64.c with some #ifdef blocks to use @rkern 's preferred approach of completely separate generators. The pyx/pxd would be otherwise identical to the PCG64 class, only built with the correct defines (PCG_DXSM=1) and an updated docstring.

I'd probably be more explicit about it, especially for the emulated 128-bit math for those platforms that need it.

https://github.com/rkern/numpy/compare/v1.17.4...rkern%3Awip/pcg64-dxsm

It seemed easier than that to me, since it uses a "cheap" 64-bit multiplier. You can just add a new output mixer (that is invariant) and then ifdef around the final line of the random generator, which takes the output of the LCG and then applies the mixer.

https://github.com/bashtage/randomgen/commit/63e50a63f386b5fd725025f2199ff89454321f4c#diff-879bd64ee1e2b88fec97b5315cf77be1R115

One could even add in MurmurHash3 at this point, were one so inclined.

Are the if statements going to get compiled away? I don't think we want multiples of them inside hot loops.

Again, this comes down to the difference in purpose between randomgen and numpy. It makes sense in randomgen to make parameterized families, but in numpy, I don't think it's a good idea to entangle the implementations of a legacy BitGenerator from the active default BitGenerator. If we have to do any maintenance or refactorings for performance on one or the other, it's just going to make that effort worse rather than better.

Agree with Robert here. I have no qualms about putting a new bit generator in the 1.19.0 release, it would not change any current behavior.

@bashtage Also, note that pcg_cm_random_r() uses the pre-iterated state to output rather than the post-iterated state, so it's not going to be as simple to maintain the same codepath with #ifdef or if switches.

Are the if statements going to get compiled away? I don't think we want multiples of them inside hot loops.

No, in NumPy the if else should become something like

#if defined(PCG_DXSM)
    pcg_output_dxsm(state.high, state.low)
#else
    <old way>
#endif

These need to be defined separately in the uint128 version and in the fallback version, to handle manually splitting the uint128 into high and low words.

@bashtage Also, note that pcg_cm_random_r() uses the pre-iterated state to output rather than the post-iterated state, so it's not going to be as simple to maintain the same codepath with #ifdef or if switches.

Hmm, I tested against @imneme's reference implementation and got a 100% match on 1000 values using 2 distinct seeds:

https://github.com/bashtage/randomgen/blob/master/randomgen/src/pcg64/pcg_dxsm-test-data-gen.cpp

AFAICT (and I may be wrong)

https://github.com/imneme/pcg-cpp/blob/master/include/pcg_random.hpp#L174

and

https://github.com/imneme/pcg-cpp/blob/master/include/pcg_random.hpp#L1045

mean that the uint128 path is always using a cheap multiplier.

I'm not sure what you're trying to say there.

It is unclear what the canonical PCG64 DXSM is. In either case the output function uses only 64-bit operations. The version you have uses a 64-bit multiplier in another location to be even faster, and returns the pre-iterated state rather than the post-iterated one. setseq_dxsm_128_64 seems like the natural extension of the existing PCG64, since it only changes the output function.

Oh, I see. No, you used a different C++ generator than the one I implemented in C. I implemented the equivalent of cm_setseq_dxsm_128_64, which uses the "cheap multiplier" in the LCG iteration, not setseq_dxsm_128_64, which still uses the big multiplier in the LCG iteration. The "cheap multiplier" gets reused inside of the DXSM output function, but that's an orthogonal design axis.

Why not prefer setseq_dxsm_128_64?

@imneme said that she was going to eventually change the official pcg64 in the C++ version to point to cm_setseq_dxsm_128_64, not setseq_dxsm_128_64. The cheap multiplier offsets some of the extra cost of DXSM compared to XSL-RR. And I think it's the variant that she spent a couple of months testing.

Outputting the pre-iterated state is also part of the performance bump.
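For illustration, here is my reading of that step as a Python sketch; the constants match the "cheap multiplier" variant discussed above, but treat it as an informal rendering rather than the canonical definition:

    CHEAP = 0xda942042e4dd58b5                  # the 64-bit "cheap multiplier"
    M64, M128 = (1 << 64) - 1, (1 << 128) - 1

    def dxsm_output(state):
        # DXSM scrambler: xorshift-multiply the high half, then multiply by the odd low half
        hi, lo = state >> 64, state & M64
        hi ^= hi >> 32
        hi = (hi * CHEAP) & M64
        hi ^= hi >> 48
        return (hi * (lo | 1)) & M64

    def cm_dxsm_next(state, inc):
        out = dxsm_output(state)                 # output the PRE-iterated state,
        state = (state * CHEAP + inc) & M128     # so scrambling can overlap the LCG update
        return out, state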

Here are some timings:

In [4]: %timeit p.random_raw(1000000)
3.24 ms ± 4.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: p = rg.PCG64(mode="sequence",use_dxsm=False)

In [6]: %timeit p.random_raw(1000000)
3.04 ms ± 8.47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: import numpy as np

In [8]: p = np.random.PCG64()

In [9]: %timeit p.random_raw(1000000)
3.03 ms ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

All on Ubuntu-20.04, default compiler.

6% slower. Seems like a small difference to me. All timings in NumPy/randomgen are pretty far off from what you can get in an actual tight loop of native code.

Compare

In [10]: x = rg.Xoroshiro128(mode="sequence")

In [11]: %timeit x.random_raw(1000000)
2.59 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

to what has been published from C code (150% slower??).

Granted, in the numpy context, it doesn't matter much. It does matter more in the C++ context, which drove the allocation of cluster-months to testing and the nomination as the future default pcg64 in the official C++ code. Those are the proximate motivations for numpy, IMO.

The difference between stock PCG64 and my PCG64DXSM on my branch ("cheap multiplier", DXSM output function, outputting pre-iterated state, separate codepath):

[practrand]
|1> s = np.random.SeedSequence()

[practrand]
|2> pcg64 = np.random.PCG64(s)

[practrand]
|3> pcg64dxsm = np.random.PCG64DXSM(s)

[practrand]
|4> %timeit pcg64.random_raw(1000000)
100 loops, best of 3: 3.46 ms per loop

[practrand]
|5> %timeit pcg64dxsm.random_raw(1000000)
100 loops, best of 3: 2.9 ms per loop

I still say it is only a few (more than I had) #ifdefs between the two, even with a specialized implementation for MSVC. Plus, real soon now, MS users will be able to use clang (#13816) 👍.

Are we arguing over code duplication? I'd far rather have disjoint implementations than worry about a few lines of code obfuscated with #ifdefs :)

It was mostly just a joke, although it did highlight the need for an absolutely clear statement defining "PCG 2.0" (preferably somewhere that is not NumPy's GitHub issues).

Thanks, @rkern et al.

@imneme What I could use is a blog post about DXSM that's easier to link to than this announcement comment in the old mega-issue. It doesn't have to be much more than what's in that comment, but including the current status of the testing that you mentioned here would be good. If you wanted to summarize some of the discussion from the mega-issue that led to that development, that would be useful, for sure, but not entirely necessary.

Do you have a time-frame in mind? I've been meaning to do this for some time and having it pushed out by other things, so actually having a proposed deadline would be a useful motivator for me. A week perhaps? Two?

@rkern also quoted @vigna, who wrote:

Why not take an LCG with 128 bits of state and a good multiplier (at least 65 bits) and perturb the upper bits using the mix function from SplitMix, which has been heavily tested in different applications (hashing, PRNGs, etc.), giving excellent results?

FWIW, this approach was discussed in the original PCG paper, using _FastHash_ as the off-the-shelf hash function, which is a very similar multiply-xorshift hash function. In my testing, it wasn't as fast as other permutations, but was of high quality. Sebastiano also mentioned this idea in his 2018 critique of PCG and I discuss it in this section of my response to that critique.

In the original version of his PCG critique, he ends by writing his own PCG variant, which I'll quote below:

        #include <stdint.h>

        __uint128_t x;

        uint64_t inline next(void) {
            // Put in z the top bits of state
            uint64_t z = x >> 64;
            // Update state
            x = x * ((__uint128_t)0x2360ed051fc65da4 << 64 ^ 0x4385df649fccf645)
                  + ((__uint128_t)0x5851f42d4c957f2d << 64 ^ 0x14057b7ef767814f);
            // Compute mix
            z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
            z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
            return z ^ (z >> 31);
        }

He's since updated the code in his critique to an even faster version that uses a cheaper multiplier and cuts the additive constant down to only 64 bits:

        #include <stdint.h>

        __uint128_t x;

        uint64_t inline next(void) {
            // Put in z the top bits of state
            uint64_t z = x >> 64;
            // Update state (multiplier from https://arxiv.org/abs/2001.05304)
            x = x * ((__uint128_t)1 << 64 ^ 0xd605bbb58c8abbfd) + 0x14057b7ef767814f;
            // Compute mix
            z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
            z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
            return z ^ (z >> 31);
        }

My issue with both these variants is that the permutation is invertible because it is only the truncated state (the upper half) that gets permuted/scrambled — you can run it backwards and unscramble the scrambling, leaving you with a mere truncated LCG with all the inherent flaws therein. My preference is to permute/scramble the entire state and then output a truncation of that. (Of course, permuting/scrambling fewer bits is going to be the faster way — as usual there are trade-offs. Reasonable people can disagree on what is important.)
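(To make "run it backwards" concrete, here is a Python sketch of inverting that finalizer; every step is invertible, with pow(m, -1, 2**64) giving the modular inverse of each multiplier:)

    M64 = (1 << 64) - 1
    A, B = 0xbf58476d1ce4e5b9, 0x94d049bb133111eb
    INV_A, INV_B = pow(A, -1, 1 << 64), pow(B, -1, 1 << 64)

    def unxorshift(z, s, bits=64):
        # invert z ^= z >> s by propagating the recovered high bits downward
        r = z
        for _ in range(bits // s):
            r = z ^ (r >> s)
        return r

    def unmix(z):
        # undo the three xorshift-multiply steps of the mix, in reverse order
        z = unxorshift(z, 31)
        z = (z * INV_B) & M64
        z = unxorshift(z, 27)
        z = (z * INV_A) & M64
        return unxorshift(z, 30)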

But his work making his own PCG variant provided very useful inspiration for the DXSM permutation when I wrote that last year.

@charris What's your appetite for getting the PCG64DXSM implementation available (but not yet the default) into 1.19.0? What's that timeline? I see we already have 1.19.0rc2 released, which is not _great_ for introducing a new feature. Again, I'm not hair-on-fire about this issue. I'd lean towards releasing 1.19.0 with just documentation of our policy about changes to default_rng(), and introducing the new stuff in 1.20.0.

@rkern The final rc needs to be out there for > two weeks, so we are looking at a release sometime in the latter half of June. I'm in favor of putting the PCG64DXSM in as an optional choice if it facilitates testing, I don't really regard it as a new feature, more like a new accessory. And sometimes it helps move things along to have actual working code out there. NumPy is a trend setter :)

EDIT: Assuming, of course, that there are no big issues with the new code, and it doesn't look like there are. I am also not too worried about problems with PCG64, it seems unlikely anyone will have problems using our recommended procedures.

@imneme One week would be great. Two weeks would be fine. Thanks!

There is a question I've been asking myself which is somewhat off topic. We want our bit generators to produce random bits, but AFAICT, most testing involves integers. How well do the existing tests do in actually testing the bits? If someone more familiar with the area could answer that question I would be most grateful.

What's getting tested is the stream of bits. We do tell the test software the natural word size of the PRNG that we're outputting, but only so that it can do the best foldings of the bits to most efficiently provoke errors that tend to crop up in the low or high bits of the word in bad PRNGs. The software we all tend to use these days is PractRand, and its tests are lightly documented here. The best paper to read is probably the one for TestU01, the previous gold standard test suite. Its User's Guide has more detail on the tests.
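In practical terms, testing one of NumPy's bit generators means piping its raw words into PractRand; a sketch (the RNG_test flags are the ones used throughout this thread):

    # usage: python feed_practrand.py | ./RNG_test stdin64 -tf 2 -te 1 -tlmaxonly
    import sys
    import numpy as np

    bg = np.random.PCG64(np.random.SeedSequence())
    while True:
        # random_raw() yields the generator's native 64-bit words; "stdin64"
        # tells PractRand the word size so it can fold low/high bits sensibly
        sys.stdout.buffer.write(bg.random_raw(65536).tobytes())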

I apologize if this sounds snarky (though it certainly will be pointed), but it is also sincere: I look forward to seeing the implementation, analysis, benchmarks, and PractRand results on your website or on arXiv. We are (reasonably informed) practitioners here, not PRNG researchers, and are not particularly well-equipped to carry out this suggestion. I can see the sense of it, but given the other constraints on my personal time, I don't have an inclination to spend the effort to take this from the suggestion to an implementation and analysis. If you are addressing this suggestion to numpy, we need PRNG researchers to do that work. If you are really addressing this suggestion to someone else, use your website.

I can perfectly understand your viewpoint. The code and benchmarks have been at the bottom of the page commenting on the problems of PCG (http://prng.di.unimi.it/pcg.php) for a couple of years now, under the name LCG128Mix. It takes 2.16ns on my hardware, an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, with gcc 9.2.1 and -fno-move-loop-invariants -fno-unroll-loops.

The code is very simple—it combines a standard LCG with a standard mixing function (Stafford's improved variant of MurmurHash3's finalizer). I slightly modified it to have a programmable additive constant:

    #include <stdint.h>
    __uint128_t x; // state
    uint64_t c;    // stream constant (odd)

    uint64_t inline next(void) {
        // Put in z the top bits of state
        uint64_t z = x >> 64;
        // Update LCG state
        x = x * ((__uint128_t)1 << 64 ^ 0xd605bbb58c8abbfd) + c;
        // Compute mix
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
        z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
        return z ^ (z >> 31);
    }

As I explained before, a 65-bit constant is much better as a multiplier than any 64-bit constant, due to theoretical problems with multipliers smaller than the square root of the modulus.

If you're interested in a more principled design, I will run PractRand tests. Take into consideration, however, that this mixing function yielded an excellent generator, SplitMix, even on top of a much weaker underlying generator (it was just additive) with smaller state (64 bits). So the result is going to be strictly "better" than SplitMix, which passes PractRand at 32TB.

And the underlying generator is an LCG, so you have all the usual bells and whistles from the '60s: jumps, distances, etc. But you also have a statistical guarantee that every bit of the result depends on every bit of the upper 64 bits of state.
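(For the record, those "bells and whistles" boil down to the standard O(log n) jump-ahead for affine maps; a Python sketch:)

    def lcg_jump(x, n, a, c, mask=(1 << 128) - 1):
        # advance the LCG x -> a*x + c (mod 2^128) by n steps in O(log n)
        acc_mult, acc_plus = 1, 0                # accumulated affine map (identity)
        while n:
            if n & 1:                            # fold the current power into the result
                acc_mult = (acc_mult * a) & mask
                acc_plus = (acc_plus * a + c) & mask
            c = ((a + 1) * c) & mask             # square the current affine map
            a = (a * a) & mask
            n >>= 1
        return (acc_mult * x + acc_plus) & mask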

If you have in mind benchmarks against some other generators, or using other compilers, please let me know.

But, please, in the same sincere way: only if you are really interested in considering a "standing on the shoulders of giants" design using standard components only. There are personal constraints on my time, too, and I'm happy to contribute, but I would like to avoid spending time on a generator that has no chance of being considered.

BTW, to give a more tangible measure of how the quality of the multipliers involved can be improved, I computed the spectral scores, from f₂ to f₈, of the current 64-bit multiplier used by PCG DXS and of some alternatives.

Spectral scores are the standard way to judge the goodness of a multiplier: 0 is bad, 1 is excellent. Each score describes how well distributed the pairs, triples, 4-tuples, etc., are in the output.

These seven numbers can be summarized by the classic measure, the minimum, or by a weighted measure (the first score, plus the second divided by two, etc., normalized) that makes the first scores more important, as suggested by Knuth in TAoCP. These are the minimum and the weighted measure for the current multiplier:

0xda942042e4dd58b5  0.633  0.778

There are much better 64-bit constants than that:

0xff37f1f758180525  0.761  0.875

If you go to 65 bits, essentially at the same speed (at least, for LCG128Mix it is the same speed), you get a better weighted measure:

0x1d605bbb58c8abbfd  0.761  0.899

The reason is that 64-bit multipliers have an intrinsic limit on their f₂ score (≤0.93), which, as noted by Knuth, is the most relevant:

0xda942042e4dd58b5  0.795
0xff37f1f758180525  0.928
0x1d605bbb58c8abbfd  0.992

So the first multiplier has a mediocre f₂ score. The second multiplier gets very close to the optimum for a 64-bit multiplier. The 65-bit multiplier does not have these limitations and has a score very close to 1, the best possible in general.

For completeness, here are all the scores:

 0xda942042e4dd58b5  0.794572 0.809219 0.911528 0.730396 0.678620 0.632688 0.639625
 0xff37f1f758180525  0.927764 0.913983 0.828210 0.864840 0.775314 0.761406 0.763689
0x1d605bbb58c8abbfd  0.991889 0.907938 0.830964 0.837980 0.780378 0.797464 0.761493

You can recompute these scores or look for your own multipliers with the code Guy Steele and I distributed: https://github.com/vigna/CPRNG . The better multipliers are taken from the associated paper.

PCG will probably be a fine default PRNG for numpy, but I don't think it will stand the test of time, as there are more promising, though less tested, ways to do this. I propose one in the following.

The half-chaotic SFC64 is one of the fastest statistically sound generators with a reasonably large minimum period. SFC64 has no jump functions, but it can _without speed overhead be extended to support 2^63 guaranteed unique streams_. Simply add a Weyl sequence with a user-chosen additive constant k (must be odd), instead of incrementing the counter by one. Each odd k produces a unique full period. It requires an additional 64 bits of state to hold the Weyl constant:

typedef struct {uint64_t a, b, c, w, k;} sfcw64_t; // k = stream

static inline uint64_t sfcw64_next(sfcw64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t out = s->a + s->b + (s->w += s->k);
    s->a = s->b ^ (s->b >> RSHIFT);
    s->b = s->c + (s->c << LSHIFT);
    s->c = ((s->c << LROT) | (s->c >> (64 - LROT))) + out;
    return out;
}

A 320-bit state is sometimes undesirable, so I have tried to squeeze it down to 256 bits again. Note the changed output function too, which better utilizes the Weyl sequence for bit mixing. It uses 128/128 bits of chaotic/structured state, which seems to strike a good balance:
/EDIT: removed rotl64() from output func + cleanup, Aug. 6:

typedef struct {uint64_t a, b, w, k;} tylo64_t;

static inline uint64_t tylo64_next(tylo64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t b = s->b, out = s->a ^ (s->w += s->k);
    s->a = (b + (b << LSHIFT)) ^ (b >> RSHIFT);
    s->b = ((b << LROT) | (b >> (64 - LROT))) + out;
    return out;
}

This has currently passed 4 TB in PractRand testing without anomalies, and I briefly ran Vigna's Hamming-weight test without issues so far (although passing these tests is no guarantee of near-true-random output; rather, they test whether the PRNG is flawed or not).

Note: it is supposedly a statistical advantage to use a (unique) random Weyl constant with roughly 50% of its bits set, but only further testing or analysis will reveal how significant this is.
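(If one wanted such constants, here is one hypothetical way to draw them; the popcount window is arbitrary:)

    import secrets

    def random_weyl_const(bits=64):
        # draw odd constants until the popcount is near bits/2; as noted above,
        # this balance is reportedly advantageous, not a proven requirement
        while True:
            k = secrets.randbits(bits) | 1
            if abs(bin(k).count("1") - bits // 2) <= 4:
                return k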

/Edits: cleanups.

@tylo-work SFC64 is already in NumPy, along with Philox; this is about the default generator.

Ok, I didn't know exactly which were implemented, so this is only about selecting the most suited overall from those? Fair enough, and thanks for clarifying.

I will try to test my proposed generator extensively to see how it stacks up against the others, and so far it looks very good regarding speed, output quality, simplicity/size/portability, and massive parallel usage. But I would be happy if others tested it as well.

I don't think we're reopening the discussion about the default PRNG from scratch. We have a very specific issue with our current PRNG and are looking at available, closely-related variants that address that specific issue. One of our concerns is that the current default PRNG exposes certain features of the PRNG, like jumpability, that the variant that replaces it must still expose. SFC64 (either ours or yours) doesn't have that feature.

It's possible that @bashtage would be willing to accept a PR for randomgen to add your Weyl-stream variants of SFC64.

@tylo-work If you are interested in parallel execution, you might want to look at NumPy's SeedSequence implementation.

I don't think we're reopening the discussion about the default PRNG from scratch. We have a very specific issue with our current PRNG and are looking at available, closely-related variants that address that specific issue.

Assuming you want something PCG-DXS-like, there are further improvements you can make with just better constants (and a very marginal slowdown). For example, PCG-DXS will soon fail two distinct types of tests on two interleaved, correlated subsequences with the same lower 112 bits of state:

rng=PCGDXS_int112, seed=0x4d198651
length= 128 gigabytes (2^37 bytes), time= 5700 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]TMFn(0+2):wl             R= +57.3  p~=   2e-27     FAIL !!
  [Low8/64]FPF-14+6/64:(1,14-0)     R= +17.5  p =  8.0e-16    FAIL
  [other failures in the same tests]
  ...and 1893 test result(s) without anomalies

Note that we're talking about just ≈65536 correlated sequences—nothing to be afraid of.

But you can improve the generator by choosing a better multiplier, such as 0x1d605bbb58c8abbfd, and a better mixer, such as 0x9e3779b97f4a7c15. The first number is a 65-bit multiplier with much better spectral scores. The second number is just the golden ratio in 64-bit fixed-point representation, which is known to have nice mixing properties (see Knuth's TAoCP on multiplicative hashing); for example, it is used by the Eclipse Collections library to mix hash codes.

As a result, you fail just FPF for the same amount of data:

rng=PCG65-DXSϕ_int112, seed=0x4d198651
length= 128 gigabytes (2^37 bytes), time= 5014 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/64:(0,14-0)     R= +16.1  p =  1.5e-14    FAIL
  [other failures in the same test]
  ...and 1892 test result(s) without anomalies

In fact, if we go further, at 2TB PCG-DXS fails _three_ types of tests for the same interleaved, correlated subsequences:

rng=PCGDXS_int112, seed=0x4d198651
length= 2 terabytes (2^41 bytes), time= 53962 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]TMFn(0+0):wl             R= +50.2  p~=   4e-23     FAIL !!
  [Low8/64]FPF-14+6/64:(1,14-0)     R=+291.1  p =  4.7e-269   FAIL !!!!!!
  [Low8/64]Gap-16:B                 R= +19.5  p =  1.4e-16    FAIL !
  [other failures in the same tests]
  ...and 2153 test result(s) without anomalies

whereas PCG65-DXSϕ still fails just FPF:

rng=PCGDXS65ϕ_int112, seed=0x4d198651
length= 2 terabytes (2^41 bytes), time= 55280 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/64:(0,14-0)     R=+232.1  p =  2.0e-214   FAIL !!!!!!
  [other failures in the same test]
  ...and 2153 test result(s) without anomalies

Sooner or later, of course, PCG65-DXSϕ will also fail Gap and TMFn, but you need to see much more output than with PCG-DXS.

This is the complete code for PCG65-DXSϕ, which is just PCG-DXS with better constants:

#include <stdint.h>

__uint128_t x; // State
uint64_t c; // Additive constant

static inline uint64_t output(__uint128_t internal) {
    uint64_t hi = internal >> 64;
    uint64_t lo = internal;

    lo |= 1;
    hi ^= hi >> 32;
    hi *= 0x9e3779b97f4a7c15;
    hi ^= hi >> 48;
    hi *= lo;
    return hi;
}

static uint64_t inline next(void) {
    __uint128_t old_x = x;
    x = x *  ((__uint128_t)1 << 64 ^ 0xd605bbb58c8abbfd) + c;
    return output(old_x);
}

The marginal slowdown is due to an add instruction (caused by the 65-bit multiplier) and to having two 64-bit constants to load.

I'm not endorsing generators of this kind in general, but PCG65-DXSϕ is measurably better than PCG-DXS at hiding correlation.

@Vigna, FYI, I also did some interleaving testing, and noticed that xoshiro256** failed rather quickly when creating 128 or more interleaved streams; with 256 it failed quickly. The point of the test is to check how well the PRNGs behave when each stream is initialized with some linear dependencies. Essentially, the state is initialized to s[0]=s[1]=s[2]=s[3] = k1 + stream*k2. Then 12 outputs are skipped, which is basically how sfc64 is initialized.

I realize this is not the recommended initialization for xoshiro, but it is still interesting - and a little worrying - that the tests seemed fine for xoshiro with few interleaved streams, but failed with many.

seed: 1591888413
RNG_test using PractRand version 0.95
RNG = RNG_stdin64, seed = unknown
test set = core, folding = standard (64 bit)
...
rng=RNG_stdin64, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 29.6 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]FPF-14+6/16:(1,14-1)     R=  +7.2  p =  3.7e-6   unusual
  [Low1/64]FPF-14+6/16:all          R=  +9.6  p =  1.8e-8   very suspicious
  ...and 261 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 55.5 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/64]FPF-14+6/16:(0,14-0)     R= +13.4  p =  4.7e-12   VERY SUSPICIOUS
  [Low1/64]FPF-14+6/16:(1,14-0)     R=  +9.4  p =  2.6e-8   suspicious
  [Low1/64]FPF-14+6/16:(2,14-1)     R=  +7.7  p =  1.3e-6   unusual
  [Low1/64]FPF-14+6/16:all          R= +17.4  p =  8.8e-16    FAIL !
  ...and 275 test result(s) without anomalies

I also tried to weaken the initialization for SFC64 and TYLO64 to skip only 2 outputs; however, they still seemed OK.
On performance: xoshiro256** runs 33% slower on my machine than the other two. TYLO64 only updates 192 bits of state variables. Here is the test program:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
// tylo64_* and xoshiro256starstar_rand are in the header code below;
// sfc64_t/sfc64_seed/sfc64_rand come from my SFC64 implementation (not shown)

int main()
{
    //FILE* f = freopen(NULL, "wb", stdout);  // Only necessary on Windows, but harmless.
    enum {THREADS = 256};
    uint64_t seed = 1591888413; // <- e.g. this fails. // (uint64_t) time(NULL); 
    fprintf(stderr, "seed: %lu\n", seed);

    static tylo64_t tyl[THREADS];
    static sfc64_t sfc[THREADS];
    static uint64_t xo[THREADS][4];

    for (size_t i = 0; i < THREADS; ++i) {
    tyl[i] = tylo64_seed(seed + (12839732 * i), 19287319823 * i);
    sfc[i] = sfc64_seed(seed + (12839732 * i));
    xo[i][0] = xo[i][1] = xo[i][2] = xo[i][3] = seed + (12839732 * i);
    for (int j=0; j<12; ++j) xoshiro256starstar_rand(xo[i]);
    }
    static uint64_t buffer[THREADS];
    size_t n = 1024 * 1024 * 256 / THREADS;

    while (1/*n--*/) {
        for (int i=0; i<THREADS; ++i) {
        //buffer[i] = tylo64_rand(&tyl[i]);
        //buffer[i] = sfc64_rand(&sfc[i]);
            buffer[i] = xoshiro256starstar_rand(xo[i]);
        }
        fwrite((void*) buffer, sizeof(buffer[0]), THREADS, stdout);
    }
    return 0;
}

I'll include some relevant header code:

typedef struct {uint64_t a, b, w, k;} tylo64_t; // k = stream

static inline uint64_t tylo64_rand(tylo64_t* s) {
    enum {LROT = 24, RSHIFT = 11, LSHIFT = 3};
    const uint64_t b = s->b, w = s->w, out = (s->a + w) ^ (s->w += s->k);
    s->a = (b + (b << LSHIFT)) ^ (b >> RSHIFT);
    s->b = ((b << LROT) | (b >> (64 - LROT))) + out;
    return out;
}

/* stream in range [0, 2^63) */
static inline tylo64_t tylo64_seed(const uint64_t seed, const uint64_t stream) {
    tylo64_t state = {seed, seed, seed, (stream << 1) | 1};
    for (int i = 0; i < 12; ++i) tylo64_rand(&state);
    return state;
}

static inline uint64_t rotl(const uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}
static inline uint64_t xoshiro256starstar_rand(uint64_t* s) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 45);
    return result;
}

@tylo-work I appreciate the analysis, but I really need this issue to stay focused. If you would like to continue that line of discussion, I encourage you to post your work in your own Github repo, and make one more post here inviting people here to it. Everyone else, please respond there. Thank you for your cooperation.

@imneme @rkern Time is running out for the 1.19 release.

@rkern Looks like PCG64DXSM won't make it into 1.19.0, I'll be releasing this weekend. If you could write the note about our change policy/upcoming changes that you mentioned above I would appreciate it.

Sorry, I've been dealing with some other unrelated matters. Based on our discussion, I don't think a small delay is a big issue, since PCG64DXSM was planned as an alternate option, not a new default (for now, at least).

Now that 1.20 is starting up, is it time to revisit this and move to DXSM?

We would still have some time to do the move before branching, but it may be good to get a start on it within the next week or so. @bashtage I guess you have the PCG64DXSM ready to go and this mainly needs the decision to flip the switch on the default stream?

From what I see, it sounded like we should just do this for 1.20 if we have it readily available.

IIRC, we have been waiting for a reference that could be linked. But if the random number folks are happy with the change, we should use it. Do we need any special code for Windows?

It is just a different constant and a different scrambling function. Nothing more novel than what @rkern wrote for the original PCG64 implementation on Windows. I think the decision was to have a fully standalone PCG64DXSM rather than to share some code (for performance).

It would probably make sense to start from @rkern's WIP branch.

I said I would write a blog post about it, which I think @rkern wanted, but I've been attending to some other matters and it hasn't happened yet (sorry). In the meantime, the DXSM permutation has been grinding away under test and continues to look like an improvement over the original. From remarks earlier in the thread, I think @rkern might have liked an even stronger output permutation, but that either costs you speed or (if you cut corners to gain speed) adds trivial predictability.
