Numpy: Decide on new PRNG BitGenerator default

Created on 27 May 2019  ·  166 Comments  ·  Source: numpy/numpy

#13163 will be bringing in the long-awaited replacement of numpy's PRNG infrastructure. In the interest of keeping that PR manageable, we will merge it to master before all of the decisions are finalized, like which BitGenerator will be nominated as the default.

We must make a decision before the first release with the new infrastructure. Once released, we will be stuck with our choice for a while, so we should be sure that we are comfortable with our decision.

On the other hand, the choice of the default does not have that many consequences. We are not talking about the default BitGenerator underlying the numpy.random.* convenience functions. Per NEP 19, these remain aliases to the legacy RandomState, whose BitGenerator remains MT19937. The only place where the default comes in is when Generator() is instantiated without arguments; i.e. when a user requests a Generator with an arbitrary state, presumably to then call the .seed() method on it. This will probably be pretty rare, as it is about as easy to just explicitly instantiate it with the seeded BitGenerator that they actually want. A legitimate choice here might actually be to nominate no default and always require the user to specify a BitGenerator.
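
To make the scope concrete, here is a minimal sketch of the two styles being compared (using PCG64 purely as an example of an explicitly named BitGenerator):

from numpy.random import Generator, PCG64

# Explicit style: always available, names both the BitGenerator and the seed.
g = Generator(PCG64(12345))

# The question in this issue is whether the bare form below should also work,
# and if so, which BitGenerator it should silently construct:
# g = Generator()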

Nonetheless, we will have recommendations as to which BitGenerator people should use most of the time, and while we can change recommendations fairly freely, whichever one has pride of place will probably get written about most in books, blogs, tutorials, and such.

IMO, there are a few main options (with my commentary, please feel free to disagree; I have not attempted to port over all the relevant comments from #13163):

No default

Always require Generator(ChosenBitGenerator(maybe_seed)). This is a little unfriendly, but as it's a pretty convenient way to get the generator properly initialized for reproducibility, people may end up doing this anyways, even if we do have a default.

MT19937

This would be a good conservative choice. It is certainly no worse than the status quo. As the Mersenne Twister is still widely regarded as "the standard" choice, it might help academic users who need their papers to be reviewed by people who might question "non-standard" choices, regardless of the specific qualities of the PRNG. "No one ever got fired for hiring IBM." The main downsides of MT19937 are that it is slower than some of the available alternatives, due to its very large state, and that it fails some statistical quality tests. In choosing another PRNG, we have an _opportunity_ (but not an _obligation_, IMO) to be opinionated here and try to move "the standard", if we wish.

PCG64

This is likely the one that I'll be using most often, personally. The main downside is that it uses 128-bit integer arithmetic, which is emulated in C if the compiler does not provide such an integer type. The two main platforms for which this is the case are 32-bit CPUs and 64-bit MSVC, which just does not support 128-bit integers even when the CPU does. Personally, I do not suggest letting the performance on increasingly-rare 32-bit CPUs dictate our choices. The MSVC performance is important, though, since our Windows builds do need that compiler rather than other Windows compilers. It can probably be addressed with some assembly/compiler intrinsics, but someone would have to write them. The fact that it's _only_ MSVC that we have to do this for makes this somewhat more palatable than other times when we are confronted with assembly.

Xoshiro256

Another modern choice for a small, fast PRNG. It does have a few known statistical quirks, but they are unlikely to be a major factor for most uses. Those quirks make me shy away from it, but that's my personal choice for the code I'll be writing.


Most helpful comment

Very much inspired by this thread, I have some news to report…

Background

By many measures pcg64 is pretty good; for example, under the usual measures of statistical quality, it gets a clean bill of health. It's been tested in various ways; most recently I've run it all the way out to half a petabyte with PractRand. It works well in normal use cases.

BUT, the pathologies that came up in this thread didn't sit well with me. Sure, I could say “well, don't hold it that way”, but the whole point of a general-purpose PRNG is that it ought to be robust. I wanted to do better...

So, about 25 days ago I began thinking about designing a new member of the PCG family…

Goal

My goal was to design a new PCG family member that could be a drop-in replacement for the current pcg64 variant. As such:

  • The output function should scramble the bits more than XSL RR (because doing so will avoid the issues that came up in this thread).
  • The performance should be about as fast (or faster) than the current pcg64.
  • The design must be PCG-ish (i.e., don't be trivially predictable, and thus don't allow _any_ of the work of the output function to be easily undone).

As always there is a trade-off as we try to get the best quality we can as quickly as we can. If we didn't care at all about speed, we could have more steps in the output function to produce more heavily scrambled output, but the point of PCG was that the underlying LCG was “almost good enough” and so we didn't need to go to quite as much effort as we would with something like a counter incrementing by 1.

Spoiler

I'm pleased to report success! About 25 days ago when I was first thinking about this I was actually on vacation. When I got back about ten days ago, I tried the ideas I had and was pleased to find that they worked well. The subsequent time has mostly been spent on various kinds of testing. Yesterday I was satisfied enough that I pushed the code into the C++ version of PCG. Tests at small sizes indicate that it is much better than XSL RR, and competitive with RXS M, but it actually shines at larger sizes. It meets all the other goals as well.

Details

FWIW, the new output function is (for the 64-bit output case):

uint64_t output(__uint128_t internal)
{
    /* Split the 128-bit state into 64-bit halves. */
    uint64_t hi = internal >> 64;
    uint64_t lo = internal;

    lo |= 1;                        /* force the low half odd */
    hi ^= hi >> 32;                 /* xorshift-multiply ("DXSM") scramble of the high half */
    hi *= 0xda942042e4dd58b5ULL;
    hi ^= hi >> 48;
    hi *= lo;                       /* final multiply by the odd low half */
    return hi;
}

This output function is inspired by xorshift-multiply, which is widely used. The choice of multipliers is (a) to keep the number of magic constants down, and (b) to prevent the permutation from being undone (if you don't have access to the low-order bits), while also providing the whole “randomized-by-itself” quality that PCG output functions typically have.

Other changes

It's also the case that 0xda942042e4dd58b5 is the LCG multiplier for this PRNG (and all cm_ prefixed 128-bit-state PCG generators). As compared to 0x2360ed051fc65da44385df649fccf645 used by pcg64, this constant is actually still fairly good in terms of spectral-test properties, but is cheaper to multiply by because 128-bit × 64-bit is easier than 128-bit × 128-bit. I've used this LCG constant for several years without issue. When using the cheap-multiplier variant, I run the output function on the pre-iterated state rather than the post-iterated state for greater instruction-level parallelism.
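
For readers who want to see how these pieces fit together, here is a rough Python sketch of one step of the cheap-multiplier variant as described above (the authoritative implementation is the C++ code in the PCG library; names here are illustrative):

CHEAP_MULT = 0xda942042e4dd58b5        # the 64-bit LCG multiplier discussed above
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def dxsm_output(state):
    # the 64-bit output function shown earlier, transliterated to Python
    hi = (state >> 64) & MASK64
    lo = state & MASK64
    lo |= 1
    hi ^= hi >> 32
    hi = (hi * CHEAP_MULT) & MASK64
    hi ^= hi >> 48
    return (hi * lo) & MASK64

def step(state, inc):
    # output from the pre-iterated state, then advance the 128-bit LCG with
    # the cheaper 128-bit x 64-bit multiply
    out = dxsm_output(state)
    state = (state * CHEAP_MULT + inc) & MASK128
    return out, state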

Testing

I've tested it thoroughly (PractRand and TestU01) and I'm happy with it. Tests included scenarios outlined in this thread (e.g., taking a gang of generators either on sequential streams or advanced by 2^64 and interleaving their output — I tested a gang of four and a gang of 8192 out to 8 TB with no issues, as well as a stream and its opposite-land counterpart).

Speed

I could go on at length about speed tests and benchmarks. There are all sorts of factors that influence whether one PRNG runs faster than another in a given benchmark, but overall, this variant seems to often be a little faster, sometimes a lot faster, and occasionally a little slower. Factors like the compiler and the application have a much greater impact on benchmark variability.

Availability

Users of the C++ header can access this new family member _now_ as pcg_engines::cm_setseq_dxsm_128_64; at some point in the future, I'll switch pcg64 from pcg_engines::setseq_xsl_rr_128_64 to this new scheme. My current plan is to do so this summer as part of a PCG 2.0 version bump.

Formal Announcements

Overall, I'm very happy with this new family member and at some point later in the summer, there will be blog posts with more detail, likely referencing this thread.

Your Choices...

Of course, you have to work out what to do with this. Regardless of whether you'd use it or not, I'd actually be pretty curious to see whether it does better or worse in your speed benchmarks.

All 166 comments

What does the Intel windows compiler do for 128 bit integers? How much slower is PCG64 compiled with MSVC compared to MT19937 on windows? I suspect that the jump ahead feature will be widely used, so it might be good to have it by default.

What does the Intel windows compiler do for 128 bit integers?

Not entirely sure; I don't know if there are ABI implications that ICC would care to be constrained by. If we just want to get any idea of the generated assembly that we could use, then this is a handy resource: https://godbolt.org/z/kBntXH

I suspect that the jump ahead feature will be widely used, so it might be good to have it by default.

Do you mean settable streams, rather? That's a good point, but I wonder if it might not cut the other way. If our choice of default actually matters much, then maybe if we pick one of these more fully-featured PRNGs, people will use those features more extensively in library code without documenting that they require those "advanced" features, because, after all, they are available "standard". But then if another user tries to use that library with a less-featured BitGenerator for speed or other reasons, then they'll hit a brick wall. In a No default or MT19937 world, libraries would be more likely to think about and document the advanced features that they require.

On the gripping hand, that eventuality would make the BitGenerators without settable streams look less desirable, and I _do_ like the notion of advancing what's considered to be best practice in that direction (purely personally; I don't feel an obligation to make NumPy-the-project share that notion). It might help avoid some of the abuses that I see with people .seed()ing in the middle of their code. But again, all that's predicated on the notion that having a default will change people's behaviors significantly, so all of these concerns are likely quite attenuated.
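
As a hypothetical illustration of that brick wall (the .jumped() call below is just a stand-in for whatever stream/jump API a fully-featured BitGenerator might expose):

from numpy.random import Generator, PCG64

def make_parallel_generators(seed, n):
    # library code that silently assumes the BitGenerator can jump ahead;
    # hand it a minimal BitGenerator for speed and this breaks
    bg = PCG64(seed)
    return [Generator(bg.jumped(i)) for i in range(n)]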

How much slower is PCG64 compiled with MSVC compared to MT1993 on windows?

In benchmarks posted by @bashtage in #13163, PCG64 is nearly half the speed of MT19937, which is pretty disappointing performance out of MSVC and friends. Compare that with it being 23% faster on Linux.

What does the Intel windows compiler do for 128 bit integers?

Other compilers like Clang, GCC, and the Intel compiler implement 128-bit integers on 64-bit systems the same way that they implemented 64-bit integers on 32-bit systems. All the same techniques with no new ideas needed. Microsoft didn't bother to do that for MSVC so there are no 128-bit integers directly supported by the compiler.

As a result, for MSVC the existing implementation of PCG64 in #13163 hand-implements 128-bit math by calling out to Microsoft intrinsics like _umul128 on x86_64 (and it could presumably also use equivalent and more portable Intel intrinsics like _mulx_u64 instead), thereby coding what GCC, Clang, and the Intel compiler would do by themselves. The biggest issue is likely that Microsoft's compiler doesn't optimize these intrinsics very well (hopefully they're at least inlined?). It's possible that hand-coded assembler might go faster, but the proper fix would be for the compiler not to be so diabolically poor.
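
To make the emulation concrete, here is a small Python sketch of a truncated 128-bit multiply built from 64-bit halves, roughly the work that _umul128-style intrinsics (or the compiler, when __uint128_t is available) perform for the PCG64 state update:

MASK64 = (1 << 64) - 1

def mul128(a_hi, a_lo, b_hi, b_lo):
    # full 64x64 -> 128-bit product of the low halves (the _umul128 step)
    lo_full = a_lo * b_lo
    res_lo = lo_full & MASK64
    carry = lo_full >> 64
    # cross terms only contribute to the high 64 bits of a truncated 128-bit
    # result; a_hi * b_hi lands entirely above bit 127 and is dropped
    res_hi = (a_hi * b_lo + a_lo * b_hi + carry) & MASK64
    return res_hi, res_lo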

I suspect that the jump ahead feature will be widely used, so it might be good to have it by default.

I'm glad you like jump ahead, but I'm curious why you think it'd be widely used. (Personally, I really like distance, which tells you how far apart two PRNGs are. That's in the C++ version of PCG, but not the C one. It'd be trivial enough to add it though if there was interest.)

I'm glad you like jump ahead, but I'm curious why you think it'd be widely used.

Probably unfamiliarity with current terminology. What I mean is easily obtained independent streams that can be used to run simulations in parallel. I don't know how many simulation problems can be parallelized, but I suspect it is a lot and given the number of cores people get on a chip these days, that could easily make up for a speed disadvantage.

Microsoft didn't bother to do that for MSVC so there are no 128-bit integers directly supported by the compiler.

So that will hurt our wheels. OTOH, many folks on Windows get their packages from Anaconda or Enthought, both of which use the Intel compiler, and folks who really care about performance are probably on Linux, Mac, or maybe AIX.

EDIT: And perhaps if Microsoft is concerned, they could offer a bounty for fixing the problem.

FWIW, here's the assembly that clang would generate for the critical function, including the bits needed to unpack/repack the uint128_t into the struct of uint64_ts: https://godbolt.org/z/Gtp3os

Very cool, @rkern. Any chance you can do the same to see what MSVC is doing with the hand-written 128-bit code?

Very cool, @rkern. Any chance you can do the same to see what MSVC is doing with the hand-written 128-bit code?

It's, uh, not pretty. ~https://godbolt.org/z/a5L5Gz~

Oops, I forgot to add -O3, but it's still ugly: https://godbolt.org/z/yiQRhd

It's not quite that bad. You didn't have optimization on, so it didn't inline anything. I've added /Ox (maybe there's a better option?). I also fixed the code to use the built-in rotate intrinsic (_rotr64) since apparently MSVC is incapable of spotting the C rotate idiom.

Still kind-of a train wreck though. But I think it's fair to say that with a bit of attention, the PCG64 code could be tweaked to compile on MSVC into something that isn't utterly embarrassing for everyone.
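
For reference, the "C rotate idiom" referred to above, transliterated to Python (clang and GCC recognize this pattern and emit a single ror instruction; MSVC apparently needs the _rotr64 intrinsic spelled out):

def rotr64(x, r):
    # rotate the 64-bit value x right by r bits
    r &= 63
    return ((x >> r) | (x << (-r & 63))) & ((1 << 64) - 1)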

In order to allow everything else to be merged, why not pick "no default" for now? That leaves us free to make the decision on the default later (even after one or more releases) without breaking compatibility.

Most of our users are not random number experts, we should be providing defaults for them.

Beyond the prosaic "now they need to type more code", what happens when we change something? In the case where the BitGenerator is hard coded, (because we did not provide a default), every non-sophisticated user will now have to refactor their code, and hopefully understand the nuances of their choice (note we cannot even agree amongst ourselves what is best). However if we provide a default, we might noisily break their tests because the new default or new version is not bit-stream compatible.

Between the assumption that the bit-stream will always be constant versus the assumption that the NumPy developers know what they are doing and the default values should be best-of-brand, I would err on the side of the second assumption, even if it breaks the first.

Edit: clarify which developers should know what they are doing

Most of our users are not random number experts, we should be providing defaults for them.

Well, we'll certainly be documenting recommendations, at the very least, regardless of whether or not we have a default or what the default is.

Beyond the prosaic "now they need to type more code", what happens when we change something?

What "something" are you thinking about? I can't follow your argument.

Beyond the prosaic "now they need to type more code", what happens when we change something?

What "something" are you thinking about? I can't follow your argument.

@mattip is referring to changing the default bit generator.

This would make users who have adopted it mad, and could require some code changes.

For example, if you used

g = Generator()
g.bit_generator.seed(1234)

and the underlying bit generator was changed, then this would be wrong.

If you did the more sane thing and used

Generator(BitGenerator(1234))

then you would not see it.

IMO, when considering the choice of default, we should treat it as fixed until a fatal flaw is found in the underlying bit generator, or Intel adds a QUANTUM_NI to its chips that produces a many-orders-of-magnitude improvement in random-number performance.

I realize I'm a bit of an outsider here, but I don't think it's reasonable to expect that which PRNG is the default choice is fixed forever and never changes. (In C++, for example, std::default_random_engine is at the discretion of the implementation and can change from release to release.)

Rather, there needs to be a mechanism to reproduce prior results. Thus once a particular implementation exists, it is very uncool to change it (e.g., the MT19937 _is_ MT19937, you can't tweak it to give different output). [And it's also uncool to remove an implementation that already exists.]

When the default changes, people who want to keep reproducing old results will need to ask for the previous default by name. (You could make that by providing a mechanism to select the default corresponding to a prior release.)

That said, even if you're allowed to swap out the default generator for something else, it really needs to be strictly better — any features present in the default generator represent a commitment to support that feature in the future. If your default generator has efficient advance, you can't really take that away later. (You could potentially wall off advanced functionality in the default generator to avoid this issue.)

In summary, there are ways to make sure uses can have reproducible results without trying to lock yourselves into a contract where the default is unchanged forever. It'll also reduce the stakes for the choice you make.

(FWIW, this is what I did in PCG. The default PCG 32-bit PRNG is currently the XSH-RR variant [accessed as pcg_setseq_64_xsh_rr_32_random_r in the C library and the pcg_engines::setseq_xsh_rr_64_32 class in the C++ library], but in principle if you really want future-proof reproducibility you should specify XSH-RR explicitly, rather than use pcg32_random_r or pcg32 which are aliases and in principle can be upgraded to something else.)

It is not really forever (this entire project is 90% driven by a real, genuine, and honored forever promise made about 14 years ago), but as you say, switching (a) needs a compelling reason to change and (b) would take at least a few years given the deprecation cycle.

It is much better to try hard today to get it as close to right as possible.

One thing that isn't banned, of course, is improving PRNG code after release as long as it produces the same values. For example, if we went with a PRNG that used uint128, we could let MS add uint128 support (fat chance) or add assembly for Win64 in a future version.

For example, if you used

g = Generator()
g.bit_generator.seed(1234)

and the underlying bit generator was changed, then this would be wrong.

Right, that seems to be arguing, with @eric-wieser, for the "No default" option, which I can't square with the initial statement "Most of our users are not random number experts, we should be providing defaults for them."

Between no default and a friendly, fully assuming default, I would always choose the latter:

Now:

Generator() # OK
Generator(DefaultBitGenerator(seed)) # OK
Generator(seed) # error

_my_ preference:

Generator(1234) == Generator(DefaultBitGenerator(1234))
Generator(*args, **kwargs) == Generator(DefaultBitGenerator(*args, **kwargs))

Now I don't think this is going to get in, but I think one way to prolong the use of RandomState is to make the new Generator only available to users who feel they are expert enough to choose a bit generator.
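
A minimal sketch of what that friendlier construction could look like (make_generator and DefaultBitGenerator are hypothetical names; PCG64 stands in for whichever algorithm gets nominated):

from numpy.random import BitGenerator, Generator, PCG64

DefaultBitGenerator = PCG64   # stand-in for the nominated algorithm

def make_generator(*args, **kwargs):
    # friendly constructor: a bare seed (or nothing) is forwarded to the
    # default BitGenerator, while experts can still pass one explicitly
    if args and isinstance(args[0], BitGenerator):
        return Generator(args[0])
    return Generator(DefaultBitGenerator(*args, **kwargs))

g1 = make_generator(1234)          # == Generator(DefaultBitGenerator(1234))
g2 = make_generator(PCG64(1234))   # expert path, unchanged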

In summary, there are ways to make sure uses can have reproducible results without trying to lock yourselves into a contract where the default is unchanged forever. It'll also reduce the stakes for the choice you make.

Yes, we have that. Users can grab BitGenerators by name (e.g. MT19937, PCG64, etc.) and instantiate them with seeds. BitGenerator objects implement the core uniform PRNG algorithm with a limited set of methods for drawing uniform [0..1) float64s and integers (as well as whatever fun jumpahead/stream capabilities they have). The Generator class that we are talking about takes a provided BitGenerator object and wraps around it to provide all of the non-uniform distributions, the Gaussians, the gammas, the binomials, etc. We have strict stream compatibility guarantees for the BitGenerators. We won't be getting rid of any (that make it to release), nor will we be changing them.
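
As a concrete illustration of that layering (class and method names as in #13163):

from numpy.random import Generator, MT19937

bg = MT19937(12345)            # BitGenerator: the core uniform PRNG algorithm
g = Generator(bg)              # Generator: wraps it with the non-uniform distributions
sample = g.standard_normal(5)  # Gaussians, gammas, binomials, ... live on Generator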

The central question about the default is "What does the code g = Generator(), with no arguments, do?" Right now, in the PR, it creates a Xoshiro256 BitGenerator with an arbitrary state (i.e. drawn from a good entropy source like /dev/urandom). The "No default" option would be to make that an error; users would have to explicitly name the BitGenerator that they want. @eric-wieser's point is that "No default" is a categorically _safe_ option for the first release. A later release providing a default won't cause problems in the same way that changing an existing default does.

@rkern, If you only care about the no arguments case where the seed is autogenerated from available entropy, then it really doesn't matter much what the underlying generator is — it could change on an hourly basis since the results would never be reproducible (different runs would get different seeds).

In contrast, @bashtage seems to care about a default generator that's provided with a seed.

@rkern, If you only care about the _no arguments_ case where the seed is autogenerated from available entropy, then it really doesn't matter much what the underlying generator is — it could change on an hourly basis since the results would never be reproducible (different runs would get different seeds).

You can reseed the BitGenerator after it's created. So if Generator() works, what I'm fully expecting to happen is that people who want a seeded PRNG will just seed it in the next line, as in @bashtage's example:

g = Generator()
g.bit_generator.seed(seed)

That's somewhat tedious, which is why I suggested at top that maybe most people would usually opt for Generator(PCG64(<seed>)) anyways, since it's just about as convenient typing-wise. However, @bashtage correctly notes some resistance when faced with making an extra decision.

So I guess we _also_ have a broader question in front of us: "What are all the ways that we want users to instantiate one of these? And if those ways have default settings, what should those defaults be?" We have some open design space and @bashtage's suggestion for Generator(<seed>) or Generator(DefaultBitGenerator(<seed>)) are still possibilities.

@bashtage How much do you think documentation would help? That is, if we said at the top "PCG64 is our preferred default BitGenerator" and used Generator(PCG64(seed)) consistently in all examples (when not specifically demonstrating other algorithms)?

I might be more convinced to have a default_generator(<seed>) _function_ over Generator(<seed>) or g=Generator();g.seed(<seed>). Then if we really needed to change it and didn't want to break stuff, we could just add a new function and add warnings to the old one. I might recommend marking it experimental for the first release, giving us some time to watch this infrastructure in the wild before making a firm commitment.
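
A minimal sketch of that factory-function idea (the name default_gen and the choice of PCG64 are purely illustrative):

from numpy.random import Generator, PCG64

def default_gen(seed=None):
    # wraps numpy's current opinion behind a name that a later release could
    # deprecate in favour of a new function, without breaking this one
    return Generator(PCG64(seed))

g = default_gen(2019)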

What about actually making a DefaultBitGenerator object that doesn't expose any details of its internal state? This would be a proxy for one of the other bit generator objects, but in principle could be wrapping any of them -- except of course for its specific sequence of generated numbers. This would hopefully discourage users from making programmatic assumptions about what they can do with the default BitGenerator, while allowing us to still use an improved algorithm.

I agree with @bashtage that it would much more friendly to directly support integer seeds as arguments to Generator, e.g., np.random.Generator(1234). This would, of course, make use of DefaultBitGenerator.

In documentation for Generator, we could give a full history of what the default bit generator was in each past version of NumPy. This is basically @imneme's suggestion, and I think would suffice for reproducibility purposes.
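
A rough sketch of what such an opaque proxy might look like (all names are hypothetical, and whether Generator would accept a duck-typed object like this depends on the final implementation):

from numpy.random import Generator, PCG64

class DefaultBitGenerator:
    """Opaque stand-in for whichever algorithm numpy currently prefers."""

    def __init__(self, seed=None):
        self._bg = PCG64(seed)   # implementation detail, free to change later

    @property
    def capsule(self):
        return self._bg.capsule  # the raw-bits interface Generator consumes

    @property
    def lock(self):
        return self._bg.lock     # Generator also shares the BitGenerator's lock

g = Generator(DefaultBitGenerator(1234))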

(Just saw this edit to an earlier comment)

Oops, forget to add -O3, but it's still ugly: https://godbolt.org/z/yiQRhd

For MSVC, it's not -O3, it's /O2 or /Ox (but not /O3!).

In documentation for Generator, we could give a full history of what the default bit generator was in each past version of NumPy. This is basically @imneme's suggestion, and I think would suffice for reproducibility purposes.

Actually, even better would be to include an explicit version argument, like pickle's protocol argument, in Generator/DefaultBitGenerator. Then you could write something like np.random.Generator(123, version=1) to indicate that you want "version 1" random numbers (whatever that is) or np.random.Generator(123, version=np.random.HIGHEST_VERSION) (default behavior) to indicate that you want the latest/greatest bit generator (whatever that is).

Presumably version=0 would be the MT19937 that NumPy has used up to now, and version=1 could be whatever new default we pick.
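
A hedged sketch of how that version lookup could be wired up (which algorithm is "version 1" is purely illustrative, as is the function name):

from numpy.random import Generator, MT19937, PCG64

_VERSIONS = {0: MT19937, 1: PCG64}
HIGHEST_VERSION = max(_VERSIONS)

def make_generator(seed=None, version=HIGHEST_VERSION):
    # resolve the requested version to a concrete BitGenerator class
    return Generator(_VERSIONS[version](seed))

g_old = make_generator(123, version=0)   # reproduce "version 0" results
g_new = make_generator(123)              # latest/greatest default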

What about actually making a DefaultBitGenerator object that doesn't expose any details of its internal state? This would be a proxy for one of the other bit generator objects, but in principle could be wrapping any of them -- except of course for its specific sequence of generated numbers. This would hopefully discourage users from making programmatic assumptions about what they can do with the default BitGenerator, while allowing us to still use an improved algorithm.

Hmmm. That's appealing. It _feels_ like it's overcomplicating things and adding another loop to this Gordian knot (and that there _should_ be a more Alexander-esque stroke available to us), but that's really the only bad thing I have to say about it. It _does_ make the remaining decisions easier: we can focus on statistical quality and performance.

Actually, even better would be to include an explicit version argument, like pickle, in Generator/DefaultBitGenerator.

I'm less a fan of this. Unlike the pickle case, these things have meaningful names that we can use, and we already have the mechanism implemented.

I'm less a fan of this. Unlike the pickle case, these things have meaningful names that we can use, and we already have the mechanism implemented.

Consider the following from the perspective of a typical NumPy user:

  • np.random.Generator(seed, version=0) vs np.random.Generator(seed, version=1)
  • np.random.Generator(MT19937(seed)) vs np.random.Generator(PCG64(seed))

I think it's safe to assume that most of our users know very little about the relative merits of RNG algorithms. But even without reading any docs, they can safely guess that version=1 (the newer default) must be better in most cases than version=0. For most users, that's really all they need to know.

In contrast, names like MT19937 and PCG64 are really only meaningful for experts, or people who have already read our documentation :).

In your use case, no one is selecting the version that they _want_. They only select the version that they _need_ in order to replicate the results of a known version. They are always looking for a specific value that was used (implicitly, because we allowed it to be implicit) in the results that they want to replicate; they don't need to reason about the relationship between multiple values.

And in any case, that level of cross-release reproducibility is something that we've disclaimed in NEP 19. The arguments against versioning the distributions apply just as well here.

A few thoughts on the default:

  • 99.9% of users won't care or want to know about underlying algorithms, they just want random numbers. So +1 for making an opinionated choice for the default, please don't make users choose.
  • dSFMT seems to be simply a faster version of MT19937 (would be nice to state in the docs how fast and remove "SSE2"). Since we're not guaranteeing bitstream reproducibility anyway, the internal state differences are not very interesting and dSFMT should be preferred over MT19937 even if the winning argument here is "make life easier during article review".
  • Performance matters to a significant fraction of the user base. Statistical properties of generators only matters to a _very_ small fraction of users. All included generators are fine for normal use cases. So +1 for choosing the fastest one as default.

Sorry to say - but 32-bit still matters on Windows - see https://github.com/pypa/manylinux/issues/118#issuecomment-481404761

I think we should care a lot about statistical properties, because we are in the process of a big shift towards greater use of resampling methods in data analysis. If Python gets a reputation for being a bit sloppy on this matter, even if it's just by default, that might well be a barrier to uptake by people considering Python for data analysis. I would be very pleased if Python were the package of choice for people who take permutation and simulation seriously.

I think it's fine to offer faster not-state-of-art algorithms, but not by default, to the extent we can avoid it and maintain back-compatibility.

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Textbooks give methods that implicitly or explicitly assume that PRNGs can be substituted for true IID U[0,1) variables without introducing material error [20, 7, 2, 16, 15]. We show here that this assumption is incorrect for algorithms in many commonly used statistical packages, including MATLAB, Python’s random module, R, SPSS, and Stata.

@kellieotto, @pbstark - do y'all have an opinion about what PRNG we should choose here, to give the best possible basis for permutation and bootstrap?

I think we should care a lot about statistical properties, because we are in the process of a big shift towards greater use of resampling methods in data analysis

Agreed. As long as those properties are relevant for some real-world use cases, that's very important. The concerns that are usually brought up are always extremely academic.

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Very interesting article. It does conclude that NumPy is about the only library that gets it right (top of page 9), unlike R, Python stdlib & co.

It would be very useful to get even more concrete examples than in the paper. If our current default generator also breaks down at some point, when is that? Examples like R's sample function generating 40% even numbers and 60% odd numbers when drawing ~1.7 billion samples. What's the bootstrapping/resampling equivalent here?

The latest release of R (3.6) fixes the truncation vs. random bits approach to generating random integers. The Mersenne Twister remains the default PRNG, though.

@kellieotto and I think the default PRNG in scientific languages and statistics packages should be cryptographically secure (a CS-PRNG, e.g., SHA256 in counter mode), with the option to fall back to something faster but of lower quality (e.g., the Mersenne Twister) if speed requires it.

We've been working on a CS-PRNG for Python: https://github.com/statlab/cryptorandom

Performance isn't great (yet). The bottleneck seems to be type conversion within Python to cast binary strings (hash output) as integers. We're working on an implementation that moves more of the work to C.

Cheers,
Philip

Performance matters to a significant fraction of the user base. Statistical properties of generators only matters to a _very_ small fraction of users. All included generators are fine for normal use cases. So +1 for choosing the fastest one as default

First, there is simply no way to pick “the fastest”. @bashtage ran some benchmarks on the current code in #13163 and it was all over the map, with dSFMT winning on Windows and being soundly beaten by PCG-64 and Xoshiro256 on Linux. And this is all on the same machine with the same benchmark. Different hardware architecture (even revisions within x86) will make a difference, as will different benchmarks. (As already discussed in this thread, PCG does poorly in the Windows benchmarks because of issues with MSVC, which is also likely to be a transient thing, since MSVC may improve or people may work around its issues. Probably similar MSVC issues explain why Xoshiro was beaten.)

I also wonder just how big the “significant fraction” of users who care about speed is. Python itself averages out to be about 50× slower than C. What fraction of the NumPy userbase is running it on PyPy (which would give a 4× speed boost)? Some, certainly, but I suspect not a very high number.

And for that “significant fraction” who do care about speed, given all the variability outlined above, who is going to just take your word for it that the default PRNG will run fastest for their application? A sensible thing to do (that is also quite fun and within the reach of most users) is to benchmark the different available PRNGs and see which one is fastest _for them_.

In contrast, although they may find clues in the documentation, figuring out the statistical quality of particular PRNGs is, as you note, not on the radar of most users (and is challenging even for experts). Most won't even know when/whether they should care or not. I'd argue that this is a place for some paternalism — the fact that most users don't care about something doesn't mean that the maintainers shouldn't care about it.

It's true that all the included PRNGs are fine for most use cases, but that is a fairly low bar. Unix systems have shipped with a smorgasbord of C-library PRNGs that are all statistically terrible and yet they've been widely used for years without the world spinning off its axis.

Beyond statistical properties, there are other properties that users might not know to want for themselves but I might want for them. Personally, as a provider of PRNGs I want to avoid trivial predictability — I don't want someone to look at a few outputs from the PRNG and then be able to say what all future outputs will be. In most contexts where NumPy is used, predictability is not an issue — there is no adversary who will benefit from being easily able to predict the sequence. But someone somewhere is going to use NumPy's PRNGs not because they need NumPy to do statistics, but because that's where they've found PRNGs before; that code may face an actual adversary who will benefit from being able to predict the PRNG. Paying a lot (e.g., significant loss of speed) to robustly insure against this outlier situation isn't worth it, but modest insurance might be worth it.

For some forensics and discussion, see: https://arxiv.org/pdf/1810.10985.pdf

Textbooks give methods that implicitly or explicitly assume that PRNGs can be substituted for true IIDU[0,1)variables without introducing material error[20, 7, 2, 16, 15]. We show here that this assumption is incorrect for algorithms in many commonly used statistical packages, including MATLAB, Python’s random module, R, SPSS, and Stata.

FWIW, there is a nice paper by @lemire on efficiently generating a number in a range without bias. I used that as a jumping off point to explore and run some benchmarks too in my own article. (When generating 64-bits, Lemire's method does use 128-bit multiplication to avoid slow 64-bit division, with all the familiar issues that might raise for MSVC users.)
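
For readers who haven't seen it, here is a compact Python transliteration of Lemire's multiply-and-reject method for an unbiased integer in [0, s) from 32-bit outputs (next_u32 is a stand-in for the raw generator):

def bounded_rand(next_u32, s):
    # Lemire's method: take the high bits of a 32x32 -> 64-bit product,
    # rejecting only the rare low products that would introduce bias
    x = next_u32()
    m = x * s
    low = m & 0xFFFFFFFF
    if low < s:                      # only then can the result be biased
        threshold = (1 << 32) % s    # number of "short" slots to reject
        while low < threshold:
            x = next_u32()
            m = x * s
            low = m & 0xFFFFFFFF
    return m >> 32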

@pbstark @kellieotto I read your paper with interest when it showed up on arXiv. I was visiting some friends at BIDS, and they had mentioned your work. The Discussion section notes that "so far, we have not found a statistic with consistent bias large enough to be detected in O(10^5) replications" for MT19937. Have you found one yet? Have you found a concrete example for a 128-bit-state PRNG like PCG64? That seems to me to be a reasonable threshold of practical relevance, where this consideration might start to outweigh others (IMO), at least for the purpose of selecting a general-purpose default.

The nice feature of our new PRNG framework #13163 is that it allows anyone to provide their own BitGenerator that can just be plugged in. It doesn't even have to be in numpy for people to use it in their numpy code. I would encourage you to look at implementing cryptorandom as a BitGenerator in C so we can compare it head to head with the other options.

Personally, I expect those that really care about speed go the extra mile if necessary (it is not much here). We should provide safe defaults, and my current best guess is that this means safe defaults for all purposes with the exception of cryptography (we probably should have a warning about that in the docs). Many users care about speed, but frankly that is exactly why I shy away from giving it too high priority.
That article where NumPy did well seems interesting (kudos to Robert, probably, for getting it right!), but that is actually about the sampler, not the bit generator.

@pbstark maybe you would want to implement this as a BitGenerator compatible with numpy/randomgen? That is likely both be the easiest way to speed it up and make it available to a wide audience in a much more useful form. Since it seems you and Kellie Ottoboni are in Berkeley, we could meet up some time to get that going? (Just an offer, I should have a closer look at the code myself first).

Regarding that _Random Sampling: Practice Makes Imperfect_ paper, it's a nice read, but it's worth remembering that if we had 1 trillion cores producing one number per nanosecond for 100 years, we would generate fewer than 2^102 numbers.

For trivially predictable PRNGs (even ones with large state spaces like the Mersenne Twister), we can actually know whether some specific output sequence can ever be produced (and find the seed that would generate it if it exists or feel wistful if it doesn't), but for other non-trivially-predictable PRNGs we can't (easily) know which output sequences can never be produced and which ones are there but sufficiently rare that we're vanishingly unlikely to ever find them in an eon of searching. (As you may know, I have a PRNG that I know will spit out a zip file with _Twelfth Night_ in it within 2^80 outputs, but good luck ever finding it.)

If you really want a crypto PRNG then the only choice on modern hardware is AES, since it has a dedicated instruction. @lemire has an implementation here https://github.com/lemire/testingRNG/blob/master/source/aesctr.h that is as fast as non-crypto generators. There is also ChaCha20, which can go fast with SIMD. Both will be dog slow on old hardware though. ThreeFry and Philox are already included and are crypto-lite counter PRNGs.

IMO crypto is overrated in terms of cost-benefit. I'm not aware of any important retraction due to PRNG problems with MT, which I reckon has been used in on the order of 10e6 published papers. The only applications I've seen where the PRNG was really problematic were cases where the period was so small that the generator completed the full cycle. Even here the only effect was reducing the sample size of the study, which replicated the main results once rerun on a system with a larger period.

I also wonder just how big the “significant fraction” of users who care about speed is. Python itself averages out to be about 50× slower than C. What fraction of the NumPy userbase is running it on PyPy (which would give a 4× speed boost)? Some, certainly, but I suspect not a very high number.

I suspect you're not a regular user:) NumPy is mostly C under the hood, and is as fast as doing your own thing in C (well, faster mostly). Also, PyPy is not production ready for scientific applications, and slower in any case (because it is limited to using the CPython API that NumPy uses, so it cannot gain the benefits of its JIT).

Either way, this is off-topic. Asserting that speed matter is not controversial.

@imneme we are using Lemire's method for bounded integers. Since this is a fresh start with no legacy or deprecation concerns, we have tried hard to start with good algorithms.

We should provide _safe_ defaults, and my current best guess is that this means safe defaults for all purposes with the exception of cryptography

That's hard to argue with. My question is - what is safe? There are just varying degrees of quasi-randomness with various properties. So far I have not seen anyone give a concrete example, neither here nor in other issues, PRs or threads. Just talking about abstract statistical properties isn't helpful.

My sense is that PCG64 would be a good default. The speed disadvantage on Windows will not be apparent to folks using Anaconda et al., and is likely to be fixed at some point. With parallel execution being the new thing in Python, I also think having settable streams is a desirable property.

I am highly skeptical that the PCG64 speed penalty under Visual Studio is something that cannot be wiped away.

Was this carefully assessed somewhere?

Asserting that speed matter is not controversial.

My question is - what is safe?

Apply the logic consistently: "what is fast"? I don't have a great idea which numpy programs actually have the performance of the BitGenerator as a significant bottleneck. If I use a BitGenerator that's twice as fast, will I get a 5% speed-up in my full calculation? Probably not even that. Python-not-being-as-fast-as-C is not the issue; it's just that even PRNG-heavy programs that are actually useful don't spend a huge amount of time in the BitGenerator. Probably any of the available choices are sufficient.

I am highly skeptical that the PCG64 speed penalty under Visual Studio is something that cannot be wiped away.

Up-thread I show how clang compiles PCG64 into assembly that we can steal for 64-bit MSVC, so no I don't think MSVC on 64-bit Windows is an insurmountable problem.

What may be trickier is PCG64 on 32-bit systems, of which only 32-bit Windows may still be practically important for us. In that case it's less about MSVC than about restricting ourselves to the 32-bit ISA.

What @kellieotto and I point out is that even for modest-size problems, MT's state space is too small to approximate uniform permutations (n<2100) or uniform random samples (n=4e8, k=1000).

That affects everything from the bootstrap to permutation tests to MCMC. The difference between the intended distribution and the actual distribution can be arbitrarily large (total variation distance approaching 2). It's big and it's serious.

We haven't put any effort into breaking MT on "statistical" functions in a couple of years. I'm pretty sure there's a systematic way to break it (since the distributional distances are so large).

Cheers,
Philip

@pbstark What I'd like to see is a concrete implementation of a problem (could be artificial, but not too contrived) on which MT or a 128-bit PRNG fails and cryptorandom works. Can you point out a dataset out there where the resampling method gives wrong inferences with a 128-bit PRNG and correct inferences with cryptorandom?

Moving to PCG64 makes the lower bound on the size of the problem worse, since its state space is even smaller than that of MT. Of course, it could still produce "better" randomness in that it might sample a subgroup of the permutation group more evenly than MT does. But it has to break down before 500 choose 10, and before 21!.

Cheers,
Philip

I do not know enough about PRNGs to really weigh in, in any case; I just want the focus to be on the statistical properties first (if the answer is that they are all very, very good, fine). One thing that I wonder about now is the k-dimensional equidistribution. Do we currently use variants of, say, PCG that do well here compared to MT? (Coming from nonlinear dynamics, that makes me a bit nervous, but I do not have enough overview over PRNGs and I will not get it in the next 2 days...)

It seems unlikely that there are many Windows 32-bit users out there who care about cutting edge performance. It doesn't take much effort to switch to 64-bits.

I'd like to see it too.

We know--on the basis of the math--that there must be many large problems, but we can't point to an example yet.

The precautionary principle would say that since we know there are large problems and we know how to prevent them (CS-PRNGs), we might as well do that by default, and let users be less cautious if they choose to be.

k-equidistribution is an ensemble property of PRNG outputs over the entire period of the PRNG. It's a good thing, but it says nothing about other kinds of failures of randomness, such as serial correlation of outputs. It's a relatively low bar.

@pbstark MT fails a number of statistical tests that PCG (and other generators) passes.

@rkern

If one wants MSVC to generate the ror instruction, I think one needs to use the "_rotr64" intrinsic.

Also one might prefer the '/O2' flag for optimization.

Looking at it, it might indeed be best to write it up in assembly, if one wants to use PCG64.

For @pbstark, here's some output from the PCG-64 initialized with a seed unknown to you (in fact, I'll even tell you the stream, it's 0x559107ab8002ccda3b8daf9dbe4ed480):

  64bit: 0x21fdab3336e3627d 0x593e5ada8c20b97e 0x4c6dce7b21370ffc
     0xe78feafb1a3e4536 0x35a7d7bed633b42f 0x70147a46c2a396a0
  Coins: TTTHHTTTHHTHTTTTHTHHTTTTTHTTHHTTHHTHHTHHHHHHHHTTHHTTHHTHHHHHTHTHH
  Rolls: 5 3 5 2 5 3 1 6 6 5 4 4 5 5 5 6 2 3 5 3 2 3 2 5 6 2 4 6 2 3 4 6 3
  Cards: 5h 3h 3c 8d 9h 7s Kh Ah 5d Kc Tc 6h 7h 8s Ac 5c Ad Td 8c Qd 2h As
     8h 2d 3s 5s 4d 6d 2s Jd 3d 4h Ks 6s Qc Js Th 9d 9c Ts Jh 4c 2c 9s
     6c 4s 7c 7d Jc Qs Kd Qh

Now, let's suppose you initialize another pcg generator with a randomly chosen seed. Let's pick, for sake of argument 0xb124fedbf31ce435ff0151f8a07496d3. How many outputs must we generate before we discover this known output? Because I know the seed I used above I can answer that (via PCG's distance function), about 2.5 × 10^38 (or about 2^127.5) outputs. For reference, 10^38 nanoseconds is 230 billion times the age of the universe.

So there's a sequence in PCG-64 that's really in there, but, practically speaking you'll never find it unless I tell you where to look. (And there would be even more possibilities if we vary the stream.)

Regular PCG has actually zero chance of outputting a Shakespeare Play; the PCG extended generation scheme actually can output a Shakespeare Play, but the chance of it ever doing so in a non-contrived scenario is so infinitesimal it is essentially zero as well. In my eyes, there is very little value in a property that has no practical consequence whatsoever.

(Also, cryptographically secure PRNGs are not guaranteed to be k-dimensionally equidistributed, nor are they a magic bullet for people who want PRNGs that can generate every possible sequence. The moment you want more bits out of a PRNG than it takes in as its seed and stores as its state, there are necessarily some bit sequences that cannot be generated (proof: by the pigeon-hole principle). And if you limit yourself to the same amount of output as you put in as the seed, what you're really looking for is a hash function, or maybe just the identity function if your seed input is truly random, not a PRNG.)

Out of curiosity, I wrapped up an AES counter bit generator using aesctr.h, time in ns per random value:

+---------------------+--------------+----------+--------------+
|                     |   Xoshiro256 |    PCG64 |   AESCounter |
+---------------------+--------------+----------+--------------+
| 32-bit Unsigned Int |      3.40804 |  3.59984 |      5.2432  |
| Uniform             |      3.71296 |  4.372   |      4.93744 |
| 64-bit Unsigned Int |      3.97516 |  4.55628 |      5.76628 |
| Exponential         |      4.60288 |  5.63736 |      6.72288 |
| Normal              |      8.10372 | 10.1101  |     12.1082  |
+---------------------+--------------+----------+--------------+

Nice work, @bashtage.

A few things to bear in mind, one is that the specific AES instructions may vary across architectures and are not present in all actively used CPUs, thus there needs to be a (slow) fallback path.

Also, it's a bit of an apples-to-oranges comparison. In addition to using specialized instructions, the AES code is getting a chunk of its speed from loop unrolling — it's actually generating numbers in blocks and then reading them out. Unrolling can potentially speed up any PRNG. FWIW, @lemire actually has a vectorized version of PCG that uses AVX instructions to generate multiple outputs at once.

Let me see if I can summarize at least one point of consensus: We all agree that numpy should be opinionated about which BitGenerator algorithm to use, and to promote one BitGenerator over others.

Allow me one more stab at sketching out a "No default" option that concords with that consensus and avoids some of the issues that some of the other options might have. If it gets no traction, I'll shut up about it.

What I really meant by the "No default" option was "No anonymous default". There are still ways that we can design the API such that the most convenient way to get a seeded Generator is one that names the PRNG that we nominate. For example, let's say that we _don't_ include a full range of BitGenerator algorithms. We try to keep numpy fairly minimal and leave the completionism to scipy and other third-party libraries, in general, and it may be a good idea to do so here. The beauty of the current architecture is that it allows us to move those BitGenerators out to other libraries. So let's say that we only provide MT19937 to support the legacy RandomState and the one BitGenerator that we prefer people to use. For the sake of argument, let's say that's Xoshiro256. Let's make the Generator.__init__() constructor require a BitGenerator. But also, let's define a function np.random.xoshiro256_gen(seed) that returns Generator(Xoshiro256(seed)) under the covers. We document that convenience function as the way to get a seeded Generator.

Now fast-forward a few releases. Let's say that we pushed off PCG64, ThreeFry, etc. off to random-tng or scipy or some other package, and one of them becomes popular because of the extra features or new statistical flaws are found in Xoshiro256. We decide that we want to update numpy's opinion about which BitGenerator people should use to PCG64. Then what we do is add the PCG64 BitGenerator class and add the np.random.pcg64_gen(seed) function. We add a deprecation warning to np.random.xoshiro256_gen(seed) to say that it's no longer the preferred algorithm: we recommend that new code should use np.random.pcg64_gen(seed), but to continue to use the Xoshiro256 algorithm without warnings, they should explicitly use Generator(Xoshiro256(seed)).
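To make that concrete, here is a minimal sketch of the pair of convenience functions described above (purely illustrative; none of these names are an existing numpy API):

```python
# Hypothetical sketch of the proposal above; names are illustrative only.
import warnings
from numpy.random import Generator, Xoshiro256, PCG64

def xoshiro256_gen(seed=None):
    """First release: the documented way to get a seeded Generator."""
    return Generator(Xoshiro256(seed))

# A few releases later, once the recommendation has moved to PCG64:
def pcg64_gen(seed=None):
    return Generator(PCG64(seed))

def xoshiro256_gen_deprecated(seed=None):   # would replace xoshiro256_gen above
    warnings.warn("xoshiro256_gen is no longer the recommended constructor; "
                  "use pcg64_gen(seed), or Generator(Xoshiro256(seed)) to keep "
                  "this algorithm without a warning.", DeprecationWarning)
    return Generator(Xoshiro256(seed))
```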

I think this avoids some of the problems that I have with an "anonymous" API (i.e. Generator(seed), Generator(DefaultBitGenerator(seed)), np.random.default_gen(seed)). We can support the no-longer-preferred algorithms in perpetuity. We'll never need to make our opinionated "preferred" constructor do something different when we change our opinion. Because we use the real names to distinguish things rather than version numbers, you always know how to update the code to reproduce old results (if you can't stand the harmless warnings for some reason). You can even pick up code with no provenance or recorded numpy release numbers and do that update. At the same time, by limiting the number of algorithms to an absolute minimum and making the best practice the _easiest_ way to work with the library, we are able to express numpy's opinion effectively.

How does that sound? We should still put a lot of effort into making a solid choice for that BitGenerator for the first release. It's still consequential.

It seems unlikely that there are many Windows 32-bit users out there who care about cutting edge performance. It doesn't take much effort to switch to 64-bits.

I agree. As the issue with PCG64 on Win32 is simply performance (and one that we can probably ameliorate with some effort), would you agree that it's not a blocker?

If one wants MSVC to generate the ror instruction, I think one needs to use the "_rotr64" intrinsic.

Also one might prefer the '/O2' flag for optimization.

Thanks! @imneme pointed out all of these blunders for me. :-)

@seberg Do you have any citations from your field that led to your wariness? E.g. a paper that showed that the k=623 equidistribution property of MT19937 fixed problems in nonlinear dynamics simulations that a smaller PRNG caused? I might be able to provide some more specific assurance with that as a reference. In _general_, my view on equidistribution is that you generally want the equidistribution of the PRNG to be close to the maximum allowed by the PRNG's state size. In _practice_, if your PRNG is large enough for your purposes in other regards (passes PractRand, has a period larger than the square of the number of samples you plan to draw, etc.), I have never seen much reason to worry about the precise k. Others may have different opinions, and maybe there are specific issues in your field that I'm not aware of. If that's the case, then there are specific solutions available!

Out of curiosity, I wrapped up an AES counter bit generator using aesctr.h

I could be wrong, but I don't believe that this would help with @pbstark's concerns. AES-CTR is a CS-RNG, but not all CS-RNGs have the large periods that one would need to (theoretically) be able to reach all possible k! permutations for sizable k. The counter is still a 128-bit number and once it rolls over, you've reached the end of the period. @pbstark is advocating for very-large-period PRNGs, most of which just happen to be CS-RNGs.

In general, my view on equidistribution is that you generally want the equidistribution of the PRNG to be close to the maximum allowed by the PRNG's state size.

Although some consider maximal equidistribution a desirable property, it can also be considered a flaw (and there are papers out there saying as much). If we have a _k_-bit PRNG and every _k_-bit sequence occurs exactly once, then that will end up violating the birthday problem, which says that we'd expect to see an output repeat after about 2^(k/2) outputs. (I wrote a birthday-problem statistical test based on these ideas. It correctly detected the statistically implausible absence of any repeats in SplitMix, a 64-bit-output 64-bit-state PRNG, and in Xoroshiro64+, a 32-bit-output 64-bit-state 2-dimensionally-equidistributed PRNG, amongst others.)

Interestingly, although it is very practical to write a statistical test that will fail a PRNG for lack of 64-bit repeats (or too many repeats — we're expecting a Poisson distribution), it is conversely _not_ practical to write a test that will detect the omission of 36.8% of all 64-bit values if we know nothing about which ones are omitted.

Obviously testing for the lack-of-expected-repeats flaw begins to be impractical to run as _k_ gets larger, but as we get to larger and larger state size (and period), the added size means that it is both impractical to show that a maximally equidistributed PRNG is flawed for failing to repeat and equally impractical to show that a non-maximally equidistributed PRNG is flawed for repeating some _k_-bit sequences (in a statistically plausible way) and omitting others entirely. In both cases, the PRNG is too big for us to be able to distinguish the two.
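To put a number on that expectation, a back-of-the-envelope sketch (not the actual test code):

```python
# Expected number of repeated k-bit outputs among n draws from an ideal
# source: roughly C(n, 2) / 2**k, approximately Poisson distributed.
def expected_repeats(n, k):
    return n * (n - 1) / 2 / 2**k

print(expected_repeats(2**33, 64))    # ~2: a true 64-bit source should already repeat
print(expected_repeats(2**33, 128))   # ~5e-20: at 128 bits, repeats are unobservable
```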

I'd like to see it too. We know--on the basis of the math--that there must be many large problems, but we can't point to an example yet. The precautionary principle would say that since we know there are large problems and we know how to prevent them (CS-PRNGs), we might as well do that by default, and let users be less cautious if they choose to be.

I have to say that I'm not persuaded by this line of argument and evidence. As such, I would not feel comfortable standing front of numpy users and tell them that they should switch for this reason. I am unequipped to defend this statement, which is why I am asking for these examples. Those would be persuasive and would prepare me to defend this position that you are recommending that we take.

There are lots of properties by which finite PRNGs fall short of true RNGs. And there are a lot of computations that we want to do that, theoretically, depend on those properties of true RNGs (or at least, we haven't rigorously proven how much we can relax them). But many of those short-comings only have a very small, practically unnoticeable effect on the results of actual computations that we perform. These violations are not dispositive, all-or-nothing, go/no-go kinds of things. They have an effect size, and we can tolerate more or less of the effect.

You show, convincingly, of course, that certain-sized PRNGs cannot evenly generate all k! permutations for some k. The step that I'm missing is how that failure to be able to generate all of those permutations affects a concrete computation that I'd be interested in performing. I'm missing the test that you would recommend that we add to PractRand or TestU01 that would demonstrate the issue to people.

One of the lines of analysis that I found very informative from @imneme's PCG paper was to derive multiple smaller-state versions of each PRNG and see exactly where they started to fail TestU01. That gives us some way to ordinally compare PRNG architectures rather than just say, all-or-nothing, "X PRNG passes" or fails. It also lets us estimate how much headroom we have at the state sizes in use (which pass TestU01 at some large number of GiB of samples). Are there concrete computations that you can do that demonstrate a problem with 8-bit PRNGs? 16-bit PRNGs? Then we can see if this new test would tell us information about the PRNG earlier than TestU01/PractRand currently does. And that would be super-helpful.

On the other hand, if it takes _more_ data drawn from the PRNG to show a failure based on these permutations than the failure point of the small PRNGs for the current suite of tests in PractRand, then I would conclude that this issue is not of practical concern, and we could use "passes PractRand" as a good proxy for whether or not the permutation issue would have a practical effect on my computations.

But until I have a program in my hand that I can run and show people the problem, I'm not comfortable pushing for a CS-PRNG on this basis. I would be unable to explain the choice convincingly.

For those who demand that when shuffling 32 items, all 32! shuffles (i.e., all 263130836933693530167218012160000000 of them) should be generatable, rather than a PRNG that provides merely a random sample from those 32! shuffles, I'd actually say that if you're going to demand huge numbers, you're just not thinking big enough.

I would thus (facetiously) claim that the order in which those shuffles come out should not be predetermined either! Clearly you should demand that it outputs all 32! shuffles in all possible orders — (32!)! is what you need! Of course, this will require 3.8×10^18 exabytes for the state and a similar amount of entropy to initialize, but it's surely worth it to know that everything is there.

... We add a deprecation warning to np.random.xoshiro256_gen(seed) to say that it's no longer the preferred algorithm: we recommend that new code should use np.random.pcg64_gen(seed), but to continue to use the Xoshiro256 algorithm without warnings, they should explicitly use Generator(Xoshiro256(seed))

That's still bothering users with both a deprecation, and with weird-sounding names that they really don't want to know about.

The NEP says: Second, breaking stream-compatibility in order to introduce new features or improve performance will be allowed with caution.

If there's a good enough reason to update our default, then just doing that and breaking bit-for-bit reproducibility for users that did not explicitly specify an algorithm is the better option imho (and one you were arguing for before).

What I really meant by the "No default" option was "No anonymous default".

So is your point that you _want_ users to know the name of the PRNG they're using?

Look at this from a user's point of view. It's hard enough to get them to go from np.random.rand & co to np.random.RandomState() and then use methods. Now we're going to introduce a better system, and what they get to see of it is np.random.xoshiro256_gen()? That'd be a major regression usability-wise.

So is your point that you _want_ users to know the name of the PRNG they're using?

No, it's to mitigate the problems of having a "designated moving target" API like default_generator(seed) that people were working around (e.g. @shoyer's version argument).

Maintaining stream compatibility (which NEP 19 disclaims) is secondary to API breakage. Different BitGenerators have different effective APIs, depending on their feature-sets (settable-streams, jumpahead, primarily, though there can be others, depending on how parameterizable the PRNG is). So some changes in our default PRNG selection would actually break code (i.e. no longer runs or no longer runs correctly), not just change the values that come out.

For instance, let's say we first pick PCG64. It has a 128-bit state, 2^127 settable streams, and implements jumpahead; nice and full-featured. So people start writing default_generator(seed, stream=whatever). Now let's say that future work finds some major statistical flaw in it that makes us want to switch to something else. The next PRNG that we promote as default must have >=128-bit state (easy; I wouldn't recommend anything smaller as a general-purpose default), jumpahead (hard!), >=2^127 settable streams (whoo, boy!), in order to not break the uses of default_generator() that already exist in code. Now maybe we can live with that ratchet.

@shoyer suggested that maybe we could make the default BitGenerator always deliberately hobbled to just the least-common-denominator features. That would work! But it would also miss the opportunity to promote settable-streams to solve the parallel streams problem like @charris would like to do.

Now we're going to introduce a better system, and what they get to see of is np.random.xoshiro256_gen()? That'd be a major regression usability-wise.

If the weird-sounding name is the problem, then I'd be happy to use a more friendly, generic name, as long as the policy is otherwise the same (we add a new function and start warning about the old one). I'd consider that equivalent. We shouldn't be doing this too often.

I'm also fine if we decide to live with the ratchet and avoid a version mechanism.

My take on a "default" is that we could leave it as an implementation detail, so that Generator() would always work. I would parley this with a strong note of caution that the only way to get always reproducible results (up to changes in Generator) is to use the syntax Generator(BitGenerator(kwarg1=1,kwargs2=b,...))

It isn't practical to really hide implementation details, since access to the state is required for pickling.

The alternative is just to treat it like any other function -- and random generation in general -- and go through a standard deprecation cycle should there be a compelling need to change. This won't ever affect users who do things correctly, and with enough warning in docs it might be possible to get a decent hit rate on this, at least among large projects. What I am suggesting here is that one could forget the stream compatibility guarantee was ever made when thinking about the new API.

@bashtage in #13650 I disallowed access to Generator().bit_generator in a way that still allows pickling with no direct access to state. It passes the slightly rewritten test_pickle in a way that would allow use across python Threads

My question is - what is safe? There are just varying degrees of quasi-randomness with various properties. So far I have not seen anyone give a concrete example, neither here nor in other issues, PRs or threads.

"Passes PractRand at _N_ GiB" for some _N_ (512, 1024) is a passable definition if you want a clear bright-line pass/fail criterion. If you want a concrete example, MT and its variants would be excluded based on this criterion. We also removed some of the older Xoroshiro family members from the PR because of this.

If you want a more sophisticated view of statistical quality that allows for ordinal ranking of algorithms, I do recommend to you Section 3 of @imneme's PCG paper, which uses profiles of reduced-state variants of the algorithms to get an idea of how much "headroom" each full algorithm has. This is quite similar to how cryptographers analyze different crypto algorithms. Any viable option being examined must pass the baseline criterion of "not being broken", but this doesn't help you rank the contenders. Instead, they build reduced-round versions of the contender algorithms and see how reduced you have to get before they can break it. If an N-round full algorithm is broken at N-1, then there's very little headroom, and cryptographers would probably avoid it. Similarly, if a 128-bit BitGenerator passes PractRand but its 120-bit version fails, it's probably pretty risky.

@mattip That seems reasonable. Although someone somewhere will

import gc
state = [o for o in gc.get_objects() if 'Xoshiro256' in str(o)][0].state

if they want to delve that deeply, that is fine. I just want to help the non-expert user

It passes the slightly rewritten test_pickle in a way that would allow use across python Threads

It is worth noting that this is an outstanding issue (#9650) -- ideally Generator() would reseed in child threads. IIRC this is only practical in Python >= 3.7

My take on a "default" is that we could leave it as an implementation detail, so that Generator() would always work. I would parley this with a strong note of caution that the only way to get always reproducible results (up to changes in Generator) is to use the syntax Generator(BitGenerator(kwarg1=1,kwargs2=b,...))

There are two kinds of reproducibility that we need to distinguish. One is that I run my program twice with the same seed and get the same results. That's the one we need to support. The other is reproducibility across versions of numpy, which we've disclaimed, at least in the strictest sense.

Argument-less Generator(), i.e. "give me an arbitrarily-seeded PRNG that numpy recommends" is not the primary use case. It doesn't require much support. "Give me the PRNG that numpy recommends with _this_ seed" is, and is what we're discussing options for. We need a way for numpy to express an opinion on how to get a seeded PRNG, and that way needs to be easy and convenient for users (or else they won't use it). I _like_ naming the algorithm (albeit through a more convenient function), but @rgommers thinks that's a step too far, and I'm sympathetic to that.

Argument-less Generator(), i.e. "give me an arbitrarily-seeded PRNG that numpy recommends" is not the primary use case. It doesn't require much support. "Give me the PRNG that numpy recommends with this seed" is, and is what we're discussing options for. We need a way for numpy to express an opinion on how to get a seeded PRNG, and that way needs to be easy and convenient for users (or else they won't use it). I like naming the algorithm (albeit through a more convenient function), but @rgommers thinks that's a step too far, and I'm sympathetic to that.

I would argue that users actually are ill-equipped to provide good seeds. For example, how many users know the right way to seed the Mersenne Twister? It's not as easy as you think — if you're not feeding it 624 random 32-bit integers (to provide 19937 bits of state), you're doing it wrong.

So actually I'd say that the right way for user to get reproducible results is to create the PRNG (without providing a seed, letting it be well seeded automatically) and then pickle it.
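For illustration, here is roughly what "feeding it 624 random 32-bit integers" and the pickle approach look like with the legacy API (a sketch; the file name is arbitrary):

```python
import os
import pickle
import numpy as np

# Fully seeding MT19937 means supplying 624 random 32-bit words (19937 bits
# of state), not one small integer.
full_seed = np.frombuffer(os.urandom(624 * 4), dtype=np.uint32)
rs = np.random.RandomState(full_seed)

# Or: let the generator seed itself from OS entropy, then pickle it so the
# exact state can be restored later.
rs = np.random.RandomState()
with open("rng_state.pkl", "wb") as f:
    pickle.dump(rs, f)
```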

If the discussion is only about the right way, then I'm for Generator(BitGenerator(**kwargs)), since this will only be used by semi-aware users who care about reproducibility.

I do think the default used for Generator() matters, since many will interpret it as a considered choice and so take it as a recommendation when using the seeded form.

Just to throw one more out: a class method Generator.seeded(seed[, bit_generator]) where bit_generator is a string. This would allow the pattern of switching from one value to None to warn if the default was going to change, like lstsq. I would also only support a limited palette of bit generators initially (i.e. 1). It doesn't make it easy to expose advanced features, I suppose. In a perfect world it would be keyword-only, allowing any keyword argument to be passed through, which avoids most deprecation problems. Of course, this doesn't really need to be a class method; a plain seeded function would do.
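A rough sketch of those semantics, written as a plain function (the string names and available mapping are hypothetical, not an existing numpy API):

```python
# Hypothetical sketch only; the string-to-class mapping is a placeholder.
from numpy.random import Generator, MT19937, PCG64

_BITGENS = {"pcg64": PCG64, "mt19937": MT19937}

def seeded(seed, bit_generator="pcg64", **kwargs):
    # Extra keyword-only arguments (e.g. a stream id) are forwarded to the
    # chosen BitGenerator, which sidesteps most in-place deprecation issues.
    return Generator(_BITGENS[bit_generator](seed, **kwargs))

rng = seeded(12345)                # the promoted default algorithm
rng2 = seeded(12345, "mt19937")    # a specific algorithm, by name
```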


From a usability perspective, I think we really do need to support Generator(seed). Otherwise, faced with making a choice they aren't prepared to make, users are just going to stick with RandomState.

For versioning the default bit generator in Generator, we could use bit_version=1 instead of version=1, though I'm also OK with dropping the version idea. I don't think users will need to set bit generators explicitly very often.

My preference for solving specific use cases that need particular generator features would be to design new, generic BitGenerator APIs that hide implementation details. These could either be added to DefaultBitGenerator or put into new classes if their usage involves tradeoffs, e.g., ParallelBitGenerator.

I would definitely like to avoid warnings about future changes in the RNG stream due to changing the default bit generator. These warnings would just be noise to the vast majority of users who don't rely upon such details, but who set a seed in Generator just to stop their random numbers from changing spontaneously.

users are just going to stick with RandomState.

That is fine, they are not the early adopters. I am pushing hard (maybe too hard?) for the minimum possible viable API, since we can always widen the API but it is much more difficult to shrink it. The nuances between Generator(Philox()), Generator(seed(3)) and Generator(bit_version=1) are a bit hard to see until this gets out to end users.

Let's get a first version out with no Generator(seed) and get some feedback.

Let's get a first version out with no Generator(seed) and get some feedback.

OK, I have no serious objections here. In that case, we might as well require specifying the full BitGenerator for now, too.

So actually I'd say that the right way for user to get reproducible results is to create the PRNG (without providing a seed, letting it be well seeded automatically) and then pickle it.

I say the same thing, but I get very little traction with it. As you say, "Well, just because something is a bad idea doesn't mean people won't want to do it!"

Part of the problem is exacerbated by MT's huge state, which really does necessitate serialization out to a file. It's just hard to make that file-based dance the easiest API available such that users want to use it. Things will be better with a default PRNG with much smaller state. 128-bits is the size of a UUID, which is just about small enough to print out in hex and copy-paste. So a good pattern might be to write your program such that it defaults to a good entropy seed and then prints out its state in a way that you can copy-paste it the next time you run the program.

❯ python secret_prng.py
Seed: 0x977918d0c7da45e5168f72005586500c
...
Result = 0.7223650399276123

❯ python secret_prng.py
Seed: 0xe8962534e5fb585483b86119fcb852ce
...
Result = 0.10640984721018876

❯ python secret_prng.py --seed 0xe8962534e5fb585483b86119fcb852ce
Seed: 0xe8962534e5fb585483b86119fcb852ce
...
Result = 0.10640984721018876
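A minimal sketch of that pattern (the argument handling and the choice of BitGenerator are illustrative only):

```python
# Sketch of the "default to an entropy seed, print it, allow it to be passed
# back in" pattern described above.
import argparse
import secrets
from numpy.random import Generator, PCG64

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=lambda s: int(s, 0), default=None)
args = parser.parse_args()

seed = args.seed if args.seed is not None else secrets.randbits(128)
print(f"Seed: {seed:#034x}")   # log it so the run can be reproduced verbatim

rng = Generator(PCG64(seed))
print("Result =", rng.random())
```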

Not sure if a pattern like this would provide both the simplicity and the future proofness.

NumPy 1.next

```python
class Generator:
    def __init__(self, bitgen_or_seed=None, *, bit_generator='pcg64', inc=0):
        ...
```

NumPy 1.20.x

```python
class Generator:
    def __init__(self, bitgen_or_seed=None, *, bit_generator=None, inc=None):
        if bit_generator is not None or inc is not None:
            warn('The default is changing from PCG64 to AESCtr. The inc keyword '
                 'argument is deprecated and will raise in the future', FutureWarning)
```

NumPy 1.22

```python
class Generator:
    def __init__(self, bitgen_or_seed=None, *, bit_generator='aesctr', inc=None, counter=0):
        if bit_generator == 'pcg64' or inc is not None:
            raise Exception('PCG is no longer supported and inc has been removed')
```

Not sure if a pattern like this would provide both the simplicity and the future proofness.

As I noted above in https://github.com/numpy/numpy/issues/13635#issuecomment-496589421, I think this would be surprising and frustrating for most users. I would rather require providing an explicit BitGenerator object than plan to start issuing warnings if users haven't set all optional arguments. That should really be a last resort, for cases where we discover that APIs are broken in ways we did not anticipate.

The problem is the transition period. Everyone using the defaults will suddenly get warnings with no great way to switch over during the transition period. Or at least, you move them from a "low-energy state" of Generator(seed) to a less convenient "high-energy state" of Generator(seed, bit_generator='aesctr'). Since the goal of this API was to provide a convenient "low-energy state", then we've failed our purpose during this transition. We did this once with one of our histogramming functions, IIRC, and it was a nightmare.

This is common to all deprecations that try to change the meanings of arguments in-place. Deprecations that _move_ you from one function to another are much easier to manage, and that's what I was advocating.

Let's get a first version out with no Generator(seed) and get some feedback.

By "first version", do you mean a full numpy release? Or just getting the PR merged (which has since happened)?

If a full numpy release, then we still have some things to determine, like how many BitGenerators we include. If we include the full current complement, then we have disclosed some of the options.

Deprecations that _move_ you from one function to another are much easier to manage, and that's what I was advocating.

+1 agreed

No, it's to mitigate the problems of having a "designated moving target" API like default_generator(seed) that people were working around (e.g. @shoyer's version argument).

Maintaining stream compatibility (which NEP 19 disclaims) is secondary to API breakage. Different BitGenerators have different effective APIs

Ah okay, now it makes more sense to me.

If the weird-sounding name is the problem, then I'd be happy to use a more friendly, generic name, as long as the policy is otherwise the same (we add a new function and start warning about the old one). I'd consider that equivalent. We shouldn't be doing this too often.

This sounds like the best solution so far. It should be possible to choose a sane name here, and there's a large space of other sane names that we likely will never need.

Something like np.random.generator or np.random.default_generator.

how many BitGenerators we include

Could you open a separate issue with a proposal to drop ones you think we should remove out of the currently included (MT19937, DSFMT, PCG32, PCG64, Philox, ThreeFry, Xoshiro256, Xoshiro512)?

We still haven't resolved the issue at hand here: which BitGenerator should be default (currently Xoshiro256)

Well, this issue is more about "which one should numpy promote as the distinguished BitGenerator," which feeds into choice of default but also which ones should be included or dropped. The mechanics by which we provide defaults (if we provide defaults) adds some constraints, so these are all things that more or less need to be decided together. It's a big hairy mess, and after all the work that you've done shepherding this PR through, I'm sure you are exhausted to see another mega-thread with no one contributing code, so you have my sympathies. :-)

As far as the algorithms per se go, I've already given my recommendations: we must keep MT19937 for RandomState and comparison purposes, and I like PCG64 for recommendation purposes.

I messed around a bit on Compiler Explorer, and I think I've implemented PCG64 for 64-bit MSVC using intrinsics in a way that forces the compiler to generate assembly close to clang's uint128_t math: https://godbolt.org/z/ZnPd7Z

I don't have a Windows dev environment set up at the moment, so I don't know if it's actually _correct_... @bashtage would you mind giving this a shot?
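For orientation, the 128-bit arithmetic those intrinsics have to emulate is just the PCG64 (XSL-RR 128/64) step; here's a pure-Python rendering for reference (a sketch, with the constants taken from the PCG reference implementation, not the godbolt patch itself):

```python
# Reference sketch of the PCG64 (XSL-RR 128/64) step that the intrinsics emulate.
PCG_MULT = 0x2360ed051fc65da44385df649fccf645   # default 128-bit multiplier
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def pcg64_next(state, inc):
    state = (state * PCG_MULT + inc) & MASK128        # 128-bit LCG step
    xored = ((state >> 64) ^ state) & MASK64          # fold high half into low
    rot = state >> 122                                # top 6 bits pick the rotation
    out = ((xored >> rot) | (xored << ((64 - rot) & 63))) & MASK64   # rotr64
    return state, out
```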

Without patch:

Uniforms per second
************************************************************
PCG64            62.77 million

With patch:

Uniforms per second
************************************************************
PCG64           154.50 million

The patch passes the tests, including generating the same set of 1000 uint64 values for 2 different seeds.

For comparison, against GCC native and GCC with forced emulation:

Time to produce 1,000,000 Uniforms
************************************************************
Linux-64, GCC 7.4                      PCG64            4.18 ms
Linux-64, GCC 7.4, Forced Emulation    PCG64            5.19 ms
Win64                                  PCG64            6.63 ms
Win32                                  PCG64           45.55 ms

wow, that seems really bad. Maybe we should extend the information on the comparisons page to show the performance of msvc2017-on-win{64, 32} vs. gcc7.4-on-linux{64, 32} on the same machine (I just assume you are using msvc2017; you should probably include that info somewhere).

Win32 is hopeless here. I suspect that 32-bit Linux will also be quite terrible, but I don't have a 32-bit Linux system to test on easily.

I can definitely see the case for making a recommendation for people who are stuck on a 32-bit machine (most likely Windows, due to corporate IT policies). This recommendation is clear: DSFMT for 32-bit (or MT19937, which is also good). Benchmarks would be good though.

For what it's worth, I'm rather sceptical of the oft-repeated PCG claim of multiple independent random streams. Has anyone done any serious statistical analysis to back up the claim of independence? (Actually, I think O'Neill's paper only refers to "distinct" streams, without any claim of independence.)

I think there's good reason to be sceptical: for a given LCG multiplier, all these distinct streams are simply related via scaling[*]. So given any two LCG streams with the same multiplier, one of those will simply be a constant multiple (modulo 2**64 or 2**32 as appropriate) of the other, though with different starting points. The permutation part of the PCG will help hide this a bit, but it really wouldn't be surprising if there were statistically detectable correlations.

So distinct streams, sure, but I wouldn't take the claim of independent streams at face value without some serious testing.

[*] Example: suppose x[0], x[1], x[2], ... is a standard 64-bit LCG stream, with x[i+1] := (m*x[i] + a) % 2**64. Set y[i] := 3*x[i] % 2**64 for all i. Then y[i] is an LCG stream with y[i+1] := (m*y[i] + 3*a) % 2**64, so by simply scaling the original stream you've produced one of these distinct LCG streams with the same multiplier but different additive constant. By using other odd multipliers in place of 3, and assuming that we're only interested in full-period LCGs (and so a is odd), you'll get all the possible full-period LCGs with that multiplier.


EDIT: Fixed wrong statement about number of conjugacy classes.
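A quick numerical check of the scaling relation in the footnote above (the constants here are arbitrary; the relation holds for any odd multiplier and increment):

```python
# If x follows x -> (m*x + a) % 2**64, then y = 3*x % 2**64 follows the
# full-period LCG y -> (m*y + 3*a) % 2**64, so y stays a fixed multiple of x.
M = 6364136223846793005      # any odd 64-bit multiplier works
A = 1442695040888963407      # any odd increment works
MASK = 2**64 - 1

x = 0x0123456789abcdef
y = (3 * x) & MASK
for _ in range(10**5):
    x = (M * x + A) & MASK
    y = (M * y + 3 * A) & MASK
    assert y == (3 * x) & MASK
print("y remained 3*x (mod 2**64) for the whole run")
```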

I think the most thorough public analysis of PCG streams is here: http://www.pcg-random.org/posts/critiquing-pcg-streams.html

@imneme Can you expand on your final advice? "correspondence with David Blackman shows that it may be easier than I had thought to make “nearby” streams with correlated initializations like a constant seed and streams of 1,2,3,4. If you're going to be using multiple streams at the same time, for now I'll recommend that the stream id and the seed should be distinct and not have obvious correlations with each other, which means don't make them both 1,2,3,4."

Does this mean that you think it's okay to have a single good seed (e.g. derived from an entropy source) and then stream IDs 1,2,3,4? Or should both seed and stream ID be chosen randomly from a good entropy source?

Some Linux-32 (Ubuntu 18.04/GCC 7.4) numbers

Time to produce 1,000,000 Uniforms
***************************************************************
Linux-64, GCC 7.4                      PCG64            4.18 ms
Linux-64, GCC 7.4, Forced Emulation    PCG64            5.19 ms
Win64                                  PCG64            6.63 ms
Win32                                  PCG64           45.55 ms
Linux-32, GCC 7.4                      PCG64           25.45 ms

So it is twice as fast as Win-32 but slow. All 4 timings were made on the same machine

Other Linux-32/GCC 7.4 Timing Results
-----------------------------------------------------------------
DSFMT            6.99 ms
MT19937         13.09 ms
Xoshiro256      17.28 ms
numpy           15.89 ms

NumPy here is NumPy 1.16.4. DSFMT is the only generator with good performance on 32-bit (x86). This should be documented clearly for any 32-bit users. MT19937 is also a relatively good choice for a 32-bit user.

So we need to have MT19937 for legacy purposes. If we do want to be minimal about which PRNGs we include (i.e. MT19937 plus our single general-purpose recommendation), then I would not feel compelled to use 32-bit performance to restrict our single general-purpose recommendation nor feel compelled to add a third "recommended-for-32-bit" PRNG. MT19937 will always be available, and it's no worse than what they currently have. And third-party packages will be available for the more niche uses.

Of course, if we do want to include a more complete set of PRNGs for other reasons, then we can make all kinds of specific recommendations in the documentation.

I was curious how much the "P" part of PCG mitigated potential issues from correlated streams.

So here's (perhaps) the worst possible case for the LCGs: where the additive constant of one LCG stream is the exact negation of the additive constant for the other. Then with appropriately terrible choices of seed, we end up with one of the LCG streams being the exact negation of the other.

But now if we're using both streams to generate a series of floats, both the permutation part of the PCG and the conversion to float64 should help us a bit.

Here's a plot that shows how much the permutation helps:

[scatter plot image]

That's a scatter plot of 10000 floats from one such stream, against 10000 from its negated twin. Not terrible, but not great either: there are clear artefacts.

Not sure what to conclude from this: it's absolutely a contrived example, that you're unlikely (I hope) to run into by accident. On the other hand, it does demonstrate that some thought and care is required if you really need multiple uncorrelated streams.

For the record, here's the source:

import matplotlib.pyplot as plt
import numpy as np

from pcgrandom import PCG64

# Make gen2's underlying LCG increment and state the exact negations
# (mod 2**128) of gen1's: the contrived worst case described above.
gen1, gen2 = PCG64(), PCG64()
multiplier, increment, state = gen1._get_core_state()
new_increment, new_state = -increment % 2**128, -state % 2**128
gen2._set_core_state((multiplier, new_increment, new_state))

# Plot 10000 outputs from one stream against 10000 from its negated twin.
xs = np.array([gen1.random() for _ in range(10**4)])
ys = np.array([gen2.random() for _ in range(10**4)])
plt.scatter(xs, ys, s=0.1)
plt.show()

PCG64 is the generator that O'Neill calls PCG-XSL-RR (section 6.3.3 of the PCG paper). The pcgrandom package is from here

I thought the standard way to get independent streams is using jumpahead().
Re-seeding to get “independent” streams is in general dangerous.

Counter/hash generators have a trivial jumpahead(). Does PCG?

Also a plea from a user: please provide at least one bitstream that is of cryptographic quality, with an unbounded state space.

Cheers,
Philip

(EDIT by seberg: Removed email quote)

@pbstark: This isn't just reseeding: the two underlying LCG generators are actually distinct: x ↦ mx + a (mod 2^128) and x ↦ mx + b (mod 2^128) for different increments a and b. O'Neill's PCG paper sells the idea of being able to create different streams by changing that LCG increment (see section 4.3.2 of the paper).

But the simplicity of LCG means that changing that additive constant _does_ just amount to a jumpahead by some unknown amount in the original generator, combined with a simple linear transformation (multiplication by a constant, or just addition of a constant in some cases).

Not a reason not to use PCG, and I'm not for a moment arguing that it's not suitable for NumPy's new main PRNG; I just don't want people to be taken in by the promise of "independent" random streams. At best, the settable streams idea for PCG offers a convenient way to do something equivalent to a quick jumpahead plus a bonus extra multiplicative or additive transformation.

We discussed the cryptographic one a bit in the community call. I think we were a bit cautious about it. It seems like a good idea, but if we include a cryptographically sound RNG, we also have to keep up with any security issues that come up, since we do not know whether users use them for actual cryptography purposes.

On the note of how many to include: The consensus tended towards it being fine to having a few more bit generators around (some small documentation would be good of course). The maintenance burden does not seem too large. In the end, my guess is that we would go with whatever Kevin and Robert suggest.

On the note of names: I personally do not mind using the RNG names and forcing users to use them; the only downside is that we might have to look up the name while coding. We should just try to have as few deprecation warnings as possible. I like the minimal exposed API for the default RNG without seeding.

@mdickinson, I tried to reproduce your graph myself and failed. I used the canonical C++ version of the code with this program which should be the moral equivalent of yours.

#include "pcg_random.hpp"
#include <iostream>
#include <random>

int main() {
    std::random_device rdev;
    pcg_detail::pcg128_t seed = 0;
    pcg_detail::pcg128_t stream = 0;
    for (int i = 0; i < 4; ++i) {
        seed   <<= 32;           
        seed   |= rdev();
        stream <<= 32;           
        stream |= rdev();
    }
    pcg64 rng1(seed,stream);
    pcg64 rng2(-seed,-stream);
    std::cerr << "RNG1: " << rng1 << "\n";
    std::cerr << "RNG2: " << rng2 << "\n";
    std::cout.precision(17);
    for (int i = 0; i < 10000; ++i) {
        std::cout << rng1()/18446744073709551616.0 << "\t";
        std::cout << rng2()/18446744073709551616.0 << "\n";
    }
}

When I run this, it outputs (to allow reproducibility):

RNG1: 47026247687942121848144207491837523525 203756742601991611962280963671468648533 41579532896305845786243518008404876432
RNG2: 47026247687942121848144207491837523525 136525624318946851501093643760299562925 52472962479578397910044896975270170620

and data points that can plot the following graph:

[scatter plot image]

If you can figure out what I'm doing differently, that'd be helpful.

(I'm not saying this to refute the idea that correlations are possible, because they are, and I'll write a separate comment on the topic, but it was in writing that comment that I realized I couldn't reproduce your result using my usual tools.)

@mdickinson punched in the computed state and increment directly into the internals, bypassing the usual initialization routine, which has two step-advancements.

I wrote a quick little driver script that interleaves multiple PCG32 streams, constructed in various ways, to feed into PractRand. It uses master of numpy on Python 3. When I punch in the adversarial internal states/increments directly, PractRand fails quickly. It's unclear to me if we can find reasonable ways to find adversarial _seeds_ (that actually pass through the initializer routine) to achieve the adversarial states.

As noted in my blog post that was mentioned earlier, PCG's streams have a lot in common with SplitMix's.

Regarding @mdickinson's graph, for _every_ PRNG that allows you to seed its entire state, including counter-based cryptographic ones, we can contrive seedings where we'd have PRNGs whose outputs were correlated in some way (the easiest way to do so is to make PRNG states that are a short distance apart, but often we can do other things based on an understanding of how they work). And although PRNGs that don't allow full-state seeding can avoid this issue, doing so just introduces a new one, only providing practical access to a tiny fraction of their possible states.

The right way to think of streams is just more random state that needs to be seeded. Using small values like 1,2,3 is generally a bad idea for any seeding purposes for _any_ PRNG (because if everyone favors these seeds, their corresponding initial sequences will be overrepresented).

We can choose not to call it a stream at all and just call it state. That's what Marsaglia did in XorWow. If you look at the code, the Weyl sequence counter doesn't interact with the rest of the state at all, and, like LCGs, variations in the initial value really just amount to an added constant.

SplitMix's, PCG's and XorWow's streams are what we might call “stupid” streams. They constitute a trivial reparameterization of the generator. There is value in this, however. Suppose that without streams, our PRNG would have an interesting close repeat of 42, where 42 crops up several times in quick succession and only does this for 42 and no other number. With stupid “just an increment” or “just an xor” streams, we'll actually avoid hardwiring the weird repeat to 42; all numbers have a stream in which they are the weird repeat. (For this reason, the fix I'd apply to repair the close-repeat problems in Xoshiro 256 is to mix in a Weyl sequence.)

I'm not an expert, but on the cryptography side, what is proposed that is not available in:
https://cryptography.io/en/latest/ from the Python Cryptographic Authority?

Their page on random number generation also mentions:

Starting with Python 3.6 the standard library includes the secrets module, which can be used for generating cryptographically secure random numbers, with specific helpers for text-based formats.

I guess maybe adding arrays to the generation. I have to wonder if the potential maintenance burden of being associated with cryptographic robustness is really worth it and appropriate in NumPy, vs. say communicating with the pyca and maybe thinking about a third-party generator / plugin for that. I think Nathaniel mentioned a similar concern previously.

Indeed, it seems to me that things like the potential dtype refactor / enhancement are also designed to provide API infrastructure without necessarily taking on the burdens of maintaining a large variety of specialized new applications.

BTW, there is also more about contriving correlated PRNG states in my response to Vigna's PCG critique [specifically, this section]. Something I observed there is that with PCG, because it has a distance function, you can actually use that function to check for contrived seeds. In PRNGs without a distance function, people can still contrive seeding pairs that are poor choices (especially if they bypass the public API for seeding), but there is no provided mechanism that can detect even the most blatant of contrivances.

On the one hand, it's fun to think about whether it's possible to take your favorite (or least favorite) PRNG and contrive seeds for it that make it to do amusing or terrible things (e.g., this pathology)…

But looking at the bigger picture, I think it makes sense to look at the issues facing users in practice, what advice we give them, etc. Most users don't realize that _for all PRNGs (past and future)_, a 32-bit seed is an absolutely terrible idea and results in trivially detectable bias no matter which PRNG is in play. Sure, we can brush that off and instead spend our time worrying about whether someone might manage to initialize the Mersenne Twister to a mostly-zeros state (or the all-zeros state where LFSRs fail to work at all!), or whether someone might initialize Xoshiro to near the point where it repeats the same output seven times in the space of eleven outputs, or contrive two similar PCG streams, or whatever, but all these contrivances have basically infinitesimal (in practice zero) chance of happening if the generator is seeded with random data. As intellectually engaging and academically interesting as these diversions are, thinking about them while mostly ignoring the fact that users typically have little idea of what they are doing when it comes to seeding is fiddling while Rome burns.

If inc=1,2,3,4 is a bad idea, wouldn't that suggest that it should be either documented very clearly, or maybe we should have a slightly different API? Maybe even new_generator = (Bit)Generator().independent(), we can put a warning on it if the (underlying) bit generator does not provide a great way to achieve that.

Also, depending on how bad 32bit seeding is. Can we think of a nice API for creating and storing a seed to freeze it? I do not know. Maybe even a "create a frozen seed cache file if it does not exist".

For PCG could just seed -> uint64_t[2] -> splitmix64 (seed_by_array) -> uint128 which would make sure that low, consecutive seeds are spread out.

For PCG could just seed -> uint64_t[2] -> splitmix64 (seed_by_array) -> uint128 which would make sure that low, consecutive seeds are spread out.

Or just use any good integer hash. (It should be a bijection.). There are plenty of cheap and short ones. A few rounds of Multiply–XorShift is fine.
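One concrete mixing step of the kind suggested above is the splitmix64 finalizer, a 64-bit bijection (constants as commonly published; the way it's applied to build a 128-bit seed below is just an illustration):

```python
# Spread small, consecutive user seeds across the whole 64-bit space before
# they reach the BitGenerator.
MASK64 = (1 << 64) - 1

def splitmix64_step(x):
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

# e.g. derive the two 64-bit halves of a 128-bit PCG seed from one integer:
user_seed = 3
lo, hi = splitmix64_step(user_seed), splitmix64_step(user_seed + 1)
```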

To @mdickinson's point, I think that he still wants a bit of convincing that the stream dependence is limited to a small set of contrived/adversarial settings. And if that's the case, then we can solve it with proper practices to prevent such instances. With the current code that we have, there are some bad states that users might easily lapse into with the current APIs. I can confirm David Blackman's finding that setting both seed=1 and inc=0,1,2,... creates correlations. My latest PractRand driver for interleaved PCG32 streams can be used to demonstrate this.

❯ ./pcg_streams.py --seed 1 --inc 0 |time ./RNG_test stdin32
[
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 12728272447693586011,
            "inc": 1
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 7009800821677620407,
            "inc": 3
        }
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x470537d5
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x470537d5
length= 128 megabytes (2^27 bytes), time= 4.0 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+0,13-3,T)                  R=  +9.6  p =  2.3e-4   mildly suspicious
  ...and 116 test result(s) without anomalies

rng=RNG_stdin32, seed=0x470537d5
length= 256 megabytes (2^28 bytes), time= 8.7 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+0,13-2,T)                  R= +26.1  p =  6.3e-13    FAIL           
  ...and 123 test result(s) without anomalies

./RNG_test stdin32  8.86s user 0.11s system 93% cpu 9.621 total

I haven't yet run into a failure with a random seed but the same close increments. I'll check in with you tomorrow.

I do notice that we are not using the recommended default increment when none is specified in the constructor. We should probably fix that. Maybe that would be a good base number from which to derive the actual increment for a given stream ID, instead of 2*inc + 1.

We can try to build some tooling to assist people in using the default entropy-seeds and saving them out. One question that I have is if the increments for multiple streams can be generated simply or if we also need to entropy-sample them and save them as well. It is really convenient to be able to encode the "initial state" of a simulation as single number that can be copy-pasted from a colleague's email rather than an opaque file. With these smaller PRNGs with only 128 or 256 bits of state, I can easily print that out in hex into my log file and then just copy-paste it into my command line when I want to reproduce it. It's bigger than a 32-bit integer, but it's manageable. If I have to entropy-sample all of my stream IDs as well, then I have to forgo that and make sure that I record everything in a state file somewhere. That might foreclose some use cases that we've discussed where we want to dynamically spawn new streams. If I can just increment a counter to get a good stream ID (maybe deriving it through a hash of the counter, or whatever), then I only need to record the initial seed and not the stream IDs.

IIRC, the secrets module calls the OS's entropy source, which can be quite bad in some systems, and is not replicable/reproducible regardless.


@tylerjereddy Those are for getting a small amount of random bits from a physical entropy source that are unpredictable to an attacker (and you!). They are used in cryptography for things like initialization vectors, nonces, keys, which are all short. The whole point of these is that there is no way to reproduce them, which is at odds with the numerical simulation purposes of np.random. That page is not talking about _reproducible_ cryptographically secure PRNGs, which are also things that exist and _could_ be built from the primitives available in the cryptography package. In _practice_, however, we have better implementations of those algorithms already available to us in efficient C code, at least the ones that have been formulated and tested for simulation purposes. @bashtage implemented a few for this framework.

I also want to be clear to the numpy team that what @pbstark is proposing is not just any crypto-based PRNG. Rather, he wants one with _unbounded state_, which would provide the mathematical property that he's looking for.

Most of the crypto-based PRNGs that are commonly considered for simulation do _not_ have the unbounded state that @pbstark wants. They are typically based around encrypting a finite counter. Once that counter rolls around, you've hit the finite period. Technically, his cryptorandom is also bounded to 2**(256+64) unique initial conditions due to the fixed-size 256-bit digest state and fixed-size 64-bit length counter. That probably points the way to implementing a truly-unbounded PRNG by making the length counter arbitrarily-sized, but I've never seen such an algorithm published or tested.

On the other hand, if you just want a PRNG algorithm that has an arbitrarily-sized state, just one that's fixed at the beginning to something above whatever you need, then PCG's extended generators would work well for that task. These are distinctly not CS-PRNGs, but they would actually satisfy @pbstark's desire to have huge state spaces on demand. Nonetheless, I don't recommend that we include them in numpy.

There are other properties that the standard, bounded, CS-PRNGs have that we may want, but they aren't a no-brainer default, IMO.

Cryptorandom's state space is not the 256 bit hash: it's unbounded. The
seed state is a string of arbitrary length, and each update appends a zero
to the current state. Incrementing an unbounded integer counter would
accomplish the same thing. We initially implemented that, but changed to
append rather than increment, because it allows a more efficient update to
the digest than hashing each state from scratch (noticeable speedup).


I'm afraid that's not how the online update of SHA-256 works. The state that it maintains is only the 256-bit digest and the fixed-size 64-bit length counter that it appends when it computes the update. It doesn't hold the whole text. Hashes compress. That's how it's able to do the efficient update on each byte. Fundamentally, there are many initializations/past histories that map to the same internal SHA-256 state, which is finite. While the cycles are certainly long, maybe longer than 2**(256+64), they certainly exist. And in any case, you only have less than 2**(256+64) possible initial conditions (for each text length 0 to 2**64-1, you can have at most 2**256 internal hash states; once the text length is over 32 bytes, there must be collisions a la pigeonhole). There just aren't any more bits in the data structure.

Thanks very much; understood. I would express it differently: the state
space is unbounded, but (by pigeonhole) many distinct initial states must
produce indistinguishable output sequences.


It's also the case that there are only 2**(256+64) states that it can possibly go through. Since the update takes the same form every time, you eventually hit a state that you've seen before and enter a loop of unknown (to me) but finite period. Whether it's the finite number of initial states or a finite period, cryptorandom has both, and I think they are smaller even than MT19937.

Not that I think that's a problem with cryptorandom, per se. I have not been convinced that having an unbounded set of initial states or unbounded period is something that is actually needed in a practical sense.

I haven't yet run into a failure with a random seed but close increments. I'll check in with you tomorrow.

Still going strong at 512GiB:

❯ ./pcg_streams.py -i 0 |time ./RNG_test stdin32
[
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 10843219355420032665,
            "inc": 1
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 5124747729404067061,
            "inc": 3
        }
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0xb83f7253
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0xb83f7253
length= 128 megabytes (2^27 bytes), time= 4.0 seconds
  no anomalies in 117 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 256 megabytes (2^28 bytes), time= 8.6 seconds
  no anomalies in 124 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 512 megabytes (2^29 bytes), time= 16.9 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+2,13-2,T)                  R=  -8.0  p =1-2.1e-4   mildly suspicious
  ...and 131 test result(s) without anomalies

rng=RNG_stdin32, seed=0xb83f7253
length= 1 gigabyte (2^30 bytes), time= 33.8 seconds
  no anomalies in 141 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 2 gigabytes (2^31 bytes), time= 65.7 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+2,13-1,T)                  R=  -7.8  p =1-3.8e-4   unusual          
  ...and 147 test result(s) without anomalies

rng=RNG_stdin32, seed=0xb83f7253
length= 4 gigabytes (2^32 bytes), time= 136 seconds
  no anomalies in 156 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 8 gigabytes (2^33 bytes), time= 270 seconds
  no anomalies in 165 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 16 gigabytes (2^34 bytes), time= 516 seconds
  no anomalies in 172 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 32 gigabytes (2^35 bytes), time= 1000 seconds
  no anomalies in 180 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 64 gigabytes (2^36 bytes), time= 2036 seconds
  no anomalies in 189 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 128 gigabytes (2^37 bytes), time= 4064 seconds
  no anomalies in 196 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 256 gigabytes (2^38 bytes), time= 8561 seconds
  no anomalies in 204 test result(s)

rng=RNG_stdin32, seed=0xb83f7253
length= 512 gigabytes (2^39 bytes), time= 19249 seconds
  no anomalies in 213 test result(s)

Ah, if we do more than 2 streams with sequential increments, we see failures quickly. We will need to look at deriving the actual increment from the user input with some kind of bijection to spread it around the space.

❯ ./pcg_streams.py -n 3  | time ./build/RNG_test stdin32
[
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 18394490676042343370,
            "inc": 2891336453
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 12676019050026377766,
            "inc": 2891336455
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 6957547424010412162,
            "inc": 2891336457
        }
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0x4a9d21d1
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0x4a9d21d1
length= 128 megabytes (2^27 bytes), time= 3.2 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R= +19.4  p =  1.6e-11    FAIL           
  [Low8/32]DC6-9x1Bytes-1           R= +13.2  p =  4.6e-8    VERY SUSPICIOUS 
  ...and 115 test result(s) without anomalies

@imneme

If you can figure out what I'm doing differently, that'd be helpful.

I think you need to replace the line pcg64 rng2(-seed,-stream); in your code with pcg64 rng2(-seed,-1-stream);, to allow for the increment = 2 * stream + 1 transformation. Negation of the increment corresponds to bitwise negation of the stream index. If I make that change and run your code, I see something very much like my earlier plot. (And I confirm that if I don't make that change, then everything looks good visually.)
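
A quick way to convince yourself of that identity, as a minimal sketch: with increment = 2 * stream + 1, negating the increment mod 2**64 is the same as bitwise-negating the stream index.

MASK = (1 << 64) - 1
for stream in (0, 1, 12345, 0xDEADBEEF):
    inc = (2 * stream + 1) & MASK
    # -(2*stream + 1) == 2*(~stream) + 1  (mod 2**64)
    assert (-inc) & MASK == (2 * (~stream & MASK) + 1) & MASK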

@imneme

The right way to think of streams is just more random state that needs to be seeded.

Agreed. I think that gives a very clean picture for LCGs: for a 64-bit LCG with fixed well-chosen multiplier a, we then have a state space of size 2^127, consisting of all pairs (x, c) of integers mod 2^64, where c is the odd increment. The state update function is next : (x, c) ↦ (ax+c, c), dividing the state space into 2^63 disjoint cycles of length 2^64 each. Seeding just involves picking a starting point in this state space.

There's then an obvious group action that makes analysis easy, and makes the relationships between the different streams clear: the group of invertible 1-d affine transformations on Z / 2^64Z has order exactly 2^127, and acts transitively (and so also faithfully) on the state space: the affine transformation y ↦ ey + f maps the pair (x, c) to (ex + f, ec + (1-a)f). That group action commutes with the next function, so the unique group element that transforms one point (x, c) in the state space into another, (x2, c2), also maps the sequence generated by (x, c) to the sequence generated by (x2, c2).
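
A minimal numerical check of that commutation claim (the multiplier below is just an arbitrary odd constant chosen for illustration, not one used by PCG):

import random

M = 1 << 64
a = 0xd1342543de82ef95            # arbitrary odd multiplier, for illustration only

def nxt(x, c):                    # the LCG state update (x, c) -> (a*x + c, c)
    return ((a * x + c) % M, c)

def act(e, f, x, c):              # action of the affine map y -> e*y + f on states
    return ((e * x + f) % M, (e * c + (1 - a) * f) % M)

rng = random.Random(0)
for _ in range(1000):
    x, c = rng.randrange(M), rng.randrange(M) | 1   # c odd
    e, f = rng.randrange(M) | 1, rng.randrange(M)   # e odd, so the map is invertible
    assert act(e, f, *nxt(x, c)) == nxt(*act(e, f, x, c))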

tl;dr: for a fixed multiplier, any two LCG sequences with the same multiplier (whether using the same increment, as in the lookahead case, or different increments) are related by an affine transformation. In the unfortunate cases that we want to avoid, that affine transformation is something horribly simple, like adding 2 or multiplying by -1. In the general case, we hope that the affine transformation is complicated enough that standard statistical tests can't detect the relationship between the two streams.

@mdickinson covers the situation nicely. PCG's permutations will change things a bit from the LCG case, but not a lot. The whole point of the PCG permutations is that we can choose how much scrambling to do. Because truncated 128-bit LCGs already pass BigCrush, when I picked a permutation for pcg64 I chose a modest amount of scrambling for that size LCG (XSL RR). In contrast, 64-bit LCGs quickly fail several statistical tests, so pcg32 uses a bit more scrambling, but it's still not the strongest permutation from the PCG paper. As I mentioned in the pull-request thread, I've begun to lean towards a stronger PCG permutation (RXS M) for the pcg32 use case. It's not yet the default; you have to ask for that version explicitly, but there's a good chance I'll switch the default over when I do a major version bump for PCG. (RXS M is the high half of RXS M XS, which Vigna has tested extensively at this size, and it is also a permutation that David Blackman likes.)

We can visualize the difference with an updated version of the near-streams test program that uses both schemes for pcg32 (XSH RR and RXS M [and the raw underlying LCG, too]):

#include "pcg_random.hpp"
#include <iostream>
#include <random>

// Create a "PCG" variant with a trivial output function, just truncation

template <typename xtype, typename itype>
struct truncate_only_mixin {
    static xtype output(itype internal)
    {
        constexpr size_t bits = sizeof(itype) * 8;
        return internal >> (bits/2);   // keep the high half of the state (>> 32 for a 64-bit state)
    }
};

using lcg32 = pcg_detail::setseq_base<uint32_t, uint64_t, truncate_only_mixin>;

int main() {
    std::random_device rdev;
    uint64_t seed = 0;
    uint64_t stream = 0;
    for (int i = 0; i < 2; ++i) {
        seed   <<= 32;           
        seed   |= rdev();
        stream <<= 32;           
        stream |= rdev();
    }
    lcg32 rng1(seed,stream);
    lcg32 rng2(-seed,-1-stream);
    // pcg32 rng1(seed,stream);
    // pcg32 rng2(-seed,-1-stream);
    // pcg_engines::setseq_rxs_m_64_32 rng1(seed,stream);
    // pcg_engines::setseq_rxs_m_64_32 rng2(-seed,-1-stream);
    std::cerr << "RNG1: " << rng1 << "\n";
    std::cerr << "RNG2: " << rng2 << "\n";
    std::cout.precision(17);
    for (int i = 0; i < 10000; ++i) {
        std::cout << rng1()/4294967296.0 << "\t";
        std::cout << rng2()/4294967296.0 << "\n";
    }
}

Before we begin, let's look at the graph @mdickinson drew but for just a LCG with no permutation, just truncation:

corr-truncated-lcg

Note that this is for a pathological LCG with states that correlate. If instead we'd just picked two LCGs with randomly chosen additive constants (but the same start value), it'd look like this:

corr-truncated-lcg-good

Moving on to PCG's output functions, if we use XSH RR on the pathological case it looks like this — it's a big improvement on the graph above but clearly it's not fully obscuring the horribleness:

corr-pcg32-current

and this is RXS M with the same underlying (badly correlated) LCG pair:

corr-pcg32-future

But this is only something I'm mulling for pcg32. The performance penalty is tiny, and pcg32 is small enough that I can imagine some heavy user being worried about creating a ton of random-seeded pcg32 generators, asking for a ton of numbers from them, and having a not-infinitesimal-enough chance of correlations. I am, frankly, in two minds about it though, because it's not clear that this mythical power user would be using pcg32 in the first place.

One reason I'm not too bothered about making pcg64's streams more independent is that I'm not sure I see a use case where it would be sensible to keep all other state the same and switch the stream to a different one (e.g., to a random value, let alone to one nearby). For pretty much all PRNGs, the right way to make a second one is to initialize it with fresh entropy.

In conclusion, for NumPy, I think it may make the most sense to just consider that PCG64 wants 256 bits of seed state (technically it's 255, since the high bit of the stream is ignored) and call it done. That will also avoid API issues, because it'll be one less feature people will have in one BitGenerator and not in another.

(But you might want to switch the 32-bit PCG variant to the RXS M one. For the C source, you need a recent version, as I originally didn't bother providing RXS M explicitly in the C code, only making it available in the C++ incarnation.)

[Sorry if this is more than you ever wanted to know! Well, not _that_ sorry. ;-)]

One reason I'm not too bothered about making pcg64's streams more independent is that I'm not sure I see a use case where it would be sensible to keep all other state the same and switch the stream to a different one (e.g., to a random value, let alone to one nearby). For pretty much all PRNGs, the right way to make a second one is to initialize it with fresh entropy.

I've described the use case earlier. There are strong UX reasons to write a stochastic program that accepts a single shortish "seed" input (i.e. something about the size they can copy-paste from an email onto a command-line) that then makes the output of the program deterministic. @stevenjkern noted to me in an offline conversation that that kind of interaction was essential to working with regulatory agencies that had to validate his software. If you had to use the file _output_ of a run of a program to replicate the result, that looks a little suspicious in such circumstances. The regulator would have to do a deep dive on the code (that may not actually be available to them) to ensure that the information in the file was really kosher.

We now have good tools in Python for spinning up N parallel processes dynamically to do some chunk of the work then gather the results and move on in the main process (then spin up M processes later, etc.). Unlike older, less flexible schemes like MPI, we don't just spin up a fixed N processes at the beginning. In those cases, I could see entropy-seeding each of the N PRNGs and saving them out to a file because there's just one place in the program that does that. At least from a programming perspective, that's not too hard. We have much more flexible tools for parallelism now. Seeding the PRNGs is now the bottleneck preventing us from using that flexibility in stochastic programs. There's no longer a single point of responsibility where we can put the file-based book-keeping.

The need to reproducibly derive N streams is strong enough that people will do weird things to get it with our current MT algorithm. I've had to shoot down a bunch of risky schemes, and I was hoping that PCG's streams would help us get there.

What do you think about using a good bijective hash over 2**63/2**127 to derive increments from a counter sequence 0,1,2,3,... while keeping the state the same? Do you foresee problems with that? What do you think about combining the hashed increment followed by a large jumpahead to move the state into a far part of the new cycle? Maybe we can move this sub-discussion to email or another issue and report back.

@rkern, there may be something that feels nice about being able to give very short seeds to PRNGs, but (regardless of the PRNG) it's a terrible idea. If you provide _k_ bits of seed input, and then demand _k_ bits (either immediately, or skip _j_ bits first [for some arbitrary _j_ ] and then read _k_ bits), even though all 2^_k_ integers are valid inputs, not all 2^_k_ outputs can be observed (because the function from bits in to bits out is not guaranteed to be [and really can't be] a bijection). The expected distribution of output counts is Binomial(2^k, 2^-k), which we can approximate as a Poisson distribution, and thus about 2^k/e values won't be observed at all. This is true no matter what the PRNG is. All we can do is have _k_ be large enough that it's utterly impractical to figure out what's missing.
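
A small-scale illustration of that argument, as a sketch (k = 16 so it runs quickly; the missing fraction should come out near 1/e ≈ 0.37):

import math
import random

k = 16
# map each k-bit seed to a k-bit output and count the outputs that never appear
seen = {random.Random(seed).getrandbits(k) for seed in range(2**k)}
missing_fraction = 1 - len(seen) / 2**k
print(missing_fraction, 1 / math.e)   # both should be roughly 0.37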

The problem is compounded when everyone is, say, using the same PRNG (e.g. the Mersenne Twister) and picking seeds from the same small set (e.g., numbers less than 10000), because rather than an arbitrary, particular, per-program bias, it is a _for-everyone doing this_ bias. For example, let's suppose you pick a four-digit seed and then grab a reasonable number of numbers out of the Mersenne Twister (say, less than a million). In that situation, I can assure you that the unlucky number 13 _will never show up_ as any of the 10 billion outputs (in fact about 10% of 32-bit integers will be absent), and the number 123580738 is overrepresented by a factor of 16. This is exactly what we'd expect for a random sample of ten billion 32-bit integers, but it's a real problem _if everyone is using the same sample_. We would have an exactly analogous problem if everyone picks nine-digit seeds and only draws 10000 numbers.

The fact that lots of people want to do something doesn't make it a good idea. (That doesn't mean it's okay to merely tell people they're doing it wrong or that they want the wrong thing. You have to figure out what they actually need, e.g., reproducible results from a short-ish command-line argument. Possibly the right thing is to allow seeding from a UUID and a small integer; some ideas for how to scramble these things nicely to make seed data can be found in this blog post and made their way into randutils.)

(Here's the code to play with, since it's fairly short...)

// mtbias.cpp -- warning, uses 4GB of RAM, runs for a few minutes
// note: this is *not* showing a problem with the Mersenne Twister per se, it is
// showing a problem with simplistic seeding

#include <vector>
#include <iostream>
#include <random>
#include <cstdint>

int main() {
    std::vector<uint8_t> counts(size_t(std::mt19937::max()) + 1);
    for (size_t seed=0; seed < 10000; ++seed) {
        std::mt19937 rng(seed);
        for (unsigned i = 0; i < 1000000; ++i) {
            ++counts[rng()];
        }
    }
    size_t shown = 0;
    std::cout << "Never occurring: ";
    for (size_t i = 0; i <= std::mt19937::max(); ++i) {
        if (counts[i] == 0) {
            std::cout << i << ", ";
            if (++shown >= 20) {
                std::cout << "...";
                break;
            }
        }
    }
    std::cout << "\nMost overrepresented: ";
    size_t highrep_count = 0;
    size_t highrep_n = 0;
    for (size_t i = 0; i <= std::mt19937::max(); ++i) {
        if (counts[i] > highrep_count) {
            highrep_n = i;
            highrep_count = counts[i];
        }
    }
    std::cout << highrep_n << " -- repeated " << highrep_count << " times\n";
}

As I've said before, I think 128-bit seeds are short enough for that purpose, and I can build the tooling to help people write programs that do the right thing. Namely, entropy-sample them by default, printing them out or otherwise logging them, then allowing it to be passed in later. Your recommendation of generating a UUID for each program and mixing in a possibly smaller user-provided seed per run is also a good one.
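
One possible shape for that tooling, as an illustrative sketch only (the helper name here is made up): sample a 128-bit seed by default, log it so a run can be replayed, and accept an explicit seed when reproducing.

import secrets

def get_seed(user_seed=None):
    seed = secrets.randbits(128) if user_seed is None else user_seed
    print(f"seed = {seed:#034x}")   # log it; pass this value back in to reproduce the run
    return seed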

Let's assume that I can get people to use good 128-bit seeds for the state part of PCG64, one way or the other. Do you have any comments on deriving streams from the same state? I'm not looking to draw more numbers, overall, than we would from a single PCG64 stream. I just want to be able to draw these numbers in different processes without coordination on each draw. Using an ad hoc 63-bit multiply-xorshift hash does seem to work pretty well so far (I'm at 32 GiB right now), for 8192 streams interleaved.

@imneme

It might be helpful to define what we mean by short.

I think that @rkern wrote "something about the size they can copy-paste from an email onto a command-line". I can represent pretty big numbers in a few characters like 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff.

I just want to be able to draw these numbers in different processes without coordination on each draw. Using an ad hoc 63-bit multiply-xorshift hash does seem to work pretty well so far (I'm at 32 GiB right now), for 8192 streams interleaved.

Have you tried interleaving n streams using a single quality seed by advancing the state by some large enough number (say 2**64, which is what is used by PCG64.jumped)? This seems like the simplest way to loosely coordinate large n streams across a cluster using something like PCG64(seed).jumped(node_id)

where node_id is 0,1,2,...

Is there a PRNG that is really good at producing simple independent streams using something like an index? I believe that a MLFG can, but I didn't like this since it was a 63-bit generator.

@bashtage, that really isn't the right way to do it. The right way is to take a seed and, if you want to add in a small integer, use a hash function to hash it in. As mentioned earlier, I have previously (independent of PCG) written a serious mixing function [edit: fix link to the correct post] to mix various kinds of entropy, large and small. You don't have to use mine, but I'd recommend you do something along those lines.

Ideally, you want a mechanism that is not specific to PCG. PCG may not be your default choice, and even if it were, you want people to do similar things with all generators. I don't think you should want a scheme for making several independent PRNGs that is dependent on streams or jump-ahead.

(oops, I linked to the wrong blog post; I've edited the previous message, but in case you're reading via email, I meant to link to this blog post)

@imneme Right now, all of the generators we have support a jump (some of which are really advance-type calls). I have no doubt that careful seeding is a good idea, but I suspect that many users will be tempted to use the PRNG.jumped() call. Is this something that should be dissuaded?

As for seeding, the MT generators all make use of the author's init routines, PCG uses yours, and the remainder do something like

seed = np.empty(required_size, dtype=np.uint64)
last = 0
for i in range(required_size):
    if i < len(user_seed):
        last = seed[i] = splitmix64(last ^ user_seed[i])
    else:
        last = seed[i] = splitmix64(last)

I imagine this could be improved.
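
For reference, a Python rendering of the splitmix64 mixing step referenced above (the constants are the standard splitmix64 ones; treat this as a sketch of the idea rather than the exact routine used):

MASK64 = (1 << 64) - 1

def splitmix64(x):
    # add the golden-ratio increment, then run the standard finalizer
    z = (x + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)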

Is this something that should be dissuaded?

I hadn't seen jumped. It's pretty horrible for the underlying LCG.

Suppose we have a multiplier, M, of 0x96704a6bb5d2c4fb3aa645df0540268d. If we calculate M^(2^64), we get 0x6147671fb92252440000000000000001 which is a terrible LCG multiplier. Thus if you took every 2^64th item from a 128-bit LCG it would be terrible (the low order bits are just a counter). PCG's standard permutation functions are designed to scramble the normal output of an LCG, not to scramble counters.
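
Checking that is a one-liner with modular exponentiation (a sketch; the expected value is the one quoted above):

M = 0x96704a6bb5d2c4fb3aa645df0540268d
print(hex(pow(M, 2**64, 2**128)))   # expect 0x6147671fb92252440000000000000001, per the text above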

PCG64 is currently tested up to half a petabyte with PractRand, and further analysis shows that you can read many petabytes without issues related to powers of two. Alas, if you jump ahead to skip forward huge exact powers of two, PCG's usual (somewhat modest) permutations cannot sufficiently compensate for the pathological sequence that results from skipping the underlying LCG huge distances like this. You can up the permutation-strength ante to fix this, and in fact both I and Vigna independently compared PCG's standard permutations against off-the-shelf integer hash functions which would likely do so (after all, they're the foundation of SplitMix, which _is_ just a counter). When I looked into it in 2014 with Fast Hash the speed didn't seem so great, but when Vigna did it more recently with murmurhash, he claimed the performance beat standard PCG (!).

If you really want to have a 2^64 jump-ahead, I think you need to switch to a stronger output function permutation (which as we've seen may be done at low cost). But if you feel that makes it not really “standard” PCG any more and want to keep the usual output permutation, then jumped() probably needs to go.

(BTW, pathological jump-ahead applies to other PRNGs, too. SplitMix is known to have some bad increments, and it's reasonable to assume that although the usual increment (a.k.a. "gamma") of 0xbd24b73a95fb84d9 is fine, advancing by 2^32 will give you an increment of 0x95fb84d900000000, which isn't so good. For LFSRs, the bad jump-ahead probably isn't a power of two, but I'm fairly sure there will be jumps where the underlying matrix ends up pathologically sparse.)

I can confirm, at least with PCG32, 4 interleaved streams using .jumped() fails very quickly.

❯ ./pcg_streams.py --jumped -n 4 | time ./RNG_test stdin32
[
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 10149010587776656704,
            "inc": 2891336453
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 1158608670957446464,
            "inc": 2891336453
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 10614950827847787840,
            "inc": 2891336453
        }
    },
    {
        "bit_generator": "PCG32",
        "state": {
            "state": 1624548911028577600,
            "inc": 2891336453
        }
    }
]
RNG_test using PractRand version 0.93
RNG = RNG_stdin32, seed = 0xeedd49a8
test set = normal, folding = standard (32 bit)

rng=RNG_stdin32, seed=0xeedd49a8
length= 128 megabytes (2^27 bytes), time= 2.1 seconds
  Test Name                         Raw       Processed     Evaluation
  BCFN(2+0,13-3,T)                  R= +58.7  p =  1.3e-27    FAIL !!!       
  BCFN(2+1,13-3,T)                  R= +48.0  p =  1.5e-22    FAIL !!        
  BCFN(2+2,13-3,T)                  R= +16.0  p =  2.3e-7   very suspicious  
  DC6-9x1Bytes-1                    R= +53.5  p =  1.8e-32    FAIL !!!       
  [Low8/32]DC6-9x1Bytes-1           R= +27.4  p =  1.1e-17    FAIL !         
  ...and 112 test result(s) without anomalies

Ideally, you want a mechanism that is not specific to PCG. PCG may not be your default choice, and even if it were, you want people to do similar things with all generators.

Well, that's what we are trying to decide here. :-) We were reasonably content with simply exposing whatever features each PRNG provided, on the assumption that the properties exposed by each algorithm are well studied. We're also reasonably content with saying "here's the default PRNG we recommend; it has a bunch of features that are useful; the others may not have them".

The notion of using a hash to derive a new state from a given state and a stream ID, for any algorithm, is interesting. Do you know how well studied that is? It sounds like a research problem to verify that it does work well for all of the algorithms. I'd hesitate to claim "here's the general procedure to derive independent streams for all of our PRNGs". I'm more content with "here's a common API for deriving independent streams; each PRNG implements it in whatever way is appropriate for the algorithm and may not implement it if the algorithm doesn't support it well".

On the other hand, if it's just a matter of allocating enough CPU cycles to test interleaved streams for each BitGenerator out to N GiB on PractRand, that's not too onerous.

The notion of using a hash to derive a new state from a given state and a stream ID, for any algorithm, is interesting. Do you know how well studied that is? It sounds like a research problem to verify that it does work well for all of the algorithms. I'd hesitate to claim "here's the general procedure to derive independent streams for all of our PRNGs". I'm more content with "here's a common API for deriving independent streams; each PRNG implements it in whatever way is appropriate for the algorithm and may not implement it if the algorithm doesn't support it well".

I don't know that you can call it “a research problem” (and thus “well studied”) exactly, but C++11 (which was mostly about using tried-and-true, long-established techniques) provides the _SeedSequence_ concept (and a specific std::seed_seq implementation) whose job is to provide seeding data to completely arbitrary PRNGs.

In general, almost all PRNGs expect to be initialized/seeded with random bits. There is nothing especially magical about random bits coming out of (say) random.org versus random bits coming out of something more algorithmic (CS PRNG, hash function, etc.).

It's fairly straightforward to think about a collection of PRNGs from the same scheme all seeded with their own random bits. You can think of what we're doing as picking points (or actually intervals up to a certain maximum length corresponding to the amount of random numbers we ever expect to plausibly ask for, e.g., 2^56) on a line (e.g., a line with 2^255 points). We can calculate the probability that if we ask for _n_ intervals that one will overlap with another. It's fairly basic probability — I'm not sure you could get a paper published about it because (as I understand it) no one is ever excited about papers containing elementary math. (@lemire might disagree!)
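
The back-of-the-envelope version of that calculation, as a sketch: with n streams, each consuming at most L draws, seeded at uniformly random points on a cycle of P points, the chance that any two intervals overlap is roughly n**2 * L / P. (The stream count below is a made-up example; L and P are the figures mentioned above.)

n = 2**20    # hypothetical number of streams, about a million
L = 2**56    # maximum draws per stream, as in the example above
P = 2**255   # number of points on the line, as in the example above
print(n**2 * L / P)   # about 1.4e-48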

[I'd argue that what you generally should _not_ do is seed a PRNG with random bits coming out of itself. That feels far too incestuous to me.]

Right, it's clear to me that using something like a well-designed _SeedSequence_ would be a way to take an arbitrary initial seed and draw multiple starting points in our algorithm's cycle that ought not to overlap. And if that's the only real way to get independent streams, so be it. It'll be a matter of API design to make that convenient.

What I'm less clear on is how safe it is to take the current state of an initialized PRNG, hash-mix in the stream ID to jump to a new state in the cycle, which is what I thought you were suggesting (and it occurred to me later that I might have been wrong about that). Being well-separated in the cycle isn't the only factor, as the failure of jumped() shows. jumped() also assures that you are being sent to a far part of the sequence that won't overlap; it's just a part that can correlate very strongly with the initial part if the jump is not well-chosen. It can take some knowledge of the internals of each algorithm to know what is and isn't a good jump. We obviously did not for the case of PCG.

Fundamentally, if we think of PRNGs as transition functions and output functions, this new_state = seed_seq(old_state||streamID) is just another transition function that we are dropping in for one step. We have to be sure that the operations involved in that seed_seq are different enough from the transition functions in each PRNG algorithm (or their inverses), and maybe have to be sure of other things. I wouldn't want to use something built from, say, wyhash to initialize wyrand. As you say, you don't want to use the PRNG itself to provide the bits for itself. That's why I think it needs some study to assure that it's safe for all of our PRNGs (study that I was hoping not to have to do myself).

On the other hand, new_state = seed_seq(old_state||streamID) is probably no worse off in this regard than the intended use of _SeedSequence_ for multiple streams: pull out two states in sequence. If so, then I'd be okay resting on C++'s experience, maybe with your implementation, and just doing some empirical tests with PractRand for all of our algorithms to show that they aren't worse off than their single-stream counterparts.

It would be really nice to get the hash-jump working because that opens up some use cases for spawning off PRNGs in a coordination-free manner. Using stream IDs requires some communication or pre-allocation. dask has asked for something like this in the past.

If there are good alternatives that only rely on good seeding that we can make convenient to do the right thing, then we should probably remove settable streams as criterion for the selection of the default. We just want a default algorithm that has a sufficiently large state space.

All that said, it does look like using a 63-bit hash to derive the PCG32 increment from sequential stream IDs (range(N)) works: 8192 interleaved streams pass PractRand out to 2 TiB. If we do expose stream IDs for the PCG generators, we may want to use this technique to derive the increments, even if we suggest people use other means to reproducibly get independent streams.

What I'm less clear on is how safe it is to take the current state of an initialized PRNG, hash-mix in the stream ID to jump to a new state in the cycle, which is what I thought you were suggesting (and it occurred to me later that I might have been wrong about that).

I probably did express myself ambiguously, but no, it was never my intent to suggest that the current state of the PRNG should be used in self-reseeding.

But, FWIW, SplitMix does this; it's what its split() operation does. And I don't like that it does it.

This might be too much information, but I'll share a little about why I was horrified (perhaps more horrified than I should be) by SplitMix's split() function. As a historical note, SplitMix and PCG were designed independently around the same time (SplitMix was published on October 20, 2014, whereas pcg-random.org went live in August 2014 and linked to the PCG paper on September 5, 2014). There are some parallels between PCG and SplitMix (and various other PRNGs, including Vigna's xorshift* and xorshift+ — also released to the world in 2014). All have fairly simple not-quite-good-enough state transition functions fixed up by a scrambling output function. When I was writing the PCG paper, one thing I knew some people would like was a split() function, but I couldn't figure out a good way to do it; instead I developed a quick proof that if you had a _k_-bit PRNG where you could go left or right at each step, within _k_ steps you must be able to arrive at a state you've been in before, thus proving the whole concept was ill-conceived. That observation didn't make it into the paper. But as a result of my pondering ideas before that proof, in a footnote in a near-final draft of the paper, I suggested, somewhat whimsically, that because PCG's output was a hash/scramble/permutation of its state, if you were feeling naughty you could reseed the generator with its own output and get away with it. I took it out of the final version because I thought that such whimsy would be too big of a red flag for reviewers, given that reseeding a PRNG with its own state was widely held to be the kind of misuse of a PRNG we see from people inexperienced in using them.

Reading the SplitMix paper, I found much to like, but I was very taken aback when I saw split(). It did something that I basically considered only as a joke and made it a tentpole feature. It wasn't until a few years later I got around to writing in more technical depth about what goes on when you have this kind of operation.

The overall take-away is that if you have a big enough state space (and SplitMix's is barely enough), you might be able to get away with self reseeding via a hash function. I still feel that this is not a good idea. Because random mappings (which is what we're dealing with in this situation) have properties like “with non zero asymptotic probability, the tallest tree in a functional graph is not rooted on the longest cycle”, I claim it's hard to have full confidence unless the designer has gone to the necessary work to show that such pathologies are not present in their design.

For fun, here's a dump of the state-space of a tiny version of SplitMix, exploring just a few different (and rigidly fixed) ways of combining next() and split():

Testing: SplitMix16: void advance() { rng = rng.split();}

Finding cycles...
- state 00000000 -> new cycle 1, size 4, at 000043b0 after 516 steps
- state 00000050 -> new cycle 2, size 41, at 00002103 after 2 steps
- state 000000cd -> new cycle 3, size 4, at 0000681a after 6 steps
- state 00000141 -> new cycle 4, size 23, at 00004001 after 11 steps
- state 00000dee -> new cycle 5, size 7, at 00007436 after 4 steps
- state 00008000 -> new cycle 6, size 90278, at 5e5ce38c after 46472 steps
- state 00030000 -> new cycle 7, size 6572, at 12c65374 after 10187 steps
- state 00030016 -> new cycle 8, size 3286, at 65d0fc0c after 402 steps
- state 00058000 -> new cycle 9, size 17097, at 2a2951fb after 31983 steps
- state 08040000 -> new cycle 10, size 36, at 08040000 after 0 steps
- state 08040001 -> new cycle 11, size 218, at 08040740 after 360 steps
- state 08040004 -> new cycle 12, size 10, at 38c01b3d after 107 steps
- state 08040006 -> new cycle 13, size 62, at 38c013a0 after 39 steps
- state 08040009 -> new cycle 14, size 124, at 08045259 after 24 steps
- state 08040019 -> new cycle 15, size 32, at 38c06c63 after 151 steps
- state 08040059 -> new cycle 16, size 34, at 38c00217 after 17 steps
- state 08040243 -> new cycle 17, size 16, at 38c06e36 after 13 steps
- state 123c8000 -> new cycle 18, size 684, at 77d9595f after 194 steps
- state 123c8002 -> new cycle 19, size 336, at 5de8164d after 141 steps
- state 123c9535 -> new cycle 20, size 12, at 123c9535 after 0 steps
- state 139f0000 -> new cycle 21, size 545, at 743e3a31 after 474 steps
- state 139f0b35 -> new cycle 22, size 5, at 139f0b35 after 0 steps
- state 139f1b35 -> new cycle 23, size 5, at 68d3c943 after 8 steps

Cycle Summary:
- Cycle 1, Period 4, Feeders 32095
- Cycle 2, Period 41, Feeders 188
- Cycle 3, Period 4, Feeders 214
- Cycle 4, Period 23, Feeders 180
- Cycle 5, Period 7, Feeders 12
- Cycle 6, Period 90278, Feeders 1479024474
- Cycle 7, Period 6572, Feeders 102385385
- Cycle 8, Period 3286, Feeders 5280405
- Cycle 9, Period 17097, Feeders 560217399
- Cycle 10, Period 36, Feeders 413
- Cycle 11, Period 218, Feeders 51390
- Cycle 12, Period 10, Feeders 1080
- Cycle 13, Period 62, Feeders 4113
- Cycle 14, Period 124, Feeders 4809
- Cycle 15, Period 32, Feeders 2567
- Cycle 16, Period 34, Feeders 545
- Cycle 17, Period 16, Feeders 87
- Cycle 18, Period 684, Feeders 95306
- Cycle 19, Period 336, Feeders 100263
- Cycle 20, Period 12, Feeders 7
- Cycle 21, Period 545, Feeders 163239
- Cycle 22, Period 5, Feeders 12
- Cycle 23, Period 5, Feeders 34

- Histogram of indegrees of all 2147483648 nodes:
      0  529334272
      1 1089077248
      2  528875520
      3     131072
      4      65536
Testing: SplitMix16: void advance() { rng.next(); rng = rng.split();}

Finding cycles...
- state 00000000 -> new cycle 1, size 36174, at 6b34fe8b after 21045 steps
- state 00000002 -> new cycle 2, size 4300, at 042a7c6b after 51287 steps
- state 0000000f -> new cycle 3, size 11050, at 0b471eb5 after 4832 steps
- state 0000001d -> new cycle 4, size 38804, at 2879c05c after 16280 steps
- state 00000020 -> new cycle 5, size 4606, at 46e0bdf6 after 7379 steps
- state 00046307 -> new cycle 6, size 137, at 0a180f87 after 89 steps
- state 00081c25 -> new cycle 7, size 16, at 177ed4d8 after 27 steps
- state 0044c604 -> new cycle 8, size 140, at 5e1f125b after 44 steps
- state 006e329f -> new cycle 9, size 18, at 006e329f after 0 steps
- state 13ebcefc -> new cycle 10, size 10, at 13ebcefc after 0 steps

Cycle Summary:
- Cycle 1, Period 36174, Feeders 975695553
- Cycle 2, Period 4300, Feeders 766130785
- Cycle 3, Period 11050, Feeders 110698235
- Cycle 4, Period 38804, Feeders 251133911
- Cycle 5, Period 4606, Feeders 43723200
- Cycle 6, Period 137, Feeders 4101
- Cycle 7, Period 16, Feeders 172
- Cycle 8, Period 140, Feeders 2310
- Cycle 9, Period 18, Feeders 124
- Cycle 10, Period 10, Feeders 2

- Histogram of indegrees of all 2147483648 nodes:
      0  529334272
      1 1089077248
      2  528875520
      3     131072
      4      65536
Testing: SplitMix16: void advance() { rng.next(); rng = rng.split(); rng = rng.split();}

Finding cycles...
- state 00000000 -> new cycle 1, size 40959, at 0069b555 after 49520 steps
- state 00000031 -> new cycle 2, size 1436, at 5f619520 after 2229 steps
- state 000003a4 -> new cycle 3, size 878, at 18d1cb99 after 1620 steps
- state 0000046c -> new cycle 4, size 2596, at 46ba79c0 after 1591 steps
- state 0000c6e2 -> new cycle 5, size 24, at 0212f11b after 179 steps
- state 000af7c9 -> new cycle 6, size 61, at 40684560 after 14 steps
- state 00154c16 -> new cycle 7, size 110, at 29e067ce after 12 steps
- state 0986e055 -> new cycle 8, size 4, at 2b701c82 after 7 steps
- state 09e73c93 -> new cycle 9, size 3, at 352aab83 after 1 steps
- state 19dda2c0 -> new cycle 10, size 1, at 78825f1b after 2 steps

Cycle Summary:
- Cycle 1, Period 40959, Feeders 2129209855
- Cycle 2, Period 1436, Feeders 5125630
- Cycle 3, Period 878, Feeders 7077139
- Cycle 4, Period 2596, Feeders 5997555
- Cycle 5, Period 24, Feeders 24221
- Cycle 6, Period 61, Feeders 1774
- Cycle 7, Period 110, Feeders 1372
- Cycle 8, Period 4, Feeders 23
- Cycle 9, Period 3, Feeders 4
- Cycle 10, Period 1, Feeders 3

- Histogram of indegrees of all 2147483648 nodes:
      0  829903716
      1  684575196
      2  468475086
      3  132259769
      4   32192209
      5      58402
      6      17026
      7       1982
      8        261
      9          1
Testing: SplitMix16: void advance() { rng.next(); rng.next(); rng = rng.split();}

Finding cycles...
- state 00000000 -> new cycle 1, size 55038, at 3e57af06 after 30005 steps
- state 00000005 -> new cycle 2, size 376, at 4979e8b5 after 6135 steps
- state 0000001e -> new cycle 3, size 10261, at 0cd55c94 after 1837 steps
- state 0000002d -> new cycle 4, size 3778, at 7f5f6afe after 3781 steps
- state 00000064 -> new cycle 5, size 2596, at 3bc5404b after 5124 steps
- state 0000012b -> new cycle 6, size 4210, at 525cc9f3 after 397 steps
- state 00000277 -> new cycle 7, size 1580, at 410010c8 after 1113 steps
- state 00001394 -> new cycle 8, size 916, at 7b20dfb0 after 193 steps
- state 00063c2d -> new cycle 9, size 51, at 6e92350b after 121 steps
- state 058426a6 -> new cycle 10, size 8, at 058426a6 after 0 steps
- state 0e5d412d -> new cycle 11, size 1, at 0e5d412d after 0 steps
- state 4c2556c2 -> new cycle 12, size 1, at 4c2556c2 after 0 steps

Cycle Summary:
- Cycle 1, Period 55038, Feeders 2027042770
- Cycle 2, Period 376, Feeders 28715945
- Cycle 3, Period 10261, Feeders 49621538
- Cycle 4, Period 3778, Feeders 13709744
- Cycle 5, Period 2596, Feeders 15367156
- Cycle 6, Period 4210, Feeders 10418779
- Cycle 7, Period 1580, Feeders 1782252
- Cycle 8, Period 916, Feeders 744273
- Cycle 9, Period 51, Feeders 2351
- Cycle 10, Period 8, Feeders 24
- Cycle 11, Period 1, Feeders 0
- Cycle 12, Period 1, Feeders 0

- Histogram of indegrees of all 2147483648 nodes:
      0  529334272
      1 1089077248
      2  528875520
      3     131072
      4      65536

etc.

Ah, great. I shot down a couple of proposals along those lines myself a few years ago, so I was afraid that we missed out on something good. :-)

Any final thoughts on the hash approach for deriving increments for PCG streams? Setting aside whether that's the main mechanism for getting independent streams. Short of removing access to that feature altogether, that seems like something we'd want to do to prevent the easy-to-misuse sequential stream IDs.

Out of curiosity, is there any (easy) way to tell how far apart two states are in PCG64?

Out of curiosity, is there any (easy) way to tell how far apart two states are in PCG64?

Yes, though we don't expose it: http://www.pcg-random.org/useful-features.html#distance

In the C++ source, the distance function will even tell you the distance between streams, giving their point of closest approach (where the only difference between the streams is an added constant).

Incidentally, for the underlying LCG, we can use the distance to work out how correlated we expect the two positions to be. A short distance is obviously bad (and is bad for any PRNG at all), but a distance with just one bit set isn't great either, which is why jumping ahead by 2^64 (0x10000000000000000) with .jumped is a bad idea. On my PCG to-do list is writing an “independence_score” function that looks at the distance between two states and tells you how random-looking the distance is (via hamming weight, etc — ideally we want about half the bits to be zeros and half ones, and them to be liberally scattered).

One way to _keep_ jumped with PCG64 would be to not jump by n * 0x10000000000000000 but instead jump by n * 0x9e3779b97f4a7c150000000000000000 (truncated to 128 bits). This will give you all the usual properties you'd want (.jumped(3).jumped(5) == .jumped(8)) without being pathological for the underlying LCG.

(I'm also aware that saying “don't advance by 0x10000000000000000” is something of a “well, don't hold it that way” response, and I'm not very satisfied with it. Sure, it's cool that independence_score can exist, but this whole thing (and the issue with similar streams) may argue for a stronger default output function, so that even if people do (rare) things that are utterly pathological for the underlying LCG, no harm will be done. PCG is coming up on being five years old at this point, and I am considering a version bump and tweaks this summer, so this issue may make the list. Of course, it may annoy you folks if just as you put PCG in, I make a major version bump and improve it.)

Any final thoughts on the hash approach for deriving increments for PCG streams? Setting aside whether that's the main mechanism for getting independent streams. Short of removing access to that feature altogether, that seems like something we'd want to do to prevent the easy-to-misuse sequential stream IDs.

I'd recommend you put it through the Murmur3 mixer. No one is likely to accidentally make similar streams with that without deliberate effort. (Edit: I guess you need a 128-bit version, but you could just mix the top and bottom halves. I'd also add a constant, too. Everyone loves 0x9e3779b97f4a7c15f39cc0605cedc835 (fractional part of ϕ) but 0xb7e151628aed2a6abf7158809cf4f3c7 (fractional part of e) would be fine too, or _any_ random-looking number.)
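
For concreteness, here is the standard 64-bit Murmur3 finalizer (“fmix64”); a 128-bit stream ID could be run through it one 64-bit half at a time, with a constant added, as suggested above. This is just a sketch of the idea, not necessarily the scheme that was ultimately adopted.

MASK64 = (1 << 64) - 1

def fmix64(x):
    # standard MurmurHash3 64-bit finalizer; a bijection on 64-bit values
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & MASK64
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & MASK64
    x ^= x >> 33
    return x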

I recommend wyhash (https://github.com/wangyi-fudan/wyhash) as it is the fastest and simplest one that passed BigCrush and PractRand. The C code is as simple as

inline  uint64_t    wyrand(uint64_t *seed){    
    *seed+=0xa0761d6478bd642full;    
    __uint128_t t=(__uint128_t)(*seed^0xe7037ed1a0b428dbull)*(*seed);    
    return  (t>>64)^t;    
}

@wangyi-fudan, I can't convince myself that this is a bijection.

Sorry for my limited knowledge: why is a bijection necessary/favored for a PRNG?
Some explanation would be appreciated :-) @imneme

@wangyi-fudan, if a hash function from 64-bit ints to 64-bit ints is not a bijection (i.e., a 1-to-1 function) then some results are generated more than once and some not at all. That is a kind of bias.

I understand what you mean. However, for a 64-bit random number generator R, we expect one collision after about 1.2*2^32 random numbers (http://mathworld.wolfram.com/BirthdayAttack.html). With 2^64 random numbers it is natural to have many collisions. Collisions are natural, while a bijection is not naturally random. If I knew that a gambling table (e.g. a 3-bit PRNG) was guaranteed to produce a 0 value within 8 trials, I would dare to make a big bet on zero after observing 5 non-zero values.
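
That 1.2*2^32 figure is just the birthday bound, roughly sqrt(pi/2 * 2^64); a one-line check:

import math
print(math.sqrt(math.pi / 2 * 2**64) / 2**32)   # about 1.25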

@wangyi-fudan, In this context, we were talking about ways to permute the stream-id so that streams like 1,2,3 become something more random looking (a.k.a. more normal). There is no virtue in collisions in this process.

For PRNGs in general, you should read up on the difference between PRNGs based on random mappings and ones based on random invertible mappings (1-to-1 functions). I've written about it, but so have others. At small sizes, PRNGs based on random mappings will show bias and fail more quickly than ones based on other techniques. At large sizes, flaws of all kinds may be harder to detect.

We can calculate the probability that if we ask for _n_ intervals that one will overlap with another. It's fairly basic probability — I'm not sure you could get a paper published about it because (as I understand it) no one is ever excited about papers containing elementary math.

I think you just have to be Pierre L'Ecuyer. ;-) page 15

Yes, when he explains the basics it's considered okay!

@rkern @imneme Simplicity is a feature, both in software and in mathematics. That some are unimpressed by simple work should not be taken as contradictory evidence.

@lemire: There's a humor piece I like that I think has a lot of truth to it called _How To Criticize Computer Scientists_. The underlying idea behind the piece is that theorists favor sophistication and experimentalists favor simplicity. So if your audience is one of experimentalists, they'll be delighted by simplicity, but if your audience is one of theorists, not so much.

The default BitGenerator is PCG64. Thank you all for your thoughtful contributions. And stamina!

Very much inspired by this thread, I have some news to report…

Background

By many measures pcg64 is pretty good; for example, under the usual measures of statistical quality, it gets a clean bill of health. It's been tested in various ways; most recently I've run it all the way out to half a petabyte with PractRand. It works well in normal use cases.

BUT, the pathologies that came up in this thread didn't sit well with me. Sure, I could say “well, don't hold it that way”, but the whole point of a general purpose PRNG is that it ought to robust. I wanted to do better...

So, about 25 days ago I began thinking about designing a new member of the PCG family…

Goal

My goal was to design a new PCG family member that could be a drop in replacement for the current pcg64 variant. As such:

  • The output function should scramble the bits more than XSL RR (because doing so will avoid the issues that came up in this thread).
  • The performance should be about as fast (or faster) than the current pcg64.
  • The design must be PCG-ish (i.e., don't be trivially predictable, and thus don't allow _any_ of the work of the output function to be easily undone).

As always there is a trade-off as we try to get the best quality we can as quickly as we can. If we didn't care at all about speed, we could have more steps in the output function to produce more heavily scrambled output, but the point of PCG was that the underlying LCG was “almost good enough” and so we didn't need to go to quite as much effort as we would with something like a counter incrementing by 1.

Spoiler

I'm pleased to report success! About 25 days ago when I was first thinking about this I was actually on vacation. When I got back about ten days ago, I tried the ideas I had and was pleased to find that they worked well. The subsequent time has mostly been spent on various kinds of testing. Yesterday I was satisfied enough that I pushed the code into the C++ version of PCG. Tests at small sizes indicate that it is much better than XSL RR, and competitive with RXS M, but it actually shines at larger sizes. It meets all the other goals as well.

Details

FWIW, the new output function is (for the 64-bit output case):

uint64_t output(__uint128_t internal)
{
    uint64_t hi = internal >> 64;
    uint64_t lo = internal;

    lo |= 1;
    hi ^= hi >> 32;
    hi *= 0xda942042e4dd58b5ULL;
    hi ^= hi >> 48;
    hi *= lo;
    return hi;
}

This output function is inspired by xorshift-multiply, which is widely used. The choice of multipliers is (a) to keep the number of magic constants down, and (b) to prevent the permutation being undone (if you don't have access to low-order bits), and also to provide the whole “randomized-by-itself” quality that PCG output functions typically have.

Other changes

It's also the case that 0xda942042e4dd58b5 is the LCG multiplier for this PRNG (and all cm_ prefixed 128-bit-state PCG generators). As compared to 0x2360ed051fc65da44385df649fccf645 used by pcg64, this constant is actually still fairly good in terms of spectral-test properties, but is cheaper to multiply by because 128-bit × 64-bit is easier than 128-bit × 128-bit. I've used this LCG constant for several years without issue. When using the cheap-multiplier variant, I run the output function on the pre-iterated state rather than the post-iterated state for greater instruction-level parallelism.
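
A rough Python sketch of that ordering (feeding the pre-iterated state to the output function while the next state is being computed); this is illustrative only, not the actual C/C++ implementation:

MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
CHEAP_MULTIPLIER = 0xDA942042E4DD58B5   # the constant quoted above

def dxsm_output(internal):
    # the output function shown above, written out in Python
    hi = (internal >> 64) & MASK64
    lo = (internal & MASK64) | 1
    hi ^= hi >> 32
    hi = (hi * CHEAP_MULTIPLIER) & MASK64
    hi ^= hi >> 48
    return (hi * lo) & MASK64

def step(state, inc):
    new_state = (state * CHEAP_MULTIPLIER + inc) & MASK128   # advance the LCG
    return dxsm_output(state), new_state                     # output uses the pre-iterated state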

Testing

I've tested it thoroughly (PractRand and TestU01) and I'm happy with it. Tests included scenarios outlined in this thread (e.g., taking a gang of generators either on sequential streams or advanced by 2^64 and interleaving their output — I tested a gang of four and a gang of 8192 out to 8 TB with no issues, as well as a stream and its opposite-land counterpart).
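
For concreteness, one of those interleaving scenarios might be set up along the following lines with the C++ library (a rough sketch of the setup only, assuming a compiler with native __uint128_t; the actual test harness feeds the interleaved stream to PractRand and is not shown here):

#include "pcg_random.hpp"
#include <stdint.h>

int main()
{
    // Two correlated generators: the second is the first advanced by 2^64.
    pcg_engines::cm_setseq_dxsm_128_64 a(12345u);
    auto b = a;
    b.advance(__uint128_t(1) << 64);

    // Interleave their outputs; a real test pipes this stream into PractRand.
    for (int i = 0; i < 8; ++i) {
        uint64_t x = (i & 1) ? b() : a();
        (void)x;
    }
    return 0;
}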

Speed

I could go on at length about speed tests and benchmarks. There are all sorts of factors that influence whether one PRNG runs faster than another in a given benchmark, but overall, this variant seems to often be a little faster, sometimes a lot faster, and occasionally a little slower. Factors like the compiler and the application have a much greater impact on benchmark variability.

Availability

Users of the C++ header can access this new family member _now_ as pcg_engines::cm_setseq_dxsm_128_64; at some point in the future, I'll switch pcg64 from pcg_engines::setseq_xsl_rr_128_64 to this new scheme. My current plan is to do so this summer as part of a PCG 2.0 version bump.

Formal Announcements

Overall, I'm very happy with this new family member and at some point later in the summer, there will be blog posts with more detail, likely referencing this thread.

Your Choices...

Of course, you have to work out what to do with this. Regardless of whether you'd use it or not, I'd actually be pretty curious to see whether it does better or worse in your speed benchmarks.

@imneme Does it get around the need to have a fast full 64-bit multiplier? (Which is super fast on x64 but a tad slower on some weaker architectures.)

@lemire: It's a 128-bit × 64-bit multiply, which internally will be done with two multiply instructions on x86 (a 64-bit × 64-bit → 128-bit result for the low-order bits, and a 64-bit × 64-bit → 64-bit result for the high-order bits; both multiplies can take place in parallel, and then the two partial results need to be added).

It's still potentially better than 128-bit × 128-bit. Although how much better depends on how well instruction scheduling goes at that moment.

You're right that on ARM, 64-bit × 64-bit → 128-bit result is actually two instructions.
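
Put as code, the decomposition looks roughly like this (a sketch assuming __uint128_t support; on x86-64 a compiler will typically lower it to one widening multiply for the low product, one ordinary 64-bit multiply for the high product, and an add):

#include <stdint.h>

__uint128_t mul_128_64(__uint128_t a, uint64_t b)
{
    uint64_t a_lo = (uint64_t)a;
    uint64_t a_hi = (uint64_t)(a >> 64);
    __uint128_t lo_prod = (__uint128_t)a_lo * b;  // 64 x 64 -> 128 (widening multiply)
    uint64_t    hi_prod = a_hi * b;               // 64 x 64 -> 64 (only the low half matters)
    // The two multiplies are independent and can execute in parallel;
    // one addition then combines them (everything wraps mod 2^128).
    return lo_prod + ((__uint128_t)hi_prod << 64);
}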

(It's totally possible, of course, to gang together two 64-bit LCGs and mix them. The space of all the PRNGs that could exist and would work well is pretty huge.)

A quick-and-dirty implementation in our framework suggests a mild performance improvement, at least on 64-bit Linux:

Time to produce 1,000,000 64-bit unsigned integers
************************************************************
MT19937      5.42 ms
PCG64        2.59 ms
PCG64DXSM    2.41 ms
Philox       4.37 ms
SFC64        2.07 ms
numpy        5.41 ms

64-bit unsigned integers per second
************************************************************
MT19937      184.39 million
PCG64        386.01 million
PCG64DXSM    415.02 million
Philox       228.88 million
SFC64        483.94 million
numpy        184.79 million

I think it's outputting from the pre-iterated state that gives it the bump, at least in this context. I left that out at first and got essentially the same performance as PCG64 1.0. The real win would be under 128-bit emulation, I suspect, but I didn't get around to writing that up, and I don't have a good way to test it on the platforms that matter.

I guess the real question for you, @imneme, is how annoyed will you be with the name numpy.random.PCG64 implementing the 1.0 algorithm? The release is imminent and already delayed, so I don't _think_ we're going to change the algorithm at this time. If the performance on 32-bit platforms is particularly good, then I think we might add PCG64DXSM in a following release, and maybe reconsider the default a few releases down the line.

It's your choice to make!

I have no problem with your shipping the 1.0 version of PCG64. Plenty of other folks have used that variant.

I do think the DXSM variant has the advantage of avoiding the edge-case-usage issues that came up in this thread (after all, that's pretty much why it exists), but on the other hand, it has the disadvantage that it's late to the party. It might even seem quite reckless to ship a PRNG that is less than a month old out to users (even though it is based on the same ideas as the more time-tested PCG variants).

(That said, if it were _my_ choice, notwithstanding possible accusations of recklessness, I'd probably ship the new one; I think the delay to get it up and running in Numpy is pretty minimal. And the risk is very low — it's already thoroughly tested with BigCrush, and tested with PractRand out to 16 TB (including cm_mcg_dxsm_64_32, which is a quarter of the size [no streams, 32-bit output]), and will likely hit 32 TB in less than a week.)

[Glad the performance did get a tad better. Five years ago, using the pre-iterated state was a pessimization for 128-bit sizes with 128-bit multipliers. But that was then, on the machines I was testing on, with the benchmarks I was using.]

I meant more about using the name PCG64 for the 1.0 variant when you are going to be using that name to refer to the 2.0 variant.

@rkern If it is just a naming issue, then PCG64DXSM and PCG64 sets them apart nicely, no?

For numpy, certainly. I am just wondering if @imneme would prefer that we not name our 1.0 implementation PCG64 when she is going to be promoting the 2.0 variant under that name in the C++ version. I am sensitive to the fact that being loose with the names means that some people might test numpy's PCG64 and compare that to the claims that will be made on pcg-random.org about the 2.0 version. Cf. just about any conversation about Bob Jenkins's PRNGs.

In Section 6.3 of the PCG paper, it says:

Note also that although the generators are presented with mnemonic names based on the permutations they perform, users of the PCG library should rarely select family members by these mnemonics. The library provides named generators based on their properties, not their underlying implementations (e.g., pcg32_unique for a general-purpose 32-bit generator with a unique stream). That way, when future family members that perform even better are discovered and added (hopefully due to the discoveries of others), users can switch seamlessly over to them.

And the C and C++ libraries are structured that way. The libraries provide

  • a low-level interface that lets you pick a specific family member via the name of its permutation, the bit sizes it operates at, and the characteristics of the underlying LCG
  • a high level interface that provides convenient aliases like pcg64 that connect to a pre-chosen low-level family member.

In this way, the aliases can be updated to point to newer family members, but users who want to exactly reproduce older results will still be able to by using the low-level interface to select the family member that was previously reachable by a convenient high-level alias.
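
As a rough illustration of that structure (using only engine names already mentioned in this thread; the exact target of the pcg64 alias after a 2.0 bump is a presumption, not a commitment), the high-level alias is just a type alias onto one low-level family member, so older results stay reproducible by naming the engine directly:

#include "pcg_random.hpp"

// Today's high-level alias: pcg64 is pcg_engines::setseq_xsl_rr_128_64.
using pcg64_1_0 = pcg_engines::setseq_xsl_rr_128_64;

// The new family member, reachable now via the low-level interface and the
// presumed target of the pcg64 alias after a 2.0 version bump.
using pcg64_2_0 = pcg_engines::cm_setseq_dxsm_128_64;

int main()
{
    pcg64_1_0 legacy(42u);  // reproduces results obtained under the 1.0 alias
    pcg64_2_0 dxsm(42u);    // the new DXSM family member
    (void)legacy();
    (void)dxsm();
    return 0;
}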

If you're going to ship a PRNG called PCG64, I'd say it is sufficient to say in your documentation which specific PCG variant that is — in other words, say which family member it corresponds to in the low-level C or C++ library interface.

The default generator is implemented as np.random.default_gen() in https://github.com/numpy/numpy/pull/13840. (@rkern for future reference, it's probably good to explicitly call out PRs -- they are easy to miss in GitHub if you only provide a back-link, since there are no notifications for that.)

One minor nit: what about calling this np.random.default_generator() instead? gen feels too short/non-obvious to me. I would be curious what others think.

what about calling this np.random.default_generator() instead?

I had the same thought, but then, np.random.default_generator() is a hair on the long side, so I played with default_rng.

👍 I like default_rng better than default_gen, too. I would be happy with either of these, though I would still lean towards default_generator.

:+1: for default_rng().
