Pytorch: RFC: Add torch.deterministic flag to force deterministic algorithms

Created on 18 Dec 2018  ·  67 Comments  ·  Source: pytorch/pytorch

🚀 Feature

We should add a global variable to force PyTorch to use bitwise deterministic algorithms. Soumith suggests adding the flag to a torch.experimental sub-package because we're not sure about some of the details.

Motivation

Bitwise determinism between runs is sometimes useful for debugging. However, it's difficult to write efficient deterministic algorithms for some operations.

Pitch

When torch.experimental.deterministic is False (the default), PyTorch should use the fastest algorithm available for a given operation. When torch.experimental.deterministic is True, PyTorch should only use deterministic algorithms. PyTorch should issue a warning if we don't have a deterministic algorithm available for a given operation and torch.experimental.deterministic is True.
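A usage sketch of the proposed behavior (the torch.experimental namespace here is the proposal under discussion, not an existing API):

import torch

torch.experimental.deterministic = True   # proposed flag; defaults to False

x = torch.randn(8, 3, 16, 16, device='cuda', requires_grad=True)
y = torch.nn.functional.interpolate(x, scale_factor=2, mode='bilinear',
                                    align_corners=False)
# interpolate's CUDA backward has no deterministic implementation, so
# PyTorch would warn (or, per the later discussion, error) here:
y.sum().backward()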

cuDNN

We already have a torch.backends.cudnn.deterministic flag to control cuDNN algorithm choices. We should keep this flag for now and restrict cuDNN to deterministic algos if either torch.backends.cudnn.deterministic or torch.experimental.deterministic is True.
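In pseudocode, the effective cuDNN restriction would be the OR of the two flags (a sketch of the proposed semantics, not actual code):

# cuDNN is restricted to deterministic algorithms if either flag is True
use_deterministic_cudnn = (torch.backends.cudnn.deterministic
                           or torch.experimental.deterministic)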

Non-goals

We only aim for bitwise determinism between runs on machines with the same architecture and configuration. For example, even when torch.experimental.deterministic is True we do not aim for bitwise determinism when any of the following varies:

  • PyTorch version
  • CPU architecture (e.g. x86 with AVX vs. ARM)
  • GPU architecture (e.g. AMD vs. NVIDIA or P100 vs. V100)
  • Library dependencies (e.g. OpenBLAS vs. MKL)
  • Number of OpenMP threads

Implementation suggestions

I suggest adding this feature in two steps. The first step is to add the torch.experimental.deterministic flag and add warnings to any non-deterministic operations. The second step is to add deterministic implementations for the non-deterministic operations.

There is a partial list of non-deterministic operations in the PyTorch docs.

Open questions

How should torch.experimental.deterministic interact with the RNG seed? Should it set a default seed if no manual seed has been set? Should it issue a warning if no manual seed has been set?

cc @ezyang @gchanan @zou3519

Labels: feature, high priority, determinism, internals, triaged


All 67 comments

This is a thumbs up from me. The problem will primarily be how to actually roll this out everywhere in the codebase; nothing is worse than claiming we're deterministic when secretly we're not :)

I'm all for it and my approach would be to flag ops and error when deterministic is on and we know they're not.

I think erroring on non-deterministic ops is too harsh. Warning seems like a smoother experience

I think the default should be to throw, but I guess we could support a multi-valued property there (non-deterministic is ok, warn, throw).

I must admit I don't really see the use-case of a warning. When people care about determinism enough to switch it on, they'd probably expect the error. You could always switch it off for certain calls to say that you're OK with whatever nondeterminism is in there.

Error, warning, proper documentation...
The latter is a must.
Warning or error? I'll go with an error.

throwing seems great. I agree with Adam that giving an option to warn instead of throw seems reasonable.

Thanks for weighing in. In the end, the main effort for the ternary flag is the flag itself, and that's not hard.
I'll add a flag to Context.h and sprinkle (via a utility function) the AT_ERROR and AT_CHECK.

Hello,
Any news on this flag?
Determinism is crucial.
From my experience, the current version allows determinism on a single GPU, up to a precision of 1e-16, using fixed seeds. Note that an infinitesimal difference may be amplified and cause the results to diverge.
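For reference, the usual single-GPU recipe is to combine fixed seeds with the existing cuDNN flags, roughly:

import random
import numpy as np
import torch

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds all CUDA devices
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False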

Please consider the case of multi-GPU as well (at least for a fixed number K of GPUs, the behavior needs to be deterministic). I am able to achieve some kind of determinism that breaks down from time to time for a reason I do not understand yet (using nightly build 1.2.0.dev20190616). I am struggling with it right now (1, 2).

Thank you!

@t-vi are you actively working on this?

I don't want to keep you from doing it.

@t-vi Sorry if I was not clear, I am not planning on working on this :) . Just was trying to understand whether anyone was actively doing so.

After almost one year, the problem of non-deterministic interpolation is still not solved.

Hope the community adds this feature :)

Maybe a deterministic interpolation would bring great help to users.

~I didn't really advertise it yet, but given that there seems to be more user interest than developer resources allocated, I have this listed as a project that you can vote on on my GitHub sponsorship page once I set it up.
I'm quite certain we could make good progress by the end of the year, and interpolation certainly is one of the things I have a plan for how to fix (similar to the pseudocode for fold that is somewhere in the issues), but it just isn't high up on my own priorities list.~
Turned out to be not interesting.

a deterministic interpolation will be a huge help. link

Bumping priority, esp for CUDA, based on user feedback

I'm glad it's being fixed, thank you!

@t-vi to be fair, I don't think "bumping priority" is equivalent to "it's being fixed" :).

Looking forward to the solutions!

colesbury mentioned that one killer reason for deterministic algorithms is not that determinism itself is the goal, but that you can rule nondeterminism out as a cause when you turn this on ;)

How should torch.experimental.deterministic interact with the RNG seed? Should it set a default seed if no manual seed has been set? Should it issue a warning if no manual seed has been set?

I'd suggest not setting a seed if none has been set by the user. For one because it couples two interfaces which isn't needed (users who care about determinism will understand RNGs very well I'd think). More importantly, this is very hard to do reliably; one can use an RNG in multi-process/threaded applications, have other torch.Generator subclasses, be using numpy.random as well, etc.

Not sure about a warning, only if there's a sane place to set it (e.g. are you then forcing the user to seed before setting determinism=True rather than in the same module/function where the RNG is used?).

I am just curious: when I set torch.backends.cudnn.deterministic=True, the interpolation operator is still not deterministic. Does PyTorch interpolation not use cuDNN?

It may not. You can nvprof your run of interpolate to check for certain.

I'm wondering whether or not we should continue providing the deterministic arguments in function calls once torch.experimental.deterministic is implemented. Maybe we should, because a user might prefer determinism for some operations and speed for other operations.

If we do keep the arguments, then what happens if torch.experimental.deterministic and a function's deterministic flag oppose each other? Should torch.experimental.deterministic = True mean "use determinism in all cases no matter what", or should it mean "use determinism as a default value, but if the deterministic argument is specified in a function call, then use that setting for that specific function call"? In other words, how should the code below be handled? Does someone know how the torch.backends.cudnn.deterministic flag acts in a similar situation?

torch.experimental.deterministic = True
torch.some_operation(deterministic=False)

@kurtamohler Good question. I think the easiest fix is to make it bool? deterministic=None, and then interpret None to mean "respect torch.experimental.deterministic", and otherwise use exactly what the user requested.

We kind of have a similar situation with convolution, but the way it was done there was that there is a convolution with no benchmark argument, and then a _convolution with an explicit benchmark.

I think either of these solutions would be acceptable; however, the convolution approach has an added benefit of not leaking the internal deterministic flag to the user-visible API (unless they use an internal API).
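A minimal sketch of that resolution rule at the Python level (the helper name is made up, just to illustrate the proposal):

from typing import Optional

import torch

def resolve_deterministic(deterministic: Optional[bool]) -> bool:
    # None means "respect the global flag"; an explicit True/False wins.
    if deterministic is None:
        return torch.experimental.deterministic  # proposed global flag
    return deterministic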

What's the rationale for "I want to be deterministic everywhere, but _not in this particular operator_"? Is this really supposed to be a common enough use case to warrant adding an extra input to many of our operators (and most of the complex ones)? IMO it would be better to provide context managers for toggling determinism.

@apaszke , yeah I think you're right that it would be better to just use context managers to toggle determinism. I wouldn't say that we should add the deterministic argument to any operators, but some operators already have it. Would it be best to remove all of those and break BC, or would it be best to keep them around and allow them to override torch.experimental.deterministic?

I'd say that we should remove it or make it private at least (i.e. underscore prefix or sth).

I'm wondering if the deterministic feature for the interpolate function is closed and will not be implemented?

Nope, we are amenable to deterministic versions of ALL functions in PyTorch

@ezyang which PyTorch version has the deterministic F.interpolate function? Is it starting from PyTorch 1.6? Or is it available in the latest stable version (1.5)? Or do I have to build and install PyTorch from source?

I'd be happy to start working on this

The above commit only adds the flag; it doesn't affect any operations yet. I'd appreciate it if someone could take a few minutes to look at it and let me know if I did anything incorrectly or if anything could be improved so far. I based this off of how torch.backends.cudnn.deterministic is implemented.

This looks OK, but I feel like the internal naming shouldn't include experimental (since, ostensibly, some day you want to make it not experimental, and that shouldn't involve having to rename all the implementation bits!)

@ezyang, yeah that makes sense, I'll rename.

I added a torch.experimental.deterministic_error_level, similar to what @t-vi did in his previous work on this issue. deterministic_error_level controls the error/warning behavior if deterministic == True and a given function does not have a deterministic implementation. It can be set to 2 (error), 1 (warn), or 0 (silent).
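A usage sketch with the names from the work-in-progress branch (not a finalized API):

import torch

torch.experimental.deterministic = True
# 2: error, 1: warn, 0: silent when an op has no deterministic implementation
torch.experimental.deterministic_error_level = 1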

If the user sets it to any other value, I want to throw a catchable python runtime exception. Usually, I would use TORCH_CHECK() for that kind of behavior, but in this case, the exception isn't catchable and I'm not sure why. Here's the TORCH_CHECK() call: link

This is what happens when that check fails:

>>> import torch
>>> try:
...     torch.experimental.deterministic_error_level=50
... except:
...     print('exception caught')
... 
terminate called after throwing an instance of 'c10::Error'
  what():  error level 50 is invalid, must be one of 0: None, 1: Warn, or 2: Error
Exception raised from longToErrorLevel at ../aten/src/ATen/Context.cpp:85 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x58 (0x7f53e2cc0878 in /work/kurtamohler/development/pytorch-deterministic-flag/torch/lib/libc10.so)
frame #1: at::Context::longToErrorLevel(long) + 0x122 (0x7f53f6d61a82 in /work/kurtamohler/development/pytorch-deterministic-flag/torch/lib/libtorch_cpu.so)
frame #2: THPModule_setDeterministicErrorLevel(_object*, _object*) + 0x31 (0x7f53fb5625d1 in /work/kurtamohler/development/pytorch-deterministic-flag/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xe7 (0x7f5432d62b97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

If anyone knows how I can fix that, please let me know.

@kurtamohler is THPModule_setDeterministicErrorLevel missing HANDLE_TH_ERRORS / END_HANDLE_TH_ERRORS macros? They're needed to catch the C++ exception and translate it into a Python error return.

Ah that was it, thanks @colesbury!
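With the macros in place, the same invalid assignment becomes an ordinary, catchable Python exception (same work-in-progress attribute name as above):

import torch

try:
    torch.experimental.deterministic_error_level = 50  # invalid value
except RuntimeError as e:
    print('exception caught:', e)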

I'm starting to add the non-deterministic alert to all the callers of atomicAdd. I noticed that some callers only use atomicAdd in certain cases. For instance, adaptive_avg_pool3d_backward only uses atomicAdd if (isizeW%osizeW != 0) || (isizeH%osizeH != 0) || (isizeT%osizeT != 0) is true. Should I only alert in these cases and try to convey them in the error message, or would it be alright to just alert whenever these functions are called, whether or not atomicAdd ends up being used?

It's probably easier to implement and easier to understand if you unconditionally alert.

@ngimel , I've been thinking about how to use CUBLAS_WORKSPACE_CONFIG to ensure deterministic stream usage, and I think there are two main approaches that should be considered.

If someone is using one of the affected CUDA versions (10.2 or higher at the moment), and torch.set_deterministic(True) is called, use std::getenv to make sure that CUBLAS_WORKSPACE_CONFIG is either :16:8 or :4096:8. If not, do either (1) or (2):

  1. Throw an error telling the user to set the variable appropriately.

  2. Automatically set the variable with putenv (_putenv on Windows). However, there are some further design decisions associated with this. Should we choose :16:8 (lower performance, but less memory usage) or :4096:8 (higher performance, but more memory usage)? Also, if the user set the variable to some other non-deterministic value, we would either have to keep track of the original value and restore it if torch.set_deterministic(False) is called, or we could throw an error telling the user that they need to unset the variable, or some other scheme.

Also, I don't know whether setting the variable while the application is already running will actually have any effect, so I don't know for sure if option (2) is even possible. The variable might only be checked once, when the CUDA runtime starts or when a cuBLAS handle is created. I couldn't find information about this, so I'd probably have to find out experimentally (I'm going to have to use a non-deterministic stream usage reproducer to write a test either way, so I'll look into this). I also looked for an API call, rather than using the environment variable, but CUDA doesn't seem to offer one.

Do you have a strong opinion about which option would be better? Option (2) would probably be more user-friendly, but possibly less transparent than option (1).
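For illustration, a Python-level sketch of the check in option (1); the real check would live in the C++ context, and the helper name here is made up:

import os

import torch

_ALLOWED_CONFIGS = (':16:8', ':4096:8')

def check_cublas_workspace_config():
    # Only relevant for the affected CUDA versions (>= 10.2 at the time of writing).
    if torch.version.cuda is None:
        return
    if os.environ.get('CUBLAS_WORKSPACE_CONFIG') not in _ALLOWED_CONFIGS:
        raise RuntimeError(
            'To enable deterministic behavior with CUDA >= 10.2, set '
            'CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8')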

I don't know whether setting the variable while the application is already running will actually have any effect

To follow up on this question, setting the environment variable inside a pytorch script does not seem to affect the CUDA stream's determinism. I modified the script from https://github.com/pytorch/pytorch/issues/39849 to run multiple times and compare training stats to check for non-deterministic behavior. It tries to set CUBLAS_WORKSPACE_CONFIG=:4096:8 to ensure deterministic stream usage: https://github.com/kurtamohler/pytorch-perf-test-scripts/blob/master/nondeterministic_alert/cuda_stream_nondeterminism.py

Running it shows that we do not get deterministic behavior from setting the variable inside the script:

$ python cuda_stream_nondeterminism.py 
Before setting var: not deterministic
After setting var: not deterministic
After restoring old var: not deterministic

But running it with the environment variable set outside of the script does make it deterministic:

$ CUBLAS_WORKSPACE_CONFIG=:4096:8 python cuda_stream_nondeterminism.py 
Before setting var: possibly deterministic
After setting var: possibly deterministic
After restoring old var: possibly deterministic

Note, it prints "possibly deterministic" because I only run the training function 5 times, and it's possible to get lucky even if the behavior is not really deterministic.

Maybe if I could reinitialize the CUDA stream, that would force it to honor the changed CUBLAS_WORKSPACE_CONFIG variable. I'd like to try that, but I don't know how, or if it's even possible to do at runtime. If someone knows, please let me know.

I found out that I can create and use a new stream with:

with torch.cuda.stream(torch.cuda.Stream()):

But the new stream doesn't honor the changed environment variable setting. I also found torch.cuda.init(), but unfortunately, that's a no-op if cuda has already been initialized.

So unless we can think of something else to try, it looks like we probably can't change the workspace config automatically, so we might just have to throw an error telling the user to set it.

Yep, setting the environment variable after the CUDA context has been initialized has no effect, so unfortunately it's an all-or-nothing solution. Throwing an error telling the user to set it sounds reasonable.

Currently, it doesn't seem possible to check the CUDA version from a file that isn't compiled with nvcc, so I believe I'll have to add that to aten/src/ATen/cuda/detail/CUDAHooks.h (checking the cuDNN version is part of that interface). If anyone knows better, please let me know.

The above commit adds the error. But I need to figure out what to do with the unit tests now. There are two problems:

  • In order to test that the error is being thrown in the correct case (cuda >= 10.2 and CUBLAS_WORKSPACE_CONFIG is not set properly), the testing infrastructure would have to be able to automatically change the environment variable before running a test
  • To make sure that the existing torch.set_deterministic tests don't break, we would need to automatically set CUBLAS_WORKSPACE_CONFIG properly. We could potentially just set this variable by default in all the CI jobs that use cuda >= 10.2.

I found out that I am able to set environment variables from a python script, then reload the torch module to make it honor the new value:

>>> import torch
>>> torch.set_deterministic(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/work/kurtamohler/development/pytorch-deterministic-flag-cuda-env-var/torch/__init__.py", line 306, in set_deterministic
    _C._set_deterministic(d)
RuntimeError: To enable deterministic behavior with CUDA >= 10.2, you must set environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
>>> import os
>>> os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
>>> from importlib import reload
>>> torch = reload(torch)
>>> torch.set_deterministic(True)

I don't know if reloading torch will also cause CUDA to honor this change, but at least this gives us a way to unit test for the error message. Although I have to ask, could there be any issue with reloading the torch module inside a unit test?

EDIT: Turns out that I don't need to reload torch to make it see the changed environment variable. Also, reloading after changing the variable does not affect the CUDA runtime.

The above commit addresses all the concerns I mentioned in my previous comment. I added a decorator to wrap any API test that calls torch.set_deterministic(), temporarily setting CUBLAS_WORKSPACE_CONFIG=:4096:8 only if needed. It also restores the deterministic flag and CUBLAS_WORKSPACE_CONFIG settings to what they were before the test was run.
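A rough sketch of what such a decorator could look like (illustrative only; the real one also saves and restores the deterministic flag):

import functools
import os

def with_cublas_workspace_config(config=':4096:8'):
    """Temporarily set CUBLAS_WORKSPACE_CONFIG around a test, then restore it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            old = os.environ.get('CUBLAS_WORKSPACE_CONFIG')
            os.environ['CUBLAS_WORKSPACE_CONFIG'] = config
            try:
                return fn(*args, **kwargs)
            finally:
                if old is None:
                    del os.environ['CUBLAS_WORKSPACE_CONFIG']
                else:
                    os.environ['CUBLAS_WORKSPACE_CONFIG'] = old
        return wrapper
    return decorator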

I realized that the reproducibility doc mentions that deterministic CuDNN behavior requires:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Does someone on this thread know what benchmark is exactly, and why torch.backends.cudnn.deterministic = True by itself is not sufficient?

We might want to force benchmark to be turned off if torch.is_deterministic() == True. In other words, instead of passing ctx.benchmarkCuDNN() directly into at::_convolution(), maybe it should be ctx.benchmarkCuDNN() && !ctx.deterministic() on this line: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L602

If we don't make this change, it seems like people who use set_deterministic and CuDNN will have to do this:

torch.set_deterministic(True)
torch.backends.cudnn.benchmark = False

Meaning that set_deterministic() alone would not cover everything, which is confusing in my opinion.

cc @ezyang @colesbury @t-vi @ngimel

When encountering a new convolution configuration, benchmark=True runs all available cuDNN implementations and picks the fastest one, caching the implementation that it picked, so all subsequent calls to convolution with the same parameters will use it. So, if deterministic is also set to True, the results will be deterministic as long as this cache persists, that is, as long as you are in the same process. If there are implementations with close runtimes, the next time you start the process and run benchmarking again, another implementation may win, and the results (while still deterministic in the sense described above) will be different from the previous run. So, to guarantee determinism between runs you have to turn benchmarking off.

I see. So perhaps only in-process determinism, not cross-process determinism, matters for some applications, so people could find it useful to still be able to use benchmarking if they set torch.set_deterministic(True). In that case, I should not change the current behavior. As long as I update the docs to make that clear, I don't see a problem with it.

I made a wiki page to help PyTorch contributors add support for torch.set_deterministic(): https://github.com/pytorch/pytorch/wiki/How-to-support-%60torch.set_deterministic()%60-in-PyTorch-operators

Any improvements are welcome.

Also, I wasn't sure if the "Currently unsupported functions" section should be in this wiki or if it would be better as a new github issue (the wiki page could link to it). Does anyone have a preference?

Hi, I want to talk about the plan going forward for torch.deterministic. There are a few high-level questions that we need to answer:

  1. What are the semantics of torch.deterministic? What does the user expect? Is best effort actually useful for a user? If it is not useful, is it better to define torch.deterministic in terms of what operations it controls?
  2. Now that we have the torch.deterministic flag, does it make sense to eliminate the deterministic= keyword argument entirely from the public-facing API (bmm, I'm looking at you)?
  3. What is the end game for this work? How much of this are you (@kurtamohler) going to work on, versus the generic community, and when we get to the end of your stint here, what does a reasonable state look like?

Starting with (1), the current documentation for torch.deterministic says:

     r"""Sets a global flag to force all operations to use a deterministic
    implementation if available. If an operation that does not have a
    deterministic implementation is called while this setting is True, the
    operation will throw a RuntimeError.

    Note that deterministic operations tend to have worse performance than
    non-deterministic operations.

While this may be true for an eventual end state, this inaccurately represents the current situation, where a lot of operations have not been audited and, for any given model, we don't know if torch.deterministic will actually do what it says on the tin and make your model deterministic / raise an error when you hit nondeterminism. So basically, our implementation is buggy with respect to these semantics, and will continue to be buggy for the foreseeable future. This is not a great state to be in.

We could change the documentation of torch.deterministic to ameliorate this. Some possible changes:

  • torch.deterministic is best effort, but please report bugs if you see that it doesn't catch some non-determinism
  • torch.deterministic toggles the behavior of these operators (and then give an exhaustive list of the operators it toggles)

The second bullet point leads to (2): if torch.deterministic now exists as a way to toggle determinism, it is much less important to support determinism directly in the user API. So we probably shouldn't have added the deterministic argument to bmm. We might consider exposing an internal function if you want to toggle something directly, but deterministic shouldn't be available directly on the function itself.

What do you think? I think changing the docs is probably the easiest way to get on a sustainable path. There are some other details, like how to populate the exhaustive list, but these semantics probably make more sense than "ideal" semantics that aren't actually going to be true.

cc @gchanan @mruberry

@zou3519 intersected with the Q too at https://github.com/pytorch/pytorch/pull/38683#issuecomment-662590937

I'm glad you brought up these questions @ezyang, @zou3519, and @mruberry. I agree that the documentation I wrote is a false representation of the current state.

I like the idea of exhaustively listing all the functions that torch.set_deterministic() affects, so that we're not lying to the user. Thanks for adding that to 1.6.0, @zou3519.

I agree that we shouldn't offer the deterministic setting as direct function arguments.

As for the end game, I am happy to keep working on this for as long as necessary, but it should be set up so that anyone can quickly learn how to help.

In the long run, I think that providing an exhaustive list of affected functions is a valid decision, but I don't think that strategy alone would maximize the usefulness of the deterministic flag. We can categorize functions (in one specific environment) like this:

  1. Deterministic
  2. Nondeterministic by default, but has support for the deterministic flag (either error or alternate implementation)
  3. Nondeterministic and does not have support for the deterministic flag

Of course the ideal case is to completely eliminate category 3, and then the list of category 2 functions would be sufficient. However, category 3 functions will still exist for a significant period of time (or perhaps forever, if not all contributors are aware of the question of determinism, or a commit accidentally removes determinism for a function, etc.). So even if we have an exhaustive list of all the category 2 functions, the user has no simple way to know if a function that does not appear on the list is deterministic or not (could be category 1 or 3). For instance, torch.add doesn't appear on the list, so how does the user know that it's deterministic?

Perhaps we could think about maintaining a list of category 3 functions as well. But manually maintaining these lists would be very difficult for many reasons, so I wonder if we could automate it somewhat. We could potentially set up a CI job that runs determinism tests on all functions. It's not possible to 100% prove inductively that a function is deterministic, and a nondeterministic function may sometimes give the same result multiple times if we're unlucky. But the more often we run these tests, the more confident we could become about which category each function is part of.
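A naive sketch of such a repeated-run check; note that it can only ever demonstrate nondeterminism, never prove determinism:

import torch

def looks_deterministic(fn, *inputs, trials=5):
    # Run fn several times on identical inputs and compare results bitwise.
    reference = fn(*inputs)
    return all(torch.equal(fn(*inputs), reference) for _ in range(trials - 1))

# e.g. looks_deterministic(torch.add, a, b) for CUDA tensors a and b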

There is also a question of how to most efficiently convey to the user everything that we know and don't know about each function and each platform. Maybe we could make a table of all the category 2 and 3 functions on each platform. It would be nice if the determinism tests could automatically verify that this table is correct.

Just brainstorming, maybe these ideas are more difficult than they are worth. A more pragmatic plan could be significantly more sustainable, even if less ideal.

Is torch.add deterministic?

import torch
n = 512
device = 'cuda'
a = torch.arange(n**3, device=device, dtype=torch.float32)
a = a.reshape((n, n, n))
b = torch.arange(n**3, device=device, dtype=torch.float32)
b = b.reshape((n, n, n))
out_zero = torch.zeros((n, n, n), device=device)
out_zero = out_zero.set_(out_zero.storage(), storage_offset=0, size=a.size(), stride=(1,1,1))
out_one = torch.zeros((n, n, n), device=device)
out_one = out_one.set_(out_one.storage(), storage_offset=0, size=a.size(), stride=(1,1,1))

torch.add(a, b, out=out_zero)
torch.add(a, b, out=out_one)
(out_zero == out_one).all()
# tensor(False, device='cuda:0')

We should probably document that overlapped tensors violate whatever determinism contract we're going for.

Listing the operations affected by a "determinism" flag sounds good. Stepping back slightly, though, it seems like we're really talking about two things:

  • Requesting deterministic versions of operations, if available (use_deterministic?)
  • Warning if an operation is nondeterministic

A flag for the first thing seems straightforward. The second, however, is a little trickier. I worry that it's hard to tell if the operations of math libraries like oneDNN, cuDNN, and MAGMA are deterministic, especially across versions and hardware. Do you have an idea for how best to address this, @kurtamohler? Maybe we could warn on all native non-deterministic operations and also warn when math library calls were made, too? Warning once per process shouldn't be that intrusive.

This approach to warnings would require reviewing a lot of algorithms and call sites before going live, but it needn't block the flag to select deterministic algorithms if they're available.

(A third thing under discussion is the best way to present deterministic algo selection (via a global flag or as kwargs on functions), but I think we can delay that discussion until we determine a plan for the flag(s)?)

I think we should not let the perfect be the enemy of the good here. I don't know when it was 100% safe to use self-overlapping tensors with PyTorch, and my impression is that it's not that common people use them.

My impression from the forums is that most people are surprised that they run something twice and get different gradients from it, most often because one of PyTorch's native functions uses atomicAdd.
If we get warnings for that, we have covered most cases people are wondering about. What feels like half of it is actually from upscaling backward.

I think we should clearly state that this is best-effort as far as external libs are concerned and that we add warnings as we get to know issues, but my impression is that our native kernels actually are what matters most.

I don't know when it was 100% safe to use self-overlapping tensors with PyTorch, and my impression is that it's not that common people use them.

Yes, and any program that does might reasonably be classified as being in error. I just meant we should be careful to document whatever contract we come up with for these flags.

I think we should clearly state that this is best-effort as far as external libs are concerned and that we add warnings as we get to know issues...

The doc might say something like, "math library calls that are known to be nondeterministic..."?

I agree with @t-vi (and I really like the observation that half of the reported nondeterminism is upscaling backward). In particular, I think a state where we have partially documented functions that are known to be nondeterministic (or even partially documented some functions to be deterministic) is strictly better than one where we don't give any indication at all--the key thing is to not claim to support things we don't! I agree that it is a useful activity to think about how one could go about testing for determinism, but I think this as an orthogonal activity to flagging APIs which are obviously non-deterministic.

Since a lot of ideas have been floating around, let me just chime in with my specific thoughts about some of them:

  1. "Perhaps we could think about maintaining a list of category 3 functions as well." This seems like a lot of work. I think it's probably only worth it for functions where we explicitly made some accommodations for determinism (most likely, functions that support the deterministic flag)
  2. "We could potentially set up a CI job that runs determinism tests on all functions." I think something like this would have to be done with a lot of care, because by its very nature it is testing for something that is nondeterministic, and that means that the determinism test itself is "flaky" (will pass sometimes and fail others). Our CI reporting tools don't handle situations like this very well.
  3. "The second, however, is a little trickier. I worry that it's hard to tell if the operations of math libraries like oneDNN, cuDNN, and MAGMA are deterministic, especially across versions and hardware." We should best effort this. In many cases the math library explicitly specifies in their documentation that they are deterministic or not, and we should simply faithfully report what the docs say
  4. "Maybe we could warn on all native non-deterministic operations and also warn when math library calls were made, too?" I don't think we should do this. When we warn about nondeterminism, it should be because nondeterminism IS happening, not that it MAY be happening. If you overwarn, people will start ignoring the warnings.

I don't think we should worry about cross version/hardware determinism -- good luck with that.

When we warn about nondeterminism, it should be because nondeterminism IS happening, not that it MAY be happening. If you overwarn, people will start ignoring the warnings.

It does seem tricky. E.g. what if I'm running some op and the PyTorch implementation is deterministic, but some extension has overridden something (via a dispatch key, torch function, or otherwise) and now I don't know? If that's actually the source of my non-determinism, that seems like a bummer not to be warned.

If that's actually the source of my non-determinism, that seems like a bummer not to be warned.

Sure, but the user could also just not involve us in their nondeterministic shenanigans, and then of course you wouldn't expect to be warned then ;)

I believe we can close this issue now since the flag API exists and is well documented.

@kurtamohler Awesome work. Thank you.

Does it mean that we could use torch.manual_seed(111) to make everything deterministic, including the interpolation operation?

No. Take a look at the Reproducibility / Randomness note.
So far, we have the infrastructure, have marked the known sources of non-determinism, and have greatly improved the documentation so you can know what's going on.
If you hit non-deterministic operations, you're still out of luck, but now it is more reasonable to work on it.

The interpolation in particular seems like something that could be made deterministic by writing a not-all-that-complicated kernel for the backward.
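A small repro sketch of the kind of nondeterminism in question: even with a fixed seed, the CUDA backward of bilinear interpolation can differ bitwise between identical calls, because it accumulates gradients with atomicAdd (as discussed above):

import torch

torch.manual_seed(111)
x = torch.randn(1, 3, 64, 64, device='cuda', requires_grad=True)

def upsample_grad(x):
    y = torch.nn.functional.interpolate(x, scale_factor=2, mode='bilinear',
                                        align_corners=False)
    grad, = torch.autograd.grad(y.sum(), x)
    return grad

# May print False: the two gradients are not guaranteed to match bitwise.
print(torch.equal(upsample_grad(x), upsample_grad(x)))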

@t-vi Hi, now that PyTorch 1.7 is released, has the interpolation backward kernel been updated?

So the CUDA upsampling kernels and their backwards live in aten/src/ATen/native/cuda/UpSample*. A grep suggests that linear, bilinear, and cubic have nondeterministic backwards (they have a warning marker), but nearest does not.
@kurtamohler would be the much better person to ask, though.

