Jest: cache write race condition across processes

Created on 7 Sep 2017 · 79Comments · Source: facebook/jest

Do you want to request a feature or report a bug?
bug

What is the current behavior?
When running tests in parallel with the new atomic cache writing, we're getting rename errors as multiple processes try to write to the same files. Even with --no-cache option set it's still hitting rename errors because it's still trying to write to the files.

What is the expected behavior?

I think that --no-cache should not write cache files
Caching across multiple processes should not collide, or should be able to restart the test.

Please provide your exact Jest configuration and mention your Jest, node, yarn/npm version and operating system.

{
    "clearMocks": true,
    "collectCoverageFrom": [
        "packages/**/src/**/*.{ts,tsx}",
        "!packages/sf-lint/**",
        "!**/*.d.ts"
    ],
    "coverageReporters": [
        "text-summary"
    ],
    "moduleFileExtensions": [
        "ts",
        "tsx",
        "js",
        "json"
    ],
    "setupTestFrameworkScriptFile": "<rootDir>/jestEnvironment.js",
    "transform": {
        "\\.(ts|tsx)$": "<rootDir>/scripts/preprocessTypescript.js",
        "\\.(less|css|svg|gif|png|jpg|jpeg)$": "<rootDir>/scripts/preprocessText.js"
    },
    "testRegex": "(Spec|.spec).tsx?$"
}

jest 21.0.1
node 6.9.2
yarn 0.27.x/1.0.0
OS Windows

Help Wanted Windows

Source

bookman25

👍44

Most helpful comment

just to chip in -- I'm seeing this with jest 23.6 on a windows Jenkins CI server.

--runInBand does work, but doubles testing time, so it's not great, and since we have tests run before push, I can't enable this without making my team members super-sad
the graceful-fs override in package.json, as mentioned by @jsheetzati (https://github.com/facebook/jest/issues/4444#issuecomment-370533355) works, but it's a bit of a hack.

Since graceful-fs isn't doing much about this (https://github.com/isaacs/node-graceful-fs/pull/131 hasn't seen action since July last year), perhaps it's time to fork? I've added a nag comment there, but I'm not expecting that to make anyone suddenly jump to sorting this out )':

fluffynuts on 23 Jan 2019

👍9

All 79 comments

Isn't this related? https://github.com/facebook/jest/pull/4432

thymikee on 11 Sep 2017

I don't believe so. I believe the case we see in our repo is the exact same file getting mocked for 2 different processes (while running in parallel) which causes the cache write operation to fail because one process has the file locked. That ticket looks like is more about different files with the same contents. We don't have any of those cases in the repositories we host that we ran into this issue.

bookman25 on 12 Sep 2017

We basically run into the same issue with our tests. One easy way to reproduce was to remove jest cacheDirectory to force cache generation on the next run.
``` Test suite failed to run

jest: failed to cache transform results in:

C:/myniceproject/src/jest-cache/jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b/3f/settingsProvider_3f1439e55275a95ecfdb7dcb432f7958
Failure message: EPERM: operation not permitted, rename
'C:\myniceproject\srcjest-cachejest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958.1630729137'
->
'C:\myniceproject\srcjest-cachejest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958'`

cristiango on 21 Sep 2017

👍1

Having the same issue and can't find a way around it. Jest is basically unusable for us like this.

DavidKuennen on 25 Sep 2017

👍5

We are trying to update to 21.2.0 from 20.0.4 and now we have the following error on our build servers:

Test suite failed to run
[13:46:50]
[13:46:50]    jest: failed to cache transform results in: C:/TeamCity/buildAgent/temp/buildTmp/jest/jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745/3b/fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770
[13:46:50]    Failure message: EPERM: operation not permitted, rename '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770.1701848979' -> '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770'
[13:46:50]      
[13:46:50]      at Object.fs.renameSync (fs.js:772:18)
[13:46:50]      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

vkrol on 27 Sep 2017

👍5

I'm now having the same issue tests are broken randomly.

apphelpyulongs on 27 Sep 2017

If I run the tests with the --runInBand flag then as expected everything is OK.

vkrol on 29 Sep 2017

I can see the same issue fairly consistently:

  ● Test suite failed to run

    jest: failed to cache transform results in: .../jest-transform-cache-...
    Failure message: EPERM: operation not permitted, rename '...' -> '...'
        at Error (native)

      at Object.fs.renameSync (fs.js:810:18)
      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

jest 21.2.1
node 6.11.1
OS Windows

--no-cache does not help and jest-transform-cache is still being written. The only thing that helps is --runInBand, which is hardly acceptable for large projects.

Anything we can do to help diagnose the issue? Should I create a repro case?

Is this error critical? Can it be treated as a warning rather than taking down the whole test suite? Is there a way to back off and retry?

asapach on 7 Oct 2017

Having a small repro would be great

SimenB on 7 Oct 2017

Here's the repro: https://github.com/asapach/jest-cache-issue/
It effectively runs lodash-es through babel-jest to populate the transform cache.
This fails for me 80% of the time on two different machines (Win8.1 and Win10).
If you remove --no-cache it fails 100% of the time. Adding --runInBand brings it down to 0%.

(Out of curiosity tried running it in WSL on Win10 and the issue is not reproducible using Posix API)

asapach on 8 Oct 2017

Is this just happening on Windows? I don't have access to windows machines beyond virtual machines, so not the easiest to debug for me...

@jeanlauliac you added write-file-atomic in #4088, would you be able to help out?

SimenB on 8 Oct 2017

This problem is very similar to https://github.com/npm/write-file-atomic/issues/10 and https://github.com/npm/write-file-atomic/pull/22.

vkrol on 8 Oct 2017

👍1

Just ran a procmon trace, here's an example of the issue:

As you can see two processes are trying rename the same file within 1ms of each other and the second one fails.

I think npm/write-file-atomic#22 addresses the async version of writeFile(), but writeFileSync() is still affected.

asapach on 8 Oct 2017

Would it be possible to create a repro showing that just using write-file-atomic in worker-farm against the same file fails somehow? Would be great to open an issue against that project, as I think that's where the fix should be.

Or if you could write a test within jest that shows the same error (we have appveyor CI) that could be a start as well?

I'm not even sure what behavior we want in case of this error. Retry the write? Rerun the test? The whole test file?

SimenB on 8 Oct 2017

OK, I'll try to create another repro. Not sure it's possible to create a jest test, because it would require spawning multiple processes, disabling/cleaning the cache and keep running until it fails.

I'm not even sure what behavior we want in case of this error.

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated.
Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()? That way write-file-atomic should retry automatically (I'll create a test to confirm).

asapach on 8 Oct 2017

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated.

That's a good point, and should be fixed separately. That way --no-cache can at least be a workaround.

Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()?

@cpojer thoughts on making it not be sync? Not sure how that scales. Or if you have another idea on how to fix this

SimenB on 8 Oct 2017

--no-cache is more like --reset-cache actually. It means it won't use the existing cache, but it will still write cache. I'd like to retain that.
These operations have to be sync, because they happen during require calls in user code, so we can't change that.

cpojer on 8 Oct 2017

😕1

Here's the other repro with worker-farm and write-file-atomic: https://github.com/asapach/write-atomic-issue

Findings so far: the sync version fails as expected, but surprisingly the async version fails as well. This means that they probably implement a retry queue only when it runs in the same process, which doesn't help in our case.

asapach on 8 Oct 2017

I'd like to retain that.

New flag? It's a highly misleading name. And on e.g. CI you rarely want the cache anyways, so it's just wasted resources. Or is a cache generated within a single test run used during --no-cache, and it only ignores existent caches?

Here's the other repro with worker-farm and write-file-atomic

Awesome! Could you open up an issue against write-file-atomic? It feels like a fix should go there, and if not (they don't want to support multiple processes writing at once) we can revisit on our side. WDYT?

SimenB on 8 Oct 2017

👍1

A patch I tried locally that seemed to work is ignoring the error if it comes from trying to rename to a file with the same content. Since it just means another process 'won' the race, we can ignore it and move on.

const cacheWriteErrorSafeToIgnore = (
  e: Error,
  cachePath: Path,
  fileData: string,
) => {
  if (
    e.message &&
    e.message.indexOf('EPERM: operation not permitted, rename') > -1
  ) {
    try {
      const currentContent = fs.readFileSync(cachePath, 'utf8');
      return fileData === currentContent;
    } catch (e) {
    }
  }
  return false;
};

jwbay on 8 Oct 2017

👍2

@SimenB, sure, I'll file an issue.

@cpojer, can this error be swallowed/ignored and treated as a warning? It implies that the file has already been written and no data should be lost.

asapach on 8 Oct 2017

👍1

Upstream issue: npm/write-file-atomic#28

asapach on 8 Oct 2017

👍1

I think this means "rename" is not an atomic operation on Windows, so it breaks the assumption made by write-file-atomic. Unless there's a flag that could be enabled at the Windows API level, this could mean it's impossible to have atomic writes/renames on Windows altogether.

@jwbay your solution looks reasonable to me! 👍 Instead of using indexOf however, I'd use e.code === 'EPERM' (more robust, doesn't depend on specific message). I don't think we should read the file again to check the value, because this could introduce additional concurrency issues (ex. if the file is being written by yet another process at the same time). Would you mind sending a PR, please?

jeanlauliac on 9 Oct 2017

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on). But if we're happy to handle this at the Jest level, I won't hurry. cc @jeanlauliac

davidjgoss on 9 Oct 2017

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on).

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

To fix concurrency issues once and for all, we may have to reconsider how we do caching, for example have a single process that access the cache, with which we communicate over some kind of IPC. Existing key/value store systems may be handy, such as memcached.

jeanlauliac on 9 Oct 2017

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

Ah, I perhaps misunderstood the issue then. The way I read it, the library already has a queuing mechanism that works nicely for the async requests, but if you mix in sync requests _as well_ you can get collisions.

davidjgoss on 9 Oct 2017

My above referenced pull request should solve this issue. At least it did it for me!

mekwall on 10 Oct 2017

@mekwall, I think they are using rename() in the async version of writeFile(), and it still fails in my test: https://github.com/asapach/write-atomic-issue. Could you please try running my repro? I think your change might minimize likelihood of this problem happening, but does not eliminate it completely.

asapach on 10 Oct 2017

@asapach Did you try with my changes? Because I tried several times, and I never got EPERM: operation not permitted, rename with my changes while getting it every time without.

mekwall on 10 Oct 2017

@mekwall, yep, still failing with your changes (although async-ly). (Corrected below.)

asapach on 10 Oct 2017

Or rather technically it doesn't fail (because the sync flow is not interrupted), but the console is still littered with EPERM errors.

asapach on 10 Oct 2017

@asapach I found the issue you're having. It's in the graceful-fs polyfill. I've posted a fix in this PR: https://github.com/isaacs/node-graceful-fs/pull/119

mekwall on 10 Oct 2017

@mekwall, yes this does seem to address the issue - no more errors in both sync and async versions.
The problem now is that temp files are not removed, because fs.unlinkSync(tmpfile) is never called: https://github.com/npm/write-file-atomic/pull/29/files#diff-168726dbe96b3ce427e7fedce31bb0bcR198

asapach on 10 Oct 2017

@asapach I added unlink to graceful-fs rename, but I'm not sure if that's the right way to go. Afaik fs.rename uses the MoveFile function and that should not copy the source to the destination. The source should just change name and the source and destination should never exist at the same time.

mekwall on 10 Oct 2017

@mekwall, that does help a bit, but in some cases if the worker is terminated early (because all the work is done), some files are not cleaned up, since it doesn't wait for the cleanup. The async version seems to work fine.

asapach on 10 Oct 2017

@asapach It's not working as expected at all. I need to dive into the innards of node to figure out how it's actually working and what the intended behavior should be. I believe the whole point with graceful-fs is to have it work the same on every platform, so I'll dig deeper into that. At least we've found the culprit :)

mekwall on 10 Oct 2017

@asapach I realized that my PR for write-file-atomic wouldn't work, so I took another approach by adding fs.renameSync in graceful-fs with the same workarounds as fs.rename but blocking. This makes your test work just as expected!

mekwall on 11 Oct 2017

@mekwall, Thanks, I've verified your changes on both of my repro cases and none of them fail.
I think on the downside I see increased CPU and disk usage for sync, but it's probably expected.

asapach on 11 Oct 2017

Thanks a lot folks for picking this up and helping to get it fixed. Much appreciated! ❤️ Hopefully the fix in graceful-fs is the correct one, and it gets released.

SimenB on 11 Oct 2017

@SimenB You're welcome! We're pained by this at work so I got some time to investigate this by my team. The changes will affect a lot of packages so it will most likely take time for them to accept it :/

mekwall on 11 Oct 2017

Any idea when this workaround will make it to a release?

manu-st on 20 Oct 2017

👍9

@cpojer could you provide some more info why it's closed? is there a fix provided? We still have this issue

samvloeberghs on 24 Nov 2017

👍2

Apologies, seems like the fix has not landed in graceful-fs yet :(

cpojer on 24 Nov 2017

👍4

Can multiple people confirm that using https://github.com/isaacs/node-graceful-fs/pull/119 fixes their issues?

You can use the fork by using yarn resolutions, see https://yarnpkg.com/en/docs/selective-version-resolutions, which should allow you to deploy the fix to CI etc.

e.g.

{
  "resolutions": {
    "graceful-fs": "mekwall/node-graceful-fs#a10aa576f771d7cf3dfaee523f2e02d6eb11a89f"
  }
}

SimenB on 14 Dec 2017

👍4 🎉1

@SimenB It solves the issue for me, at least 😄

mekwall on 15 Dec 2017

+1 For me as well.

asapach on 15 Dec 2017

@SimenB Also fixed my issue and I'm now able to use jest 22 on windows. (We were stuck on 20 before this).

Edit: Actually, it worked on my dev laptop, but did not work on the build server. It's running yarn 1.2.1 though, maybe that's why?

[16:47:55][Step 5/8]     jest: failed to read cache file: D:/TeamCity/buildAgent2/temp/buildTmp/jest/jest-transform-cache-c39254d365b4bcb2c90f133d4b359d91-56a1804d8d831b3401a35a7e81857f3b/7e/rafPolyfill_7e7a83ed3f2680ba9aec0e45f59ade5d
[16:47:55][Step 5/8]     Failure message: EPERM: operation not permitted, open 'D:\TeamCity\buildAgent2\temp\buildTmp\jest\jest-transform-cache-c39254d365b4bcb2c90f133d4b359d91-56a1804d8d831b3401a35a7e81857f3b\7e\rafPolyfill_7e7a83ed3f2680ba9aec0e45f59ade5d'
[16:47:55][Step 5/8]       
[16:47:55][Step 5/8]       at readCacheFile (node_modules/jest-runtime/build/script_transformer.js:465:60)

mreishus on 19 Dec 2017

Yarn 1.0.0 should be enough, worth a try upgrading, though

SimenB on 20 Dec 2017

Just tried to put the resolution in but it is still failing for me. However I have both ENOENT and EPERM violations:

    jest: failed to read cache file: C:/Users/dev/AppData/Local/Temp/jest/jest-transform-cache-857f905b2da01d52a9d1d17b6772ea4a-3a91587e29d4fef23c6e0cf16b2f5679/7d/index_7d0afc82f0b29ec31c4b5f296cbdee74
    Failure message: ENOENT: no such file or directory, open 'C:\Users\dev\AppData\Local\Temp\jest\jest-transform-cache-857f905b2da01d52a9d1d17b6772ea4a-3a91587e29d4fef23c6e0cf16b2f5679\7d\index_7d0afc82f0b29ec31c4b5f296cbdee74'

      at Object.fs.openSync (../fs.js:653:18)
      at Object.fs.readFileSync (../fs.js:554:33)

and

    jest: failed to read cache file: C:/Users/dev/AppData/Local/Temp/jest/jest-transform-cache-857f905b2da01d52a9d1d17b6772ea4a-3a91587e29d4fef23c6e0cf16b2f5679/c4/std_pb_c403e6e7645c904896b66f44a3e43606
    Failure message: EPERM: operation not permitted, open 'C:\Users\dev\AppData\Local\Temp\jest\jest-transform-cache-857f905b2da01d52a9d1d17b6772ea4a-3a91587e29d4fef23c6e0cf16b2f5679\c4\std_pb_c403e6e7645c904896b66f44a3e43606'

      at Object.fs.openSync (../fs.js:653:18)
      at Object.fs.readFileSync (../fs.js:554:33)

manu-st on 2 Jan 2018

@mreishus Does your build server run Windows? Because the fixes in graceful-fs will only target Windows, but it shouldn't happen on a Linux-based OS.

mekwall on 10 Jan 2018

@mekwall yes, windows - but it's windows server 2012 R2

mreishus on 11 Jan 2018

This is a major issue for me and nothing has happened with graceful-fs since November 2016 of what I can see. So I'm getting quite pessimistic that the fix @mekwall provided won't be merged any time soon. Is there any temporarily solution we can use other than the -i flag and the resolution workaround?

frenic on 2 Feb 2018

Does --runInBand not work for you @frenic?

cameronlowry on 5 Feb 2018

That's the same as -i and yes, it works. But unfortunately it's not sustainable in the long run for larger projects.

frenic on 5 Feb 2018

I guess we could fork and publish our own, but it doesn't seem like the fix works for everybody

SimenB on 5 Feb 2018

I'm up with the same situation, but in my case, --runInBand doesn't work.

unsidez on 5 Feb 2018

I've checked the graceful-fs override with the latest version of Jest and unfortunately it no longer seems to work as reliably since I last tested it. There's still non-zero chance that it runs into a race condition on large test suites.

asapach on 8 Feb 2018

😕1

After scrolling through this thread, I found a resolution using yarn. Is there a resolution using npm instead?

janacm on 28 Feb 2018

👍1

We have had pretty good luck so far just adding the patched version of graceful-fs to our package.json. Works for us with npm and yarn.

"graceful-fs": "https://github.com/mekwall/node-graceful-fs.git#patch-1",

jsheetzati on 5 Mar 2018

👍2 🎉1

Hi,

For some reason we only get this error when running from Jenkins, not when run locally (even on same machine/folder etc.)
@jsheetzati 's solution is working for us too (using npm), but it is a patch after all. Is there an ETA for resolving this permanently?

Thanks,
Mor

pyrocks on 25 Mar 2018

We also have this issue when running jest from Jenkins. --runInBand option helps to avoid failure during single job execution but jest still fails when running several builds in parallel.
As a workaround we use lockable resources plugin to make sure that only one jest process is executed at the same time keeping the --runInBand option.
Hope this comment will be useful for someone.

nyrkovalex on 25 Mar 2018

@nyrkovalex What we do to avoid the issue you're describing is use Jest's cache directory option to make sure the cache is not shared across workspaces.

We do so by publishing a Jest preset that sets cacheDirectory: '<rootDir>/.jest-cache' and making sure that all packages use it. If doing so make sure to add .jest-cache to .gitignore.

Before adding that option we would run into several issues as a result of having a global Jest cache shared across 16 executors per Jenkins agent. Using lockable resources will also prevent the issues as you mentioned but is wasteful as you are not using your Jenkins agent to its potential (as the Jest tests become a bottleneck).

anescobar1991 on 26 Mar 2018

👍7

@anescobar1991 That option is definitely a better solution, we'll consider using it.
Thanks for a tip!

nyrkovalex on 27 Mar 2018

Hi,

we use gradle to run npm (don't ask why :) ) and the combination of that with Jenkins is a killer.
We tried:

setting the cache to be on the local directory instead of global cache
using --runInBand
only a single job running on the agent- no parallel jobs
running gradle test --max-workers 1 (and not using --parallel)

All fail with the same error.
The only solution that works for us is the one by @jsheetzati - I wish this would get formally fixed.

pyrocks on 28 Mar 2018

We could probably fork and publish with that fix

SimenB on 28 Mar 2018

👍2

that would be awesome...

pyrocks on 28 Mar 2018

👍1

I have this problem a lot and the patch for graceful fs worked for me. So i would appreciate this fix.

SamuelBrucksch on 3 May 2018

As a workaround to fiddling with graceful-fs, couldn't you simply give each worker process / thread it's own cache directory to avoid the race condition?

It's probably slow, but we have to use --runInBand on our CI server and that is way worse.

If someone can point me at the right files to look at, I might even try to write a patch. I have a really hard time navigating the source of jest.

StringEpsilon on 29 May 2018

👍1

I'm not sure what it is but it has been a few weeks possibly a couple of months and I haven't observed the failure anymore. We have been using jest 22.4.2 for a while and upgraded to 22.4.4 recently. We've also updated various other packages.

manu-st on 8 Jun 2018

👍1

just to chip in -- I'm seeing this with jest 23.6 on a windows Jenkins CI server.

--runInBand does work, but doubles testing time, so it's not great, and since we have tests run before push, I can't enable this without making my team members super-sad
the graceful-fs override in package.json, as mentioned by @jsheetzati (https://github.com/facebook/jest/issues/4444#issuecomment-370533355) works, but it's a bit of a hack.

fluffynuts on 23 Jan 2019

👍9

I'm having the same issue but the error message is different
Failure message: EBADF: bad file descriptor, close

jest: failed to cache transform results in: C:/agent/_work/_temp/jest/jest-transform-cache-2a12bf9796741cb06fb97a276aa09712-7d718cda2d798ae78eb741b3feff799e/7b/test-setup_7bdb1937d8dbbf5088142bc21e74a530
2019-01-24T13:44:55.5496091Z     Failure message: EBADF: bad file descriptor, close

It seems that running jest with --runInBand does not solve the problem the first time, but only after another run it executes without errors.

Running on Windows 10 Enterprise VM as part of a TFS Build.

esankin on 24 Jan 2019

👍6

@EthanSankin can you test with the linked graceful-fs PR as well?

SimenB on 24 Jan 2019

I'm working on a replacement for graceful-fs that should solve these issues. It's currently in alpha but it would be great to get some early adopters: https://github.com/mekwall/normalized-fs

mekwall on 24 Jan 2019

🎉1

reverting to older version of write-file-atomic solved the the issue.