Ninja: Option to use file characteristics instead of timestamps

Created on 14 Aug 2018  ·  15 Comments  ·  Source: ninja-build/ninja

As noted in other issues, using timestamps to determine whether to rebuild something can be problematic. While timestamps are convenient and relatively fast, it is often desirable to key build decisions on some characteristic intrinsic to the file itself, such as a hash of its contents.

I use git a lot and it is annoying that simply changing branches triggers rebuilds. Ideally, I would be able to switch from my current work branch to some other branch, not touch any files and then switch back to the original branch and not have to rebuild anything at all. As far as I know that is not possible if the build system uses timestamps. Using file hashes would solve this particular issue.

I understand that using file hashes or other such intrinsic file characteristics could slow down Ninja, so using them should be an option.
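To make the request concrete, here is a minimal sketch (not Ninja's actual implementation; all names are illustrative) of keying the dirty check on a content hash instead of a timestamp:

```python
# Sketch (hypothetical, not Ninja's code): decide whether a target is
# dirty by comparing each input's content hash against the hash recorded
# after the last successful build.
import hashlib


def file_hash(path, chunk_size=1 << 16):
    """Hash a file's contents in chunks to avoid loading it whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def is_dirty(inputs, hash_log):
    """hash_log maps input paths to their last recorded content hash."""
    return any(file_hash(p) != hash_log.get(p) for p in inputs)
```

Under this scheme, a `git checkout` that rewrites a file with identical bytes leaves its hash unchanged, so no rebuild is triggered.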

feature

Most helpful comment

I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. hash_input = 1 on selected rules is very convenient. My branch still contains a bug where files are stat()ed too often, O(n^2) instead of O(n). The bug is related to phony edges.

One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.

Another thought is to make hashing a first-class member of Ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?

Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.

All 15 comments

#929 has an implementation. Even though it is successfully used (in a fork) for thousands of builds daily, it was not considered for merging.

#929 is single-threaded and can therefore be slow, much like ccache or other such solutions. Also, I think this should be a command-line flag instead, so that it doesn't require changes to the build definitions.

It can't (or at least should not) be a command-line flag, as hashing would then apply to all rules. Hashing e.g. all inputs of link rules is expensive and therefore not desirable, while source files and their known dependencies are good candidates. For that distinction, it has to be part of the build description. In addition, to make use of the feature you would have to use the flag consistently, every time, not just sometimes.

Responding to the single-threaded argument: yes, it adds instructions to the single-threaded loop. In practice that only matters if the single thread gets more work than it can process (i.e. more rules finish than the one thread can handle (depslog + hashlog + ...)). Only then does hashing hurt; otherwise the single-threaded loop is waiting for jobs to finish anyway. We never saw a busy Ninja with hashing, even in -j1000 experiments. (And for fast-finishing rules, hashing is not an interesting time saver anyway.)

Also consider: hashing with MurmurHash is quite fast, and even large source files take only a few milliseconds to hash. In addition, the hashing happens right after the source file (and its dependencies) have been read by the compiler, so they are usually served from the filesystem cache.
As hashing happens during the build (in parallel with executing rules), the overall build time is usually not measurably affected.

Lastly, the implementation in #929 is opt-in and comes at no cost for people not using the feature (besides the if-statement).

Hashing e.g. all inputs of link rules is expensive and therefore not desirable, while source files and their known dependencies are good candidates.

I would say that hashing the inputs to the linker is especially desirable, as it can often allow the link to be skipped completely (e.g. for formatting changes or when comments change). As object files finish compiling one by one, the hash calculation can happen while the build is running (as you pointed out).

If it's really too slow (e.g. with big static libraries), we could think about implementing hashes only for pure inputs and not intermediate files. That would at least solve the "switching Git branches causes full rebuilds" case.

In addition, to make use of the feature you would have to use the flag consistently, every time, not just sometimes.

I would say that is an advantage: if I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.

In addition, to make use of the feature you would have to use the flag consistently, every time, not just sometimes.

I would say that is an advantage: if I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.

But then there would need to be a way to switch from non-hashing to hashing, which means the current state of the files would need to be hashed so that the next rebuild can use the hashes (which probably don't exist or are out of date if you didn't pass the flag).

I would guess that hashing would still use the timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
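The timestamp-first scheme proposed here can be sketched as follows (hypothetical code, not Ninja's actual logic): trust a matching mtime, and only fall back to the content hash when the mtime has changed.

```python
# Sketch of the timestamp-first scheme (hypothetical, not Ninja's code):
# a matching mtime is trusted as "unchanged"; only a differing mtime
# triggers the hash comparison.
import hashlib
import os


def needs_rebuild(path, recorded_mtime_ns, recorded_hash):
    mtime = os.stat(path).st_mtime_ns
    if mtime == recorded_mtime_ns:
        return False  # timestamps match: assume unchanged
    with open(path, "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()
    return current != recorded_hash  # touched but identical -> clean
```

Note that the replies further down argue against the early-out on matching timestamps, since a file can be modified and have its timestamp forcibly reset.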

I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. hash_input = 1 on selected rules is very convenient. My branch still contains a bug where files are stat()ed too often, O(n^2) instead of O(n). The bug is related to phony edges.
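As I understand the fork, hash_input is a per-rule variable in the build file; a hypothetical fragment (rule names and commands are illustrative, only the hash_input variable comes from the fork):

```
rule cxx
  command = g++ -MMD -MF $out.d -c $in -o $out
  depfile = $out.d
  hash_input = 1

rule link
  command = g++ $in -o $out
  # no hash_input: link steps keep the timestamp behavior
```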

One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.
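The recursive traversal through phony edges can be sketched like this (hypothetical data model, not Ninja's internals): a phony edge produces no real file, so its inputs must be collected transitively before hashing.

```python
# Sketch: resolve the real files behind phony edges. phony_edges maps a
# phony target name to its list of inputs; anything not in the map is a
# real file. The `seen` set visits each node once, keeping the walk O(n)
# rather than the O(n^2) re-stat bug mentioned above.
def real_inputs(node, phony_edges, seen=None):
    if seen is None:
        seen = set()
    if node in seen:
        return []
    seen.add(node)
    if node not in phony_edges:
        return [node]  # a real file: this is what gets hashed
    files = []
    for dep in phony_edges[node]:
        files.extend(real_inputs(dep, phony_edges, seen))
    return files
```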

Another thought is to make hashing a first-class member of Ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?

Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.

Meanwhile I have concocted a solution for my use case of switching branches in Chromium (which is particularly painful); the small Go program and script I use is here: https://github.com/bromite/mtool

Feel free to adapt it to your use cases; if it works for Chromium, I bet it will work for smaller build projects as well (and it takes negligible time in my runs). The only downside is that, if you use the script I published there as-is, it will litter .mtool files around each git repository parent directory, but nothing a global gitignore cannot cure.

Interesting to note that I use the git ls-files --stage output for the hashing needs; it is possible (but less efficient) to ask git to also hash non-indexed files if your build depends on those.
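git ls-files --stage is attractive here because Git has already hashed every indexed file; each output line has the form "&lt;mode&gt; &lt;blob-sha&gt; &lt;stage&gt;\t&lt;path&gt;". A sketch of extracting per-file hashes from that output (the sample text is hard-coded; in practice it would come from running git via subprocess):

```python
# Sketch: parse `git ls-files --stage` output into a {path: blob_sha}
# map, reusing the content hashes Git already maintains in its index.
def parse_ls_files_stage(text):
    hashes = {}
    for line in text.splitlines():
        meta, path = line.split("\t", 1)
        mode, sha, stage = meta.split()
        hashes[path] = sha
    return hashes
```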

Feature-wise, one would expect that to implement the feature discussed here, ninja could do the same internally (without relying on git) and with similar performance.

After a quick look, those ninja patches don't seem to hash the compiler command line in addition to the input files. Am I missing anything?

The command line is already hashed by ninja and stored in the build log.
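The idea behind the build log's command hashing can be sketched as follows (hashlib is used here as a stand-in; the thread mentions MurmurHash, and this is not Ninja's actual code): an output is rebuilt when the command that would produce it no longer matches the hash recorded for it.

```python
# Sketch (hypothetical, not Ninja's implementation): detect command-line
# changes by comparing a hash of the command against the one recorded in
# a per-output build log.
import hashlib


def hash_command(command):
    return hashlib.sha256(command.encode()).hexdigest()


def command_changed(output, command, build_log):
    """build_log maps output paths to the hash of the command that last
    produced them; a missing entry counts as changed."""
    return build_log.get(output) != hash_command(command)
```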

I would guess that hashing would still use the timestamp first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).

That would be very wrong. Comparing hashes can be elided if the timestamps don't match: the build system can assume that the dependency was modified, and if it wasn't really modified (only touched) then the build will be suboptimal but correct. However, if the timestamps match, it is still possible that the dependency was modified and its timestamp was forcibly reset (EDIT: or that the target was touched and thus made newer than its dependencies). Without double-checking by comparing hashes, this would result in an incorrect build.
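The hazard is easy to demonstrate (illustrative script, not tied to any build system): modify a file and forcibly reset its mtime; a timestamp check then sees it as clean, while a content hash still detects the change.

```python
# Demonstration: a modified file whose timestamp was forcibly reset
# looks clean to a timestamp comparison but dirty to a hash comparison.
import hashlib
import os
import tempfile


def sha(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


d = tempfile.mkdtemp()
src = os.path.join(d, "a.c")
with open(src, "w") as f:
    f.write("int answer = 41;")
st = os.stat(src)
recorded = (st.st_mtime_ns, sha(src))

# Modify the file, then forcibly reset its timestamp (as e.g. a tool
# preserving mtimes might do).
with open(src, "w") as f:
    f.write("int answer = 42;")
os.utime(src, ns=(st.st_atime_ns, st.st_mtime_ns))

timestamp_says_dirty = os.stat(src).st_mtime_ns != recorded[0]  # False
hash_says_dirty = sha(src) != recorded[1]                       # True
```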

I guess Ninja already assumes cleanliness and skips all the work when timestamps match. So in your example this would be an incorrect build anyway.

I'm not aware (probably due to my ignorance) of any tool that would produce different output while keeping the previous timestamp. Why would one do that?

IMHO, skipping the hash check when the timestamps match is a very valid optimization.

A hash check would simply avoid some "false-dirty" rebuilds, while keeping the existing semantics already provided by Ninja.

The problem is not false-dirty rebuilds, it's false-clean non-rebuilds. A git checkout touches everything it overwrites. It can make a target newer than a dependency (yes, people commit generated code for various valid reasons). A hash check would prevent a false-clean non-rebuild in this case.

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other hash-based build systems are really against in-tree target outputs.

Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?

@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other hash-based build systems are really against in-tree target outputs.

How does hash depend on where the file is?

Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?

I understand that the main benefit of using timestamps is avoiding the need to maintain a separate database that keeps track of "previously known" version signatures. If you are willing to forgo that benefit, why shouldn't those signatures be hashes?
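The "separate database of previously known signatures" can be sketched like this (hypothetical code; file format and names are illustrative): record a signature per path after each successful build, and nothing about the scheme requires the signature to be a timestamp rather than a hash.

```python
# Sketch: a persistent signature database recorded after each build.
# Here the signature is a content hash, but the mechanism is the same
# for any per-file signature.
import hashlib
import json
import os


def signature(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def save_db(db, log_path):
    with open(log_path, "w") as f:
        json.dump(db, f)


def load_db(log_path):
    if not os.path.exists(log_path):
        return {}
    with open(log_path) as f:
        return json.load(f)


def changed_since_last_build(paths, db):
    """Return the paths whose current signature differs from the
    recorded one (missing entries count as changed)."""
    return [p for p in paths if db.get(p) != signature(p)]
```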
