Restic: Repository format v2

Created on 19 Sep 2016  ·  51 Comments  ·  Source: restic/restic

I'd like to start the discussion on changing the repository format to version 2. This is needed in order to support compression (see #21).

The following list will be updated when new proposals come in.

Accepted:

  • Pack files: Move the header to the start of the file. At the moment, the header is at the end. I thought that it'd be nice to just write the file and when that is done write the header. However it turned out that in order to be able to retry failed backend requests, we need to buffer the file locally anyway. So we can write the content (blobs) to a tempfile, and then write the header when uploading the pack file to the backend. This allows reading the header more easily, since we don't need to start from the end of the file.
  • Pack files: At the moment the pack file header is a custom binary structure (see the design document). This is inflexible, requires a custom parser, and does not allow extension without changing the repository format. I'd like to rebuild the pack header as a JSON data structure, similar to the way tree objects are stored in the repo. This allows extension without having to change the underlying data format (see the sketch after this list).
  • Pack files/Index: When the pack header is changed, add support for compression (algorithm, compressed/uncompressed length). Also add the compressed/uncompressed size to the index files.
  • Snapshot files: Allow packed snapshots so that having a lot of snapshots becomes usable (cf #523)
  • Add a README file into new repositories which describes what this directory contains.
  • Remove username and hostname from key files (#2128)
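
As a sketch of the second and third accepted items, the JSON pack header could look roughly like this (field names and values are illustrative only, not a finalized format; the compressed/uncompressed lengths would be mirrored in the index files):

{
  "blobs": [
    {
      "id": "3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
      "type": "data",
      "offset": 0,
      "length": 2301,
      "compression": "zstd",
      "uncompressed_length": 4096
    },
    {
      "id": "9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
      "type": "tree",
      "offset": 2301,
      "length": 741,
      "compression": "none",
      "uncompressed_length": 741
    }
  ]
}

Being JSON, further attributes (e.g. the ID of the key used for encryption, see the indirection item below) could be added later without changing the container format.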

To be discussed:

  • Is there a way to add error-correcting codes to the files? Other ideas to recover from data errors?
  • Change the Index format to improve memory usage
  • Add an encryption indirection: Write down in the header which key is used for authentication/encryption of each blob (so we can implement #187 more easily later on)

Postponed/rejected:

  • Switch to a faster hash function (SHA3/Keccak/Blake2 instead of SHA256)
  • Support asymmetric crypto

Anything else?

Labels: project, repo v2, discussion


All 51 comments

I'm not sure about moving the header to the front. I know this is not currently implemented, but for a local repository, having the header on the end means that we can save a file copy.

Interesting point, thanks. I'm not sure yet how to judge what's better. For remote backends we could also then (after a few changes to the backend interface) just pass an io.Reader and then maybe the stdlib can use sendfile to stream the file directly from disk. hm.

Just FYI I have been wondering why you don't use GCM, so I ran the benchmarks. AES-CTR + Poly1305 is pretty fast if the CPU doesn't have AES-NI (50% faster than Go built-in GCM). With AES-NI, Go's optimized assembly code for GCM is probably unbeatable.

Intel Xeon E312xx

restic:
BenchmarkEncrypt-4        50      32470322 ns/op     258.35 MB/s

stupidgcm:
Benchmark4kEncStupidGCM-4     200000         10620 ns/op     385.67 MB/s
Benchmark4kEncGoGCM-4         300000          5540 ns/op     739.22 MB/s

Intel Pentium G630 (no AES-NI)

restic:
BenchmarkEncrypt-2            10     108468078 ns/op      77.34 MB/s

stupidgcm:
Benchmark4kEncStupidGCM-2          50000         24182 ns/op     169.38 MB/s
Benchmark4kEncGoGCM-2              20000         96391 ns/op      42.49 MB/s

This does not belong in this issue, but I'll answer anyway:

I think at the time when I started restic, Go did not have an optimized version of GCM. In addition, I didn't feel comfortable using GCM because I did not understand it, whereas the Poly1305 paper was much easier to read and understand.

I think your benchmark processes much smaller blobs of data, maybe it'll get closer when the blobs are larger.

I see. Yeah the optimized GCM is quite recent, I think Cloudflare donated it for Go 1.5.

Regarding block size, the restic benchmark uses 8 MiB while stupidgcm uses 4 KiB. I retried with an 8 MiB block size for stupidgcm but the results are virtually identical.

So let's not waste time on this, I think CTR+Poly1305 is fast enough.

Is it important to have uncompressed size in index file or pack footer? I think it would be OK to know it only within the blob, then, fewer changes are needed in restic. Does it enable any new features to have it known in this additional place?

In my opinion, repository format 2 == first byte of blob data indicates compression format, it's all that's needed. Maybe one of 255 possible formats could be {64-bits uncompressed length}{compressed data}.

I think error correction is a good idea for backup. But I think it's a responsibility of the filesystem. Do you also want to implement RAID inside restic?

Is it important to have uncompressed size in index file or pack footer?

Yes: The pack header describes what's in the pack and this tells the extraction process what to expect (in terms of compression algorithm, uncompressed size, and later also other attributes like the key that was used for the encryption). The same needs to be represented in the index, which was introduced so that restic doesn't need to look up every blob in a pack header. So the same information needs to be present there.

In my opinion, repository format 2 == first byte of blob data indicates compression format, it's all that's needed. Maybe one of 255 possible formats could be {64-bits uncompressed length}{compressed data}.

I don't like this idea, it makes the file format more complicated: We'll have control information in two different places: at the start of a blob and in the header. The header is precisely the location that contains control information.

I think error correction is a good idea for backup. But I think it's a responsibility of the filesystem.

In principle I agree, but filesystems are very complicated things, and error propagation (e.g. of read/write errors of the medium) is often sub-optimal. For highly reduced (in terms of redundancy, e.g. deduplicated) backup data I still think it's a good idea to add (or offer to add) another layer of error correction.

For Reed-Solomon codes, there is a pure Go implementation at https://github.com/klauspost/reedsolomon with some performance data.
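
To give a feel for how that library is used, here is a minimal sketch (shard counts and the simulated damage are arbitrary; this is not a proposal for restic's on-disk layout):

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 10 data shards + 3 parity shards: any 3 shards can be lost or damaged,
	// at the cost of roughly 30% extra space.
	enc, err := reedsolomon.New(10, 3)
	if err != nil {
		log.Fatal(err)
	}

	pack := bytes.Repeat([]byte("pack file contents "), 1000)

	// Split into 10 equally sized data shards (plus room for parity),
	// then compute the parity shards.
	shards, err := enc.Split(pack)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate the loss of one data shard and one parity shard,
	// then reconstruct them from the remaining shards.
	shards[2], shards[11] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	ok, err := enc.Verify(shards)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("all shards consistent after reconstruction:", ok)
}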

According to https://www.usenix.org/legacy/event/fast09/tech/full_papers/plank/plank_html/ ZFEC should be faster. An implementation is available at https://gitlab.com/zfec/go-zfec, which seems to be based on https://pypi.python.org/pypi/zfec.

ECCs are applied after compression and are normally interleaved in the data file, because distributing them makes them more robust if the data is transferred over unreliable or noisy communication channels.

In Usenet binary groups they use separate files (see https://en.wikipedia.org/wiki/Parchive) which contain the ECC information. That would add just another subdirectory to the repository layout, and applying ECC to the repo management information (config, index, ...) would be easy, too. But I'm not sure if doing it that way would weaken the ECC scheme (maybe the robustness against cluster errors within the ECC information decreases).

Thanks for the hints. I've found the PDF version of the paper here: https://www.usenix.org/legacy/event/fast09/tech/full_papers/plank/plank.pdf

The ZFEC Go implementation is only a wrapper around the C library.

For ZFEC there is a Go port with additions (use of goroutines) named jfec at https://github.com/korvus81/jfec.

I've added a "project" (a recent addition to GitHub) to track implementing the new repository format: https://github.com/restic/restic/projects/3

Some ideas that could be looked into when breaking backwards compatibility:

  • Change from sha256 to sha512

Will using sha512 (or sha512/256) instead of sha256 result in increased backup speed? As far as I can see this is true for most platforms except ARM.

Syncthing discussion (https://github.com/syncthing/syncthing/issues/582)

Borg discussion (https://github.com/jborg/attic/issues/209)

Paper on sha512/256 (https://eprint.iacr.org/2010/548.pdf)

  • Using public key encryption instead of a simple password

Currently everybody who has access to write to the repository has access to read from it. Public key encryption would eliminate this and still allow deduplication based on the hashes.

Applying public key encryption to the data blobs would work, but I'm not familiar enough with how restic processes the tree structure to know if it could be implemented successfully for that as well. It could possibly introduce a lot of complexity. If only the data blobs are hidden, there is still a lot of information in the trees.

NaCl - https://godoc.org/golang.org/x/crypto/nacl/box

  • Repository identification

Currently there is no way of knowing you are looking at a restic repository when you stumble upon one. We are currently leaking "created":"TIMESTAMP","username":"XXXXX","hostname":"XXXXX" in the key files. I would suggest hiding this information, and instead including some information about restic, such as "restic repository version X". Might be as simple as a README.

Regarding previous discussions; I am highly in favor of implementing some form of error-correction.

@oysols Thanks for adding your ideas!

I'll add my thoughts below:

Change from sha256 to sha512 (for speed)

At the moment, I'm not concerned with speed (restic is already really fast), so at least for me this item is of low priority. There even exists an optimized version of SHA256 for SIMD-capable processors we can simply switch to. On the other hand, when we decide to speed restic up and the hash is to be discussed, I'd probably prefer Keccak (SHA3) or Blake2, those are (as far as I know, I didn't do any benchmarks yet) much faster.

So, from my point of view this item is postponed for now.

Using public key encryption instead of a simple password

This feature is planned (see #187), but is complicated and requires a lot of thought and several major infrastructure changes. I'd also like to postpone this and rather do smaller incremental updates instead of one where we change everything -> postponed.

Repository identification (add a README file into the repository)

Very good point, we can even add that now without breaking anything.

Repository "information leak" (removing username, hostname and created timestamp from key files)

That's also a good point. We currently only use this information to display it alongside the key ID in the key list command. We can easily drop username and host; the timestamp does not give away much, since in most cases it will be the same as the file creation date.

I'd like to drop username and host and leave the created timestamp in. Thoughts?
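
For reference, the unencrypted part of a key file currently looks roughly like this (values shortened and illustrative, scrypt parameters only exemplary); under the proposal above, the username and hostname fields would simply be dropped:

{
  "created": "2017-01-02T15:04:05.123456789+01:00",
  "username": "jdoe",
  "hostname": "laptop",
  "kdf": "scrypt",
  "N": 65536,
  "r": 8,
  "p": 1,
  "salt": "3LN0...",
  "data": "g7zE..."
}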

I've played around with https://github.com/klauspost/reedsolomon today and I think we can add error-correcting codes rather easily to the end of the pack file (once we move the pack header to the start of the file). There are two downsides, though:

  • The file size will increase by ~14-30%, depending on the parameters we choose for reed-solomon
  • We will need to store checksums (not necessarily cryptographic hashes) of sections of the pack file in the pack file itself; these are needed for reconstruction because the reconstruction algorithm needs to know which parts of the file have been damaged. So that takes a bit longer to compute, although we can choose a fast checksum (like CRC or so); see the sketch below.
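
A minimal sketch of the per-section checksum idea (section size and checksum choice are arbitrary here, not part of any decided format):

package main

import (
	"fmt"
	"hash/crc32"
)

// sectionChecksums splits a pack file into fixed-size sections and computes one
// CRC-32 per section. During reconstruction, sections whose checksum no longer
// matches are the ones that must be treated as damaged.
func sectionChecksums(pack []byte, sectionSize int) []uint32 {
	table := crc32.MakeTable(crc32.Castagnoli)
	var sums []uint32
	for off := 0; off < len(pack); off += sectionSize {
		end := off + sectionSize
		if end > len(pack) {
			end = len(pack)
		}
		sums = append(sums, crc32.Checksum(pack[off:end], table))
	}
	return sums
}

func main() {
	pack := []byte("blob data ... blob data ... pack header")
	fmt.Println(sectionChecksums(pack, 16))
}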

Thoughts?

Could data rot protection be optional then? I consider the size increase more than marginal (it is a great feature for other people though, I believe!)

Let me play with this a bit, so I can get a feeling for how much larger (or smaller?) the repo will be when ECC is combined with compression. Maybe we add two kinds of codes: One for the pack header, and one (maybe optional) for the data.

drop username and host

Sounds like a good idea. If we want to keep the information it could be added to a separate encrypted field, in the same way as the master key.

ECC: The file size will increase by ~14-30%,

I don't think it is a good idea to include the ECC in the pack files themselves. They are of no use in a typical restore scenario, and are only used in case the pack files are damaged.

I suggest that the parity data be placed in a separate directory:

repo/data/1e/1ef7267...
repo/parity/1e/1ef7267...
  • Parity will be completely optional, and may be created post backup.
  • No slow down of restore operations. No extra bandwidth needed for restore.
  • Identical filenames make it easy to identify the correct parity data. This will mean that the parity data is not named after its own sha256 hash, but no additional index will be needed. (Verification of the parity data should be done by verifying the pack files, anyway.)
  • The user gets an idea of the amount of parity data.

No matter how it is implemented: with many layers of compression and encryption, I think some kind of ECC is necessary. One wrong bit might cause a lot of trouble.

Thanks for your comments, let's move the discussion to a separate issue I've just created: #804.

I can't help but get the impression that there are two groups talking past each other about forward error correction codes in restic. One group wants to (just) protect the repo from bitrot, because even a single bitflip can create a huge problem in a deduplicated repo. The other group wants to use erasure codes to spread the repo over multiple failure domains (e.g. non-RAID disks). Both goals can be served by Reed-Solomon codes, but they require different parameters and different storage layouts.

I ran a quick check on my repo with my python script (https://github.com/oysols/restic-python).

header_length:        8688549
tree_length:         53898054
data_length:     146443506727
treeblobs:               8466
datablobs:             200975
packfiles:              29351
---- repo size by dir ----
            155   config
146 510 470 574   data
     27 538 629   index
          4 545   keys
          4 243   locks
         14 041   snapshots
          4 096   tmp
-----
Currently 116071 original files contained in the backup.

Of a 146 GB backup, the tree blobs are only 54 MB and will compress nicely to about a third of the original space, when we implement compression.

Would there be a performance improvement by separating the tree blobs from the data blobs?

It seems that most of the operations performed during a restore are done based on the tree blobs, before actually restoring the data. Separating them into their own pack files would minimize the amount of data that needs to be downloaded to parse the tree of a backup. Given the small size of the tree blobs it might even be faster to download all tree blobs before starting the restore process.

Of course, this distribution might not be the same for all repos.

Do you think this is worth looking into further?

Would there be a performance improvement by separating the tree blobs from the data blobs?

Maybe, this is one of the optimizations I have in mind for the future.

Apart from this I'd also like to add a local cache for the metadata, so that it doesn't need to be fetched from the repo at all. This should greatly improve speed of many operations.

Would there be a performance improvement by separating the tree blobs from the data blobs?

This could theoretically improve the prune operation, as less repacking would be needed if tree blobs and data blobs were in separate pack files (it might become possible to wholesale delete an old pack file instead of repacking it).

I'm already looking at that during #842

gcm vs ctr: http://www.daemonology.net/blog/2009-06-11-cryptographic-right-answers.html

sym vs asym: the idea is to pubkey-encrypt a "session"-key, right?

Let's not discuss crypto in this issue, as it is postponed for now. The relevant issue for asymmetric crypto discussions is #187. In addition, I'd like to keep the discussion on a high level until we've nailed down the use case. Then we can talk about low-level crypto.

Remove username and hostname from key files.

Huge metadata leak!
For example, "username":"WorldBank\\JimYongKim" clearly indicates a high-ranking owner.

Waiting for this to be _removed_ (or _encrypted_) since the first Windows binary was compiled in Jan 2017.
Fortunately, I examined the backup before uploading or recommending Restic to privacy-aware fellows.

Edit: The user's timezone is also mentioned in plain text, which goes against the confidentiality principle as well.

Re: SHA3 -- here's one opinion about why it's not worth adopting (yet?): https://www.imperialviolet.org/2017/05/31/skipsha3.html

Thus I believe that SHA-3 should probably not be used. It offers no compelling advantage over SHA-2 and brings many costs. The only argument that I can credit is that it's nice to have a backup hash function, but both SHA-256 and SHA-512 are commonly supported and have different cores. So we already have two secure hash functions deployed and I don't think we need another.

I've read the post and I understand agl's arguments. For restic, that's not so relevant: we're using the hash function to (uniquely) identify blobs, not as a building block of a cryptographic protocol. My idea to look at other hash functions was mainly motivated by SHA-256 being slow to compute, especially on low-end systems. Other hash functions are much faster (e.g. blake2).
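
To make the comparison concrete, a quick sketch of hashing one 8 MiB buffer (the size used in the encryption benchmark above) with SHA-256 and BLAKE2b-256; note this is an unscientific single-run timing, a real comparison would use Go's testing.B benchmarks:

package main

import (
	"crypto/sha256"
	"fmt"
	"time"

	"golang.org/x/crypto/blake2b"
)

func main() {
	blob := make([]byte, 8<<20) // 8 MiB of zeroes, just to exercise the hash functions

	start := time.Now()
	id := sha256.Sum256(blob)
	fmt.Printf("sha256:  %x... %v\n", id[:4], time.Since(start))

	start = time.Now()
	id = blake2b.Sum256(blob)
	fmt.Printf("blake2b: %x... %v\n", id[:4], time.Since(start))
}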

Not sure whether this is a repo-format thingy: How about making encryption optional? I am thinking of backups that will be stored on a trusted backup server that has encrypted disks already.

@mschiff See #1018 for that discussion. ;)

How about making the pack file size an option? Currently I have 4-6 MB per file. With fewer but larger files, remote backups would be much faster.

@fd0 wrote:

At the moment, I'm not concerned with speed (restic is already really fast), so at least for me this item is of low priority. There even exists an optimized version of SHA256 for SIMD-capable processors we can simply switch to. On the other hand, when we decide to speed restic up and the hash is to be discussed, I'd probably prefer Keccak (SHA3) or Blake2, those are (as far as I know, I didn't do any benchmarks yet) much faster.

Another consideration for a faster, less CPU intensive hash algorithm (like Blake2) would be reduced battery usage on laptops when making backups while not connected to a power source.

Responding to the first post:

Remove username and hostname from key files.

Would this be replaced by a key name or description of some sort? I believe some way to distinguish different keys (without having access to the key itself, e.g. when revoking access for some machine) is needed to make key management practical.

One new suggestion: How about using a different key for blobs, trees and snapshots? This would, AFAICS, enable a scenario where forgetting and pruning happens on the backup storage server, rather than in the clients. By giving the storage server access to the tree and snapshot objects, it should have sufficient info to determine which objects are needed by which snapshots and which objects are no longer used. If the storage server is ever compromised, access is gained to the tree metadata, but not the actual file contents.

This can be made slightly stronger by only allowing access to the list of object IDs referenced by a tree, without allowing access to the rest of the metadata (but this would require an additional data structure for each tree).

If the above were made possible, it would open up the way for using write-only / append-only storage (where the backed-up client cannot delete backups, see #784), without having to sacrifice either automated pruning or data security.

Regarding my previous comment (pruning without needing full access to the data): this also applies (possibly even more strongly) to checking the backup. It makes sense to check a repo on the storage server for efficiency reasons (AFAICS checking a repo remotely requires transferring all content), or when implementing true write-only support (see https://github.com/ncw/rclone/issues/2499).

Also, for a true write-only approach, changes are needed in restic to limit which files it needs to read (according to https://github.com/ncw/rclone/issues/2499#issuecomment-418609301). I'm not sure if that also needs repository format changes; if so, it might be useful to include them here.

Checking and pruning a repository on the remote server would be really great. I'm in the process of setting up restic to back up multiple hosts and would like to do all maintenance tasks on the remote so that client setup is as easy as possible and only requires the backup to run regularly.

I would like to discuss some (maybe optional) additions to the snapshot file format (a sketch follows the list):

  • Add list of used index files (see #1988)
  • Add the possibility for user-defined data (like additional descriptions, etc.; I couldn't find the issue right now)
  • Add statistical data like the number of files/blobs, used space, etc. This could make displaying statistics much faster and also allows for extra checks
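
A sketch of a snapshot file with such optional additions could look like this (the index_files, description and stats field names are purely illustrative; IDs and numbers reuse examples from earlier in this thread):

{
  "time": "2020-05-01T10:23:10.123456789+02:00",
  "parent": "ed54ae36197f4745ebc4b54d10e0f623eaaaedd03013eb7ae90df881b7781452",
  "tree": "73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c",
  "paths": ["/home/user"],
  "hostname": "laptop",
  "username": "user",
  "index_files": ["9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae"],
  "description": "weekly full backup",
  "stats": {"files": 116071, "data_blobs": 200975, "tree_blobs": 8466, "data_size": 146443506727}
}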

Regarding the pack file format, I would like to question why the header is not removed completely.
The information is redundantly included in the index files. There has been some discussion about adding redundancy for error correction, but IMO this should (and can) be separated from the general repo format and may or may not be added on top of it.

To put it briefly: if you do not need or want extra information for error correction, there is no need to duplicate information in the pack header and index files. Index files are needed for performant operation and are used everywhere. Pack headers are rarely used, and then only for double-checking or error correction.

Another proposal: Add username, hostname and the contents of the config file to the data section of the key file, thus completely removing the config file.

As already proposed, username and host should be present only in encrypted form. To check whether the KDF-derived key is valid, it is already sufficient to check the MAC of the encrypted data section.
IMO it makes sense to put all information needed for identifying the key there. Adding the information which is currently stored in the config file just removes an extra file from the repo and makes repo initialization easier.
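
A sketch of what the decrypted data section of such a key file could then contain; today it holds only the master keys (shown abbreviated), and the remaining fields are the ones this proposal would move in (all names and values illustrative):

{
  "mac": {"k": "...", "r": "..."},
  "encrypt": "...",
  "username": "user",
  "hostname": "laptop",
  "config": {
    "version": 2,
    "id": "...",
    "chunker_polynomial": "..."
  }
}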

Sorry for the "dumb" question, but is there any serious effort going on to introduce an improved format any time soon? I've been following this issue for years. restic currently doesn't work well for large data sets, or when there are many snapshots, and it requires a lot of memory. It seems like the lack of compression and large overhead of the JSON encoded metadata are the big problems of restic. Maybe the effort has stalled because there is a perceived need to achieve "perfection"?

I'm also interested in what the future brings to restic. Especially in asymmetric crypto and compression.
By the way, for a new hash function, there is also blake3 which is very new and also extremely fast: https://github.com/BLAKE3-team/BLAKE3
If no decision has already been made regarding a change of hash function, this might be interesting.

A few more ideas for repo.v2:

  • save tree and data blobs to different directories (in the past tree and data blobs were mixed, but this was deprecated with the introduction of the cache).
  • add 'created time' info to data blobs.

This should simplify support for 'cold' storage with very slow or expensive downloads, like Amazon Glacier.

* save tree and data blobs to different directories (in the past tree and data blobs were mixed, but this was deprecated with the introduction of the cache).

I don't like this idea. It makes guessing the file sizes of backups much easier, while I do not see a benefit to it.

* add 'created time' info to data blobs.

I don't see any use of adding a "created time". Can you give a use case?

This should simplify support for 'cold' storage with very slow or expensive download like amazon Glacier.

I would say that "cold storage" support can already be achieved with the present repo format by adding some fine-tuning possibilities to restic and double-storing the nevertheless frequently used files, e.g. in a local cache. See also #2504

* save tree and data blobs to different directories (in the past tree and data blobs were mixed, but this was deprecated with the introduction of the cache).

I don't like this idea. It makes guessing the file sizes of backups much easier, while I do not see a benefit to it.

The benefit is well laid out in earlier comments on this very issue:
https://github.com/restic/restic/issues/628#issuecomment-280436633
https://github.com/restic/restic/issues/628#issuecomment-280833405
The results in the first comment also show that mixing these two types of blobs doesn't obscure file sizes in any meaningful way.

related forum post:
https://forum.restic.net/t/feature-using-an-index-for-restic-find/1773

@cfbao The comments you are referring to are about mixing tree and data blobs within one data file (pack). Separating these was useful to enable cache handling. This has also already been changed within restic.

However, I still do not see any benefit of saving tree and data blobs in different directories. Can you give a use case? (IMO the find forum topic is not related - I'll answer you there)

Separating tree and data blob entries in separate index files (e.g. directories "index-data" and "index-tree") would allow some improvements, though.

Tree blobs are already stored in separate pack files (this was introduced together with the cache).
Just writing them to a different directory would open a way to improve support for very slow-to-download storage (3-5 hours for Amazon Glacier standard), like storing all metadata (index + snapshots + trees) in regular S3 and the data in Glacier.

#2504 improves it a bit, but I don't like the idea of relying on a local cache or waiting a long time to fill it.

I much prefer the idea of having a reverse proxy that stores trees + index + snapshots on regular S3 or any other place, but puts data into a deep archive.
In any case, it's surely possible to use restic as-is with some limitations, but some format changes may improve/simplify things.

@cfbao As far as I can see from the code, restic find will not benefit from this. It already walks over snapshots; basically it's mostly the same as restic ls <all-snapshots> | grep something.

@dionorgua
I like the idea of adding an arbitrary repository as "additional" cache where everything except data packs is cached. This does not need a change in the repository layout and IMO is much more flexible.
I'm already working on this, see also #2516, last comment.

Maybe it's a stupid idea, but what about a format compatible with borg or kopia?

@aawsome

The comments you are referring to are about mixing tree and data blob within one data file (pack).

True. My bad. Yeah, this is the only thing I care about.

This is also already changed within restic.

Do you know in which PR/version this was changed? Last time I checked my repo, I noticed a mix of data & tree blobs in the same pack files. Any way I can (probably slowly) convert my repo to have better separation?

I still do not see any benefit of saving tree and data blobs in different directories. Can you give a use case?

I have no idea. As mentioned before, I don't really care about this.


@dionorgua

As far as I can see from the code, restic find will not benefit from this. It already walks over snapshots; basically it's mostly the same as restic ls <all-snapshots> | grep something.

Wouldn't separating tree blobs from data blobs reduce the number of API calls necessary for find? If concentrated, the number of pack files that contain tree blobs would be reduced, and restic can download a smaller number of whole files instead of fetching many segments from many pack files. This matters for backends that are relatively slow and have stricter rate limiting (e.g. Google Drive).

Any way I can (probably slowly) convert my repo to have better separation?

With a recent version of restic, a run of 'prune' should rebuild these mixed packs.
Note that the current implementation of 'prune' generates a lot of traffic for remote repositories. With the experimental reimplementation in #2718 you would be able to repack only the mixed packs while causing minimal traffic.

Wouldn't separating tree blobs from data blobs reduce the number of API calls necessary for find?

In a recent version, and with a repo that has no mixed packs, all needed information is locally cached.

Another idea for an improved repository format:

As we have seen, it is beneficial to separate pack files by blob type (tree and data blobs go into different pack files). Wouldn't it be a good idea to also separate index files by blob type? The recent index PRs already go in the direction of separating index entries for tree and data blobs in the in-memory representation.
There are also possible optimizations to load only part of the index.

Doing this also in the repository would allow a more compact representation, e.g.

{
  "supersedes": [
    "ed54ae36197f4745ebc4b54d10e0f623eaaaedd03013eb7ae90df881b7781452"
  ],
  "type": "data",
  "packs": [
    {
      "id": "73d04e6125cf3c28a299cc2f3cca3b78ceac396e4fcf9575e34536b26782413c",
      "blobs": [
        {
          "id": "3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce",
          "offset": 0,
          "length": 25
        },{
          "id": "9ccb846e60d90d4eb915848add7aa7ea1e4bbabfc60e573db9f7bfb2789afbae",
          "offset": 38,
          "length": 100
        },
        {
          "id": "d3dc577b4ffd38cc4b32122cabf8655a0223ed22edfd93b353dc0c3f2b0fdf66",
          "offset": 150,
          "length": 123
        }
      ]
    }, [...]
  ]
}