Restic: Implement Compression

Created on 15 Nov 2014 · 167 comments · Source: restic/restic

This is a tracking issue for discussions and other issues/PRs related to the request for implementing compression.

The following issues/PRs are related to this topic (and may therefore be closed in favor of this one):

  • PR #2441
Labels: backend, backup, feature suggestion, tracking

Most helpful comment

I think there has been enough discussion on the issue of adding compression. I can see it's a highly anticipated feature. I will tackle this next after finishing the new archiver code (see #1494).

Please don't add any further comments, thanks!

All 167 comments

When implementing this, add benchmarks and especially have a look at memory usage with -benchmem and benchcmp!
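For illustration, a benchmark of that kind could look roughly like this (compressBlob is a hypothetical placeholder, not existing restic code); run it with go test -bench . -benchmem and compare runs with benchcmp:

package repository

import "testing"

// compressBlob stands in for whatever compression entry point restic ends
// up exposing; it is a placeholder, not part of the current code base.
func compressBlob(dst, src []byte) []byte {
    return append(dst[:0], src...)
}

func BenchmarkCompressBlob(b *testing.B) {
    blob := make([]byte, 1<<20) // a typical 1 MiB chunk
    dst := make([]byte, 0, len(blob))

    b.SetBytes(int64(len(blob)))
    b.ReportAllocs() // reports the same allocation data as -benchmem
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        dst = compressBlob(dst, blob)
    }
}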

lz4, lzo, lzma, null. bz2 is rather slow.

snappy is fast with moderate compression

If compression is done per-chunk, care should be taken that it doesn't leave restic backups open to watermarking/fingerprinting attacks.

This is essentially the same problem we discussed related to fingerprinting the CDC deduplication process:
With "naive" CDC, a "known plaintext" file can be verified to exist within the backup if the size of individual blocks can be observed by an attacker, by using CDC on the file in parallel and comparing the resulting amount of chunks and individual chunk lengths.
As discussed earlier, this can be somewhat mitigated by salting the CDC algorithm with a secret value, as done in attic.

With salted CDC, I assume compression would happen on each individual chunk, after splitting the problematic file into chunks. Restic chunks are in the range of 512 KB to 8 MB (but not evenly distributed - right?).

  • Attacker knows that the CDC algorithm uses a secret salt, so the attacker generates a range of chunks consisting of the first 512 KB to 8 MB of the file, one for each valid chunk length. The attacker is also able to determine the lengths of compressed chunks.
  • The attacker then compresses that chunk using the compression algorithm.
  • The attacker compares the lengths of the resulting chunks to the first chunk in the restic backup sets.
  • If a matching block length is found, the attacker repeats the exercise with the next chunk, and the next chunk, and the next chunk, ... and the next chunk.
  • It is my belief that with sufficiently large files, and considering the fact that the CDC algorithm is "biased" (for lack of a better word) towards generating blocks of about 1 MB, this would be sufficient to ascertain whether or not a certain large file exists in the backup.
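To make the loop concrete, here is a hedged sketch of the length comparison described above (all names are hypothetical, and deflate merely stands in for whatever compressor the backup tool would use):

package sketch

import (
    "bytes"
    "compress/flate"
)

// compressedLen returns the deflate-compressed size of data.
func compressedLen(data []byte) int {
    var buf bytes.Buffer
    w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
    w.Write(data)
    w.Close()
    return buf.Len()
}

// matchFirstChunk tries every candidate chunk length of the known plaintext
// and reports whether one of them compresses to the observed (leaked) size.
func matchFirstChunk(known []byte, observedSize int) (chunkLen int, ok bool) {
    const minChunk, maxChunk = 512 << 10, 8 << 20
    for n := minChunk; n <= maxChunk && n <= len(known); n += 16 { // 16-byte steps, cf. the AES-CTR remark below
        if compressedLen(known[:n]) == observedSize {
            return n, true // plausible boundary; repeat with known[n:] and the next observed size
        }
    }
    return 0, false
}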

As always, a paranoid and highly unscientific stream of consciousness.

Thoughts?

Interesting. I don't understand how the attack you describe depends on whether compression is used, is it necessary at all? Doesn't this attack work with and without compression?

At the moment I'm thinking about how to implement #56. What are your thoughts on bundling together several blobs into a single file?

The exact workings of the CDC implementation are a bit unclear to me:

  • Do you split on exact byte boundaries or are the 512 KB - 8 MB blocks "rounded up" to a multiple of something?
  • (The real question is: are there (15 * 512 * 1024) / (16 because of AES-CTR) possible chunk sizes, or fewer?)
  • I'm also curious about how feasible it would be to reconstruct the seed given enough chunks of a known file - not very feasible, I'm guessing?

To answer your first question:
With the seeded CDC, the "fingerprint" depends on the (content + the secret seed), but the difference is that when compression is done _after_ the chunking, and assuming that you can distinguish individual blocks from each other, you have a fingerprint/watermark (the compression rate of a certain block) that depends solely on the content, in this scenario a known plaintext.

Example:
If the watermarked file contains 64 MB (8-128 chunks) of "AAAA", then 64 MB of "ABCABCABCABC", then 64 MB of random data, the first 16-256 chunks would be very small (because those sequences compress very well), whereas the last 8-128 chunks of random data would compress rather poorly.
The attacker would also be able to work backwards, starting with the very last (24th - 384th) chunk and compress 512KB-8MB until the attacker finds a size that compresses to exactly the same chunk size. Once that is found, the "next" 512KB-8MB of the original plaintext is compressed to find out which length compresses to the length of the second-to-last block (23rd - 383rd), and so on, until the attacker meets the small chunks that are the result of the "AAAA" strings.
This doesn't enable an adversary to positively confirm that a watermarked file is stored in the backup, but I think that statistically it can create quite clear results, given enough data.

I see some potential solutions, perhaps you have more ideas:

  • Permit turning off compression and/or deduplication for certain directories (probably the easiest to implement)
  • Randomize compression dictionaries (haven't really thought this one through, but it seems like an interesting idea)
  • Prevent attackers from learning the lengths of individual compressed chunks, for example by padding (potentially quite expensive) or by bundling together several small chunks and padding the remaining (more complexity, but more efficient)

Thanks for your explanation, I understand your scenario now.

There will probably be no option to turn off deduplication in restic, because that's really hard to do given the current structure of the program; restic is built around CDC. Adding compression support is of low priority right now and not a goal for the alpha release.

Your third idea will be implemented in #56 (bundling together multiple chunks), I'm working on that right now. And I'll probably add more documentation to doc/Design.md regarding how the chunker works.

Thanks again for bringing up this scenario!

I am not sure if I follow @cfcs - isn't the compressed size just like an incredibly bad hash? Given a compressed file size, the number of possible inputs that generate that file size is endless. But probably I just don't understand.

Anyway. I just wanted to shamelessly point you to a modified deflate/gzip library I made. It might interest you that I put in a constant time compression mode, which enables ~150MB/s/core throughput on ANY data, which makes it almost invisible in backup scenarios. There is also a parallel gzip package that gzip-compresses bigger files on multiple cores.

@klauspost: Bear in mind that we are discussing the compressed sizes of individual chunks rather than files. A 100GB file with an average chunk size of 1MB will have about 100 x 1024 chunks, each leaking the compression ratio for a certain piece of the file. This results in much more statistical data than the size of a single compressed file, making it possible to compare a known plaintext to a compressed and chunked file even if the CDC salt (and thus the exact alignment borders of chunks) is unknown.

151 has been merged, though, so this is probably not an issue now.

In my opinion, this information leak is only of minor importance, considering we have a seeded chunker (via the custom per-repo polynomial) and blob packing (where an attacker cannot see individual chunks, but only groups of chunks and the number of chunks in a group, via the cleartext header length). I think it's a good idea to offer disabling compression completely for speed or privacy concerns, but the default (when this is implemented) will probably be "enabled".

@klauspost I will definitely look at your library, thanks for pointing it out!

I agree with your observations above, @fd0, but I'd like to add that I think there can be another important use-case for selectively disabling compression for specific target files/directories/devices, for example when backing up multimedia formats that are already compressed, binary files with high entropy that do not compress well, or when doing incremental backups of encrypted volumes.

@cfcs - ok - from your text I thought you implied you could deduce the data. So for this to make any difference, you would need the original data, which is rare.

Regarding non-compressible files, that is the reason I mentioned the constant time Huffman-only mode I put into deflate; as the name implies it compresses all data at the same speed, and has automatic fallback to storing uncompressed data. So the maximum overhead is about 0.04% if the content is already compressed.

Here are some benchmarks. The most applicable for backups is probably the "Medium Compression" one, which is 10GB of mixed content - compressible and uncompressible.

Defaulting to gzip Huffman-only and having an option to specify a more CPU-consuming compression level could make sense.
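As a sketch of what the "gzip Huffman only" default could look like in Go (compress/flate gained a HuffmanOnly level in Go 1.7; the per-blob wiring is assumed here, not taken from restic):

package compress

import (
    "bytes"
    "compress/flate"
)

// compressBlobHuffmanOnly compresses a chunk with Huffman-only deflate:
// near-constant throughput on any input, modest ratio, no match searching.
func compressBlobHuffmanOnly(blob []byte) ([]byte, error) {
    var buf bytes.Buffer
    w, err := flate.NewWriter(&buf, flate.HuffmanOnly)
    if err != nil {
        return nil, err
    }
    if _, err := w.Write(blob); err != nil {
        return nil, err
    }
    if err := w.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}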

In my opinion it could be valuable to selectively enable/disable compression separately for data and tree objects. Especially large tree objects should compress really well.

I looked at some "points" where it could make sense to insert compression. The most transparent and versatile place would be somewhere between the repository and the backend.

I briefly looked at modifying Repository.Encrypt and Repository.DecryptTo, but we don't have types, and the varying sizes would make a mess of things.

My proposition is to implement compression and _encryption_ as "backends" that each write to an underlying backend. This will make compression and encryption transparent to the repository.

The reason we need to separate out encryption is that encrypted data doesn't compress (as you probably know).

repository.Repository

The compression algorithm can only be set on initialization, and everything except the configuration is assumed to be compressed with that algorithm.

repository.Config

The configuration cannot be compressed. We add a string that indicates the decompression type to use for everything. Empty ("") is uncompressed. Otherwise it is the last part of the package name of the compression library used.

Note that compression levels can be changed between each run/type. There is no problem having a repo where some snapshots/types are deflate level 0 (store) and some at level 9 (best compression) - as long as the decompressor is the same.

type Config struct {
    Version           uint        `json:"version"`
    ID                string      `json:"id"`
    ChunkerPolynomial chunker.Pol `json:"chunker_polynomial"`
+   Compression       string
}

Compression is added as create parameter:

-func CreateConfig(r JSONUnpackedSaver) (Config, error) {
+func CreateConfig(r JSONUnpackedSaver, compression string) (Config, error) {

The backend is replaced after LoadConfig/CreateConfig. Here is an example of how it could look:

// SearchKey finds a key with the supplied password, afterwards the config is
// read and parsed.
func (r *Repository) SearchKey(password string) error {
    key, err := SearchKey(r, password)
    if err != nil {
        return err
    }

-   r.key = key.master
-   r.keyName = key.Name()
    r.Config, err = LoadConfig(r)
+   r.be, err = FindCompressor(r.Config.Compression, Encryption(key, r.be))
    return err
}

Compression Implementation

The compressor can implement selective/adjustable compression for some types. Since it is seen as a "backend", the compressed size will never be visible to the repository. The compressor must be symmetric with all settings.
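To illustrate the chaining idea, here is a rough sketch with a simplified, hypothetical backend interface (not restic's actual one); the compressing layer sits in front of whatever comes next, e.g. an encrypting backend:

package backend

import (
    "bytes"
    "compress/flate"
    "io"
    "io/ioutil"
)

// Backend is a simplified stand-in for restic's backend interface.
type Backend interface {
    Save(name string, rd io.Reader) error
    Load(name string) (io.ReadCloser, error)
}

// CompressingBackend compresses everything written through it and passes
// the result on to the next backend in the chain.
type CompressingBackend struct {
    next Backend
}

func (b CompressingBackend) Save(name string, rd io.Reader) error {
    var buf bytes.Buffer
    w, err := flate.NewWriter(&buf, flate.BestSpeed)
    if err != nil {
        return err
    }
    if _, err := io.Copy(w, rd); err != nil {
        return err
    }
    if err := w.Close(); err != nil {
        return err
    }
    return b.next.Save(name, &buf)
}

func (b CompressingBackend) Load(name string) (io.ReadCloser, error) {
    rd, err := b.next.Load(name)
    if err != nil {
        return nil, err
    }
    defer rd.Close()
    data, err := ioutil.ReadAll(flate.NewReader(rd))
    if err != nil {
        return nil, err
    }
    return ioutil.NopCloser(bytes.NewReader(data)), nil
}

Because the chain is symmetric, a single generic round-trip test could cover both the compressing and the encrypting layer.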

Issues

HELPME/FIXME: "Packed" files seem like a problem, since encryption re-starts at each file. If encryption is moved to the backend, encryption will be for the entire blob, not for each file. I assume that is a problem, and I have no good solution.

TODO: Find some good way for parameters/configuration to be sent to the compressor. Not needed for a first implementation.

TODO: Do we care about the on-disk sizes? If the above is implemented, the repository will not know it.

Thanks for sharing your thoughts, here are mine:

I think that compression and encryption need to be integrated with one another; I will think about this. At the moment, I don't like the thought of completely abstracting the compression/encryption layers from one another. As you already described, we must take care not to do stupid things, e.g. compress encrypted data. In addition, we should offer an option to disable compression in favor of speed and/or security concerns. Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional.

Also, the packed files are a level of abstraction in themselves (e.g. stuff can be repacked), which doesn't work with abstracting crypto.

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have a good plan ready for it right now.

Regarding the different compression algorithms and options: I think we should select one algorithm (including a set of parameters) for data, and maybe a second one for compressing the JSON tree structures, but that's it. For all optional things (e.g. different configurable compression algorithms and/or parameters) I'd like to have a sound consideration weighting the benefit against the added complexity and amount of additional code.

Please don't get me wrong, I'd like to add features to restic, but especially for changes that modify the on-disk format, I need very good arguments. In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be manageable.

You'll need different algorithms and parameters (and not just per repo, but even per backup run); there is no "best for every use case" compression.

just as a practical example:

I do backups from my company server to my backup server at home, over a DSL uplink with ~700kbit/s.
For this I want the best compression (like lzma + high level). The CPU has lots of spare time at night while waiting for the crappy connection to accept the next packet.

To the same repository I also backup my laptop when I am at home, where I have a wireless "N" connection. Of course I don't want to slow down the connection by using lzma + high level there, but rather I want something very quick that doesn't slow down at all - like lz4. I still want compression though; not using lz4 would take about 2x the space.

I want lzma here but lz4 there (rephrased :-))

That's all bikeshedding imho... Let's pick a reasonable default and not expose too much configuration, which would just introduce lots of complexity for little gain.

I think the Repository object is way too complex and needs a general overhaul before tackling this. I don't really have a good plan ready for it right now.

Fair enough. I started looking through repository.go and found all the places you would insert a compression/decompression step, and the added complexity was not a good thing. By implementing it as a backend interface, you could suddenly take out all the encryption, and make it part of a backend chain. You could make a generic test that ensures symmetry of the backend parts, including compression and encryption.

Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional.

Which is why backend chaining makes it so nice. You just leave out the encryption (or create a passthrough backend) and it seamlessly works.

In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be manageable.

This was the least intrusive way I can see. If there is a way to fix the 'packed file' issue, uncompressed repos remain fully backwards compatible; old clients will be able to operate on them as before, alongside new ones.

Compressed repos obviously will not, but we can bump the version to 2 only if the repo is compressed; this will make older clients fail. The check should then obviously be changed to if cfg.Version > RepoVersion {, but that has no compatibility problems.

That's all bikeshedding imho... Let's pick a reasonable default and not expose too much configuration, which would just introduce lots of complexity for little gain.

Agree. Most algorithms (lzma/deflate) have plenty of flexibility within the same decompression format.

To test compressibility there is DataSmoke: https://github.com/Bulat-Ziganshin/DataSmoke

Also, pcompress chooses a nice assortment of compression algorithms: https://github.com/moinakg/pcompress

The squash compression abstraction library has a good list of algorithms and a benchmark: https://quixdb.github.io/squash/

There is a text compression benchmark here: http://mattmahoney.net/dc/text.html

An easy approach is to always filter inside the crypto/crypto.go Encrypt/Decrypt functions.

gzip-compression-v1.patch.txt is a proof-of-concept diff that could be applied to HEAD.

Thanks for trying @mappu, but before implementing this we need to agree on a strategy. Open questions (from the top of my head) are at least:

  • When is compression applied? (Data? Metadata/JSON?)
  • Which algorithm should we implement? I think @klauspost has some suggestions for us :)
  • How do we store this in a repository without breaking clients?

When is compression applied? (Data? Metadata/JSON?)

Data obviously yes.
Metadata/JSON: maybe it's pointless, but I think it could help for large metadata files, since JSON data is mostly ASCII and would benefit from an arithmetic coding / Huffman phase (which gzip has).

Since data/metadata is always encrypted, I think adding it to the encrypt/decrypt routine is a simple way to catch all usages, without the problem @klauspost described: "I started looking through repository.go and found all the places you would insert a compression/decompression step, and the added complexity was not a good thing."
Also, since blobs are stored named hash(plaintext), not hash(ciphertext) (obviously necessary for dedup, otherwise the random IV would destroy dedup), it's safe to do this without hurting dedup.

Which algorithm should we implement? I think @klauspost has some suggestions for us :)

Don't care, although I agree with @ThomasWaldmann that it should be configurable, at least for a --best and a --fast use case.

I would suggest gzip only because it's pure Go, it's in the Go standard library, and it receives performance attention from Google. xz is much stronger but slower; lz4 is much weaker but faster. gzip is a balance and easily tuned, even if it doesn't reach either extreme.

How do we store this in a repository without breaking clients?

The repository should be compatible in one direction only. I think it's OK if old restic can't read a new repo, as long as new restic can read old repos.

Maybe you could add a tag byte after the MAC. Not present: no compression (old restic). The byte can also indicate which compression algorithm was used, e.g. 0x01 gzip, 0x02 lz4, or so.
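A tiny sketch of how such a trailing tag byte could be read (the tag values and the exact layout are made up for illustration, not taken from restic):

package format

import "errors"

// Hypothetical compression tags stored after the MAC in the new format.
const (
    tagNone byte = 0x00
    tagGzip byte = 0x01
    tagLZ4  byte = 0x02
)

// splitCompressionTag strips the trailing tag byte from a stored blob.
// Old-format blobs carry no tag and are treated as uncompressed.
func splitCompressionTag(stored []byte, newFormat bool) (data []byte, tag byte, err error) {
    if !newFormat {
        return stored, tagNone, nil
    }
    if len(stored) < 1 {
        return nil, 0, errors.New("blob too short for compression tag")
    }
    return stored[:len(stored)-1], stored[len(stored)-1], nil
}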

It seems the same (or a worse) problem as in attic is present (no compression type/parameter byte(s) in the current format).

In attic (borg) I was lucky as the old format was gzip-only and gzip format can be detected from the first 2 bytes. So I kept gzip without additional bytes (same as old repos) and added type bytes (for no or other compression) that are never ambiguous with the first 2 gzip bytes.

Obviously you can't do it like that if the old format is just raw, arbitrary data.

No. But there are other ways to signal this information, e.g. if the IV at the start of an encrypted chunk is exactly "NEWFORMAT" (a 1 in 2^xyz chance of collision), then parse it as the new format. It's not so "clean", but I think it's fine in practice.

This is how it's done with EXTENDEDPROTOCOL in the NMDC lock handshake.

Oh no, we will not do something as ugly as this, and we don't need to. Should we decide to implement this, the pack files have a type field for each blob. This is a uint8, and at the moment it is only defined for data and tree; we can easily add compressed data and compressed tree. https://github.com/restic/restic/blob/master/doc/Design.md#pack-format
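Sketched in code, assuming hypothetical constant names (the pack format today only defines data and tree):

package pack

// BlobType is the uint8 stored for each blob in the pack header.
type BlobType uint8

const (
    Data BlobType = iota // existing: plain data blob
    Tree                 // existing: JSON tree blob
    // Possible additions, names made up for illustration:
    CompressedData
    CompressedTree
)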

At the moment, I don't consider this feature a high priority.

Adding new blob types to the pack format is OK for data compression, but it doesn't provide index compression. Since indexes can become large, JSON has a good compression ratio, and the index may not have a local cache, I think it's important to compress indexes as well.

It's important that new-restic works with old-repo, to allow easy upgrading. It should be either seamless (preferred) or require a restic upgrade-repo tool (not preferred).

So How about this?

All restic commands already do first load+decrypt the config.

  • if the first config byte is { (the first byte of a JSON object), then the entire repo is in the old format (uncompressed)
  • otherwise, the first config byte is {tag byte} and the entire repo is in the new format. The {tag byte} at the start of the decrypted data indicates the compression format, e.g. 0x00 uncompressed, 0x01 gzip

Thanks for the proposal. I think it is unnecessary to compress the config, as this is only a really small file and can always stay in JSON format, and we can add fields as necessary, e.g. whether the index files are compressed or not.

I just found restic yesterday while searching for a good backup solution. One of my main concerns is limiting how much space my data takes up. Especially if I'm sending data to places I pay for, like S3. Dedup definitely will help, but I kind of expected compression would be part and parcel of a backup solution... In https://github.com/restic/restic/issues/21#issuecomment-185920429 you ( @fd0 ) say this is a low priority, could you explain why? Is there a roadmap I could look at anywhere?

Also, +1. ;)

At the moment I'm working on removing old backup data (#518). It's not easy to get compression right and secure at the same time, and I need to think a bit more about how to integrate this into the repository format.

We will implement compression (after all that what this issue is about), it just hasn't been done yet. restic is a rather new project, please bear with us :)

This issue is related to #116. Because of encryption we cannot compress the backup afterwards with other tools, can we? What priority do you have between compression and making encryption optional? (I bet on compression first!)
_Sorry to make pressure about this, you're right that repository format must be taken with care!_

This is easy to answer: Compression will be implemented first.

This is because I don't have any plans at the moment to make encryption optional. I think it's very hard to get right, too. We'd need to think about integrity, as this is a thing that shouldn't be optional, but (at least at the moment) it is tightly coupled to the encryption.

@fd0 Thanks for answering my question. Makes me wish my dev skills were up to helping on this. But I've only barely touched go, and most of my other xp is in webdev or sysadmin scripts.

I totally agree that you need to make sure compression is done "right and secure". If that delays things, so be it. :smile:

I implemented snappy compression in restic here: https://github.com/viric/restic/tree/snappy

It's just a proposal. Basically, I added snappy compression/decompression for blobs in packs, and I use a bit of the blob type byte as a mark. I also added a field in pack indices: PLength (plaintext length), which was until then not stored but calculated as "bloblength - crypto.Extension".

I noticed that for some of my backups it not only takes less space, but it even works faster (less data to handle).

All restic tests pass fine. It can work over previous restic repositories, but normal restic (that of master) cannot handle the new blobs.

I used snappy (https://github.com/golang/snappy) because I thought it would affect @fd0's aim for speed the least.
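For reference (not the linked branch itself), per-blob snappy compression with github.com/golang/snappy boils down to roughly this:

package blobs

import "github.com/golang/snappy"

// compressBlob returns the snappy-compressed blob; the caller records the
// plaintext length (the PLength index field mentioned above) separately.
func compressBlob(plaintext []byte) []byte {
    return snappy.Encode(nil, plaintext)
}

// decompressBlob restores the original bytes of a compressed blob.
func decompressBlob(compressed []byte) ([]byte, error) {
    return snappy.Decode(nil, compressed)
}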

Added $50 bounty for compression landing on master

Bountysource

As mentioned above, there should be an automated, non-configurable way to avoid trying to compress uncompressible files like media, encrypted or already compressed ones. The problem is exacerbated by some container formats like PDF whose contents are sometimes compressible, sometimes not.

It would be easiest to just use some algorithm that transparently handles this, like the constant time compression mode mentioned in the first comment by @klauspost.

Otherwise there would be a need for filetype lists: a blacklist never to be compressed, a whitelist always to be compressed, and a heuristic for the rest that tries to compress a small fraction of the file and gives up if the size reduction is not over a given threshold.

Not sure how well this would map at the chunk, rather than at the file level.

I'd argue against adding it to the encryption/decryption pass.
We don't want to mix different sorts of data, since some of it may be predictable, and the resulting pack/blob lengths can leak information about the plaintext of the unpredictable/secret data.
I think it should be per-file, even if this makes it "less nice". That, however, comes with the benefit of not having to do stray decompression of tons of pack files (where only one blob in each matters) to read a file.

@teknico

As mentioned above, there should be an automated, non-configurable way to avoid trying to compress uncompressible files like media, encrypted or already compressed ones.

My modified deflate package implements skipping of already compressed data and does so at a rate of ~250MB/s per core. The Go 1.7 deflate only supports that at the fastest compression levels.

Snappy and LZ4 support similar skipping functionality.

It would be easiest to just use some algorithm that transparently handles this, like the constant time compression mode.

It should definitely be an option. In Go 1.7 (now called HuffmanOnly and my equivalent) this mode supports ~200MB/s per core no matter the input. However compression is severely hampered compared to "best speed", which typically operates at 80 MB/s/core.

@cfcs

I think it should be per-file, even if this makes it "less nice".

Generally I agree. I will have to read up on restic. Is the binary size of each pack available unencrypted?

@klauspost It looks like some of your improvements got merged into Go 1.7 DEFLATE "BestSpeed" mode, is that correct? Maybe that would be a reasonable default.

The advantage of using the DEFLATE format is that there are many different compressors available that produce compatible bitstreams, so it is completely transparent to the decompressor.

Due to the nature on how restic works (split files into blobs, only handles blobs afterwards) the easiest way to add compression is on the blob level. Maybe we could add some heuristics to decide whether or not a blob should be compressed, but that can be the second step.

Blobs are combined into pack files, which are then stored in the repository. A pack file contains a number of (separately encrypted) blobs, followed by an (encrypted) header, followed by the (unencrypted) header length. Attackers without the decryption key only see ciphertext, the header length and the file length. So based on the size of the pack file and the header length attackers could compute the average size of a blob in a particular pack file, but that's it. The Index files also contain all data (size, encrypted size, and later then maybe the compressed size), but those are also encrypted. I don't see any risk here.

A "compressible" test heuristic is both prone to errors and rather expensive. I would estimate it would be hard to get much above above 200 MB/s/core - that is the speed of the order 1 lookup in the dedup package on AMD64.

Also, it would depend a lot on the compressor used. Snappy would not be able to compress random base 64 data, but deflate would for instance, so I would leave that part to the compressor - we have that built in for Snappy, LZ4 and deflate.

@fd0 sorry, I meant per-blob, not per-file.
Unless we choose a lightweight algorithm, compression, being somewhat CPU-heavy, is likely to become a bottleneck (next to AES, which in the future will hopefully be taken care of by AES-NI).

@fd0 - I made a quick "compressibility estimator": https://play.golang.org/p/Ve5z3txkyz - it estimates the predictability and entropy of arbitrary data. Though, as I mentioned, it should rather be up to the compressor to decide.
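The playground link has the real estimator; as a rough illustration of the idea (not @klauspost's code), a byte-entropy check can be as simple as:

package estimate

import "math"

// entropyBitsPerByte returns the Shannon entropy of data in bits per byte.
// Values close to 8 indicate data that is likely already compressed or
// encrypted and not worth compressing again.
func entropyBitsPerByte(data []byte) float64 {
    if len(data) == 0 {
        return 0
    }
    var counts [256]int
    for _, b := range data {
        counts[b]++
    }
    n := float64(len(data))
    var entropy float64
    for _, c := range counts {
        if c == 0 {
            continue
        }
        p := float64(c) / n
        entropy -= p * math.Log2(p)
    }
    return entropy
}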

borg 1.1 will have 2 "compression deciders":

  1. decide per file based on path pattern match (*.zip, *.mp3, /htdocs/photos/*, ...)
  2. if undecided, decide per chunk, use lz4 as test for compressibility - if that compresses, compress again with the desired compression (lz4, zlib, lzma), if not, do not compress.
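A sketch of that two-stage decision in Go, with snappy standing in for the cheap lz4 trial (the patterns and the threshold are made up):

package decide

import (
    "path/filepath"

    "github.com/golang/snappy"
)

// neverCompress holds path patterns whose files are stored uncompressed.
var neverCompress = []string{"*.zip", "*.mp3", "*.jpg"}

// shouldCompress decides per file by pattern first, then per chunk by
// running a cheap compressor as a compressibility test.
func shouldCompress(path string, chunk []byte) bool {
    base := filepath.Base(path)
    for _, pat := range neverCompress {
        if match, _ := filepath.Match(pat, base); match {
            return false
        }
    }
    // If the fast trial saves less than ~3%, skip the real (possibly
    // slower) compressor for this chunk.
    trial := snappy.Encode(nil, chunk)
    return len(trial) < len(chunk)-len(chunk)/32
}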

@klauspost hm, that test is not so bad on my machine:

BenchmarkCompressibility-4           100      10345544 ns/op     810.84 MB/s

Benchmark code is here: https://gist.github.com/908c23123dda275a479cf931f2784f5d

LZ4 does not have an entropy coder, so it'll probably produce lots of false negatives?

I think we need three modes (globally):

  • Compress all data blobs with a linear time compressor (default)
  • No compression
  • Max compression (for people with a lot of CPU power, but only small bandwidth)

I'd like to always compress tree objects (JSON), so we should select a proper algorithm for ASCII text.

Otherwise I'll work with @viric to build a prototype, then we can reason about a concrete implementation.

Thoughts?

@klauspost hm, that test is not so bad on my machine

I always forget how badsh*t insanely well the new Go compiler does. At least twice what I expected.

I think we need three modes (globally):

Deflate does all 3 pretty well, although there are more efficient compressors (LZMA mostly). Deflate without compression is of course unneeded, but it is fast and has minimal overhead, so a general deflate approach could be used, with the possibility to specify others later.

I have started looking at another speedup, which would be between level 1 and Huffman both in terms of speed and compression. However, time is a bit precious at the moment, and I still need to test a backport of some of the final Go 1.7 changes before I can move on to new things.

If you just want a single compression algorithm you should take a look at the new contender zstd: https://github.com/facebook/zstd

It was developed by the same developer as lz4 and has a better compression ratio than gzip while being more than 3 times faster: https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/

zstd looks very promising, although I was unable to find an implementation in Go.

The official site http://facebook.github.io/zstd/#other-languages links to this Go implementation: https://github.com/DataDog/zstd

Or do you mean a pure Go implementation?

Yes, meant a pure Go implementation. At the moment, restic doesn't depend on any C code, and ideally I'd like to keep it that way.

Is there any forecast to implement compression?

Implementing compression depends on changing the repository format (planning/ideas are in #628), which requires great care. So, no, there's no definitive date when compression is added ;)

Is there anything we can do or contribute to help make this happen?

I don't think so, sorry :wink:, it just needs time.

So I thought I could use my test bed from #790 one more time. In my build of restic I removed all encryption, then made a full backup once again. It came in at the same size as the encrypted one - no surprises there. But then I compressed the repository, and what I found is:

35G backup-unencrypted
6.4G    backup-unencrypted.tgz2

What a difference! For comparison, here's size of a single database dump compressed:

1.7G    single-backup.sql.gz

I have 29 of these above. About 100 times savings compared to regular backups!

Since I found all the places where encryption is added, I think I can add a very simple configurable compression with a stock gzip implementation, with the possibility to use a different compression engine in the future. Any objections?

(I will probably give myself two weeks worth of evenings to succeed or fail.)

Thanks for your research and posting the results here! I expected similar results. To be honest with you: I won't merge anything that removes the crypto, or even makes it optional. That's something we can do later, but it must be carefully planned.

Adding compression might seem easy at first, but it isn't. We need to be very careful not to accidentally make restic vulnerable to unexpected attacks (this has happened to the TLS protocol several times in a row (yes, I'm aware that this is a different situation)).

The most important thing in the whole project is not the code: It is the repository format. Users trust us with their data, and they depend on being able to restore data after using restic a long time, so repository format stability is of utmost importance. So in order to support compression, we first need to decide (and implement) the next version of the repository format. The discussion is here: https://github.com/restic/restic/issues/628

I can see that you are very eager to implement this (and even contribute code), but please do not spend any time on this until we agree on the repository format and have discussed all the angles of this problem. Thanks!

As for the removed crypto, I ain't going to propose merging that. I did it only to see if compression would work. And yes, this would have to be carefully planned (no one wants to suddenly lose the ability to verify a repository with encryption disabled).

Since we use json.Unmarshal we could add as many new keys to the config as we want. If they're not found in the JSON, they'll just keep their default values.
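For illustration: encoding/json leaves absent keys at their zero value, so an old config without a (hypothetical) compression key still parses cleanly with a newer struct:

package config

import "encoding/json"

// Config mirrors the repo config; ChunkerPolynomial is simplified to a
// string here, and Compression is a hypothetical new key.
type Config struct {
    Version           uint   `json:"version"`
    ID                string `json:"id"`
    ChunkerPolynomial string `json:"chunker_polynomial"`
    Compression       string `json:"compression,omitempty"` // "" means uncompressed
}

// parseConfig accepts both old configs (no "compression" key, the field
// stays "") and new ones that carry the extra key.
func parseConfig(data []byte) (Config, error) {
    var cfg Config
    err := json.Unmarshal(data, &cfg)
    return cfg, err
}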

Algorithm choice is not the main point, understood; but just for future reference, Brotli seems a strong contender.

For all I know, Brotli compression is very slow (iirc 60 times slower than gzip), so it is recommended for data that is read very frequently compared to being written and compressed, which is probably not common for backups. But yeah, let's not go into details yet :)

This gives a good overview for the different compression algorithms.

Brotli is always faster or has a better compression. Depends on the compression level.

@ibib How do you reach that conclusion? It seems to me like brotli appears slower than most of the others (on the mixed datasets) while not achieving particularly amazing compression ratios. Maybe it's better for specific kinds of structured data?

As listed in the comparisons at the Squash benchmark there are three parameters to go by:

  • Achieved compression ratio: How well it compresses is important in order to save disk space (and I/O to the backend).

  • Compression speed: Is important because we are going to be doing this operation every time we add a block, so we really want something that can keep up with AES-NI and general I/O so as not to become a bottleneck. We do probably not want to pick an algorithm that compresses slower than it decompresses since we have the opposite use case of web browsers that these new algorithms (like zstd, lz4, brotli) are optimized for (we have "compress often, decompress seldom" as opposed to "compress once, decompress often").

  • Decompression speed: The decompression speed is only relevant when we are restoring. If we are okay with a slow restore, we can accept a slow decompression speed. On the other hand we also have metadata that we do not want to decompress slowly, so that might even warrant two different algorithms.

It appears that density is among the fastest, albeit not particularly efficient in terms of compression ratio. In terms of not being a bottleneck it seems like it'll get us (on avg) a 2:1 compression ratio almost for free. If we want 4:1 we'll have to pick a different algorithm, but then we'll end up sitting around and waiting for it.

We also have two (at least?) different kinds of data: The indexes; and the data chunks. The two are used differently, and I guess it could be discussed whether it would make sense to choose different algorithms for them. I personally think we should stick with one algorithm (whatever we choose) so that re-implementing Restic (in a new language or whatever) is not made unreasonably hard. And so that we don't expose ourselves to bugs from two exciting compression algorithms since those are hard to test for corner cases.

I have to disagree with your recommended trade-offs. Backups can run in the background at a convenient time (maybe with nice 10). Restoring them happens under time pressure. The relevant trade-off I see is between block size and compression ratio. Too small blocks won't compress well and increase the metadata overhead. Too large blocks reduce the dedup ratio. For most compression algorithms the higher level settings won't improve compression ratios for small inputs.

Additionally higher compression ratios allow users to keep more versions of their data in the same space.

Remember that my tests with snappy had the outcome of: 1) smaller backup size (it compresses, normal), and 2) faster backup and restore (less data ciphering, HMAC and transfer). Using a very cheap laptop.

@cfcs I referred to the comparison of gzip and brotli:

[chart comparing gzip and brotli]

There brotli always has the faster and better compression.

@Crest That is fair enough, we probably have different use-cases -- I just don't use restic the same way you do. I do backups of my laptop and want it to finish quickly so I can leave with my laptop. I take it you're talking about backups of servers or other constantly-connected machines where the backup-rate is not so important. Similarly I never need to restore all of my data under time pressure; if there's a time pressure (because you use it in professional contexts?), I can selectively restore the data I need, and proceed to do the rest later on.

You make a very good point about the "many times small inputs;" that is important to consider when considering these benchmarks.

@viric The effect you're referring to is considered in the Squash benchmarks under the section called Transfer + Processing :-)

@ibib ah, gotcha!

@ibib can you link where you got that chart from?

I have been doing some tests with brotli and zstd, and I noticed that my results don't match those of the squash benchmark at all. Then I realized that that benchmark is 1.5 years old.

zstd works very well for me. Fast + high ratio, and its "level" allows a very big span between fast and high ratio. Great thing.

Brotli works very slow for me, with no better compression ratio than a much faster zstd. And brotli seems focused at small files of English texts (it includes an English dictionary). For html compression or similar.

More recent benchmarks I found: https://github.com/inikep/lzbench

So I threw my test bed one more time against zbackup with LZMA compression.

35G backup-unencrypted
6.4G    backup-unencrypted.tgz
2.5G    zbackup

Impressive, isn't it?

Suffice to say, zbackup has its own set of limitations and drawbacks.

So, according to @viric's lzbench link, the most appropriate compressor is one that doesn't slow down backups (high compression speed, >200 MB/s) and that actually has a good compression ratio (>=50), right?

So, I've filtered out the results from the ordered-by-ratio table.

I've also done a _quick_ search for Go implementations (that's why I've kept them in the table). Strikethrough means that I didn't find any implementation, which eliminated almost everything. Since it was just a quick search, I've kept the results. The exception is zstd, which is just a wrapper.

| Compressor name | Compression| Decompress.| Compr. size | Ratio |
| --------------- | -----------| -----------| ----------- | ----- |
| zstd 1.1.4 -1 | 242 MB/s | 636 MB/s | 73654014 | 34.75 |
| lizard 1.0 -30 | 258 MB/s | 867 MB/s | 85727429 | 40.45 |
| density 0.12.5 beta -3 | 253 MB/s | 235 MB/s | 87622980 | 41.34 |
| gipfeli 2016-07-13 | 233 MB/s | 451 MB/s | 87931759 | 41.49 |
| pithy 2011-12-24 -9 | 257 MB/s | 1263 MB/s | 90360813 | 42.63 |
| pithy 2011-12-24 -6 | 295 MB/s | 1268 MB/s | 92090898 | 43.45 |
| quicklz 1.5.0 -1 | 346 MB/s | 435 MB/s | 94720562 | 44.69 |
| lizard 1.0 -20 | 284 MB/s | 1734 MB/s | 96924204 | 45.73 |
| pithy 2011-12-24 -3 | 352 MB/s | 1222 MB/s | 97255186 | 45.89 |
| lzrw 15-Jul-1991 -4 | 243 MB/s | 392 MB/s | 100131356 | 47.24 |
| lzo1x 2.09 -1 | 394 MB/s | 551 MB/s | 100572537 | 47.45 |
| lz4 1.7.5 | 452 MB/s | 2244 MB/s | 100880800 | 47.60 |
| fastlz 0.1 -2 | 243 MB/s | 469 MB/s | 100906072 | 47.61 |
| lzo1y 2.09 -1 | 397 MB/s | 556 MB/s | 101258318 | 47.78 |
| lzo1x 2.09 -15 | 406 MB/s | 549 MB/s | 101462094 | 47.87 |
| density 0.12.5 beta -2 | 480 MB/s | 655 MB/s | 101706226 | 47.99 |
| lzf 3.6 -1 | 251 MB/s | 565 MB/s | 102041092 | 48.14 |
| snappy 1.1.4 | 327 MB/s | 1075 MB/s | 102146767 | 48.19 |
| blosclz 2015-11-10 -9 | 220 MB/s | 696 MB/s | 102817442 | 48.51 |
| pithy 2011-12-24 -0 | 384 MB/s | 1221 MB/s | 103072463 | 48.63 |
| lzo1x 2.09 -12 | 418 MB/s | 550 MB/s | 103238859 | 48.71 |
| lizard 1.0 -10 | 360 MB/s | 2625 MB/s | 103402971 | 48.79 |
| fastlz 0.1 -1 | 235 MB/s | 461 MB/s | 104628084 | 49.37 |
| lzrw 15-Jul-1991 -3 | 226 MB/s | 449 MB/s | 105424168 | 49.74 |
| lzf 3.6 -0 | 244 MB/s | 550 MB/s | 105682088 | 49.86 |
| lzo1x 2.09 -11 | 424 MB/s | 560 MB/s | 106604629 | 50.30 |
| lz4fast 1.7.5 -3 | 522 MB/s | 2244 MB/s | 107066190 | 50.52 |
| tornado 0.6a -1 | 233 MB/s | 334 MB/s | 107381846 | 50.66 |
| memcpy | 8657 MB/s | 8891 MB/s | 211947520 |100.00 |

LZ4 looks like the most appropriate compressor?

guess you want lz4 (>=1.7.0 r129) and zstd (>=1.3.0), if any. we also use these for borgbackup.

BUT zstd is very tunable with one single integer, from lz4 speed to better than xz compression. That would make both users of dense, slower compression and users of fast compression happy. Not to mention that zstd decompresses very fast, regardless of the compression effort.

lz4 is quite narrowly purposed.


Well... according to https://github.com/restic/restic/issues/21#issuecomment-250983311 about keeping restic dependency-free, zstd isn't an option for now. Also, there are a few threads about patent/licensing issues.

As for xz and high compression ratios, even for the lower compression settings, according to the table, the fastest compression is about 15MB/sec.

If the requirement for fast backup is lowered, let's say, >=30MB/sec, we could add:

| Compressor name | Compression| Decompress.| Compr. size | Ratio |
| --------------- | -----------| -----------| ----------- | ----- |
| xz 5.2.3 -9 | 1.70 MB/s | 56 MB/s | 48745306 | 23.00 |
| xz 5.2.3 -6 | 1.89 MB/s | 58 MB/s | 49195929 | 23.21 |
| xz 5.2.3 -3 | 4.18 MB/s | 55 MB/s | 55745125 | 26.30 |
| zstd 1.1.4 -8 | 30 MB/s | 609 MB/s | 61021141 | 28.79 |
| zling 2016-01-10 -2 | 32 MB/s | 136 MB/s | 61917662 | 29.21 |
| xz 5.2.3 -0 | 15 MB/s | 44 MB/s | 62579435 | 29.53 |
| zling 2016-01-10 -0 | 38 MB/s | 134 MB/s | 63407921 | 29.92 |
| zstd 1.1.4 -5 | 88 MB/s | 553 MB/s | 64998793 | 30.67 |
| lzfse 2017-03-08 | 48 MB/s | 592 MB/s | 67624281 | 31.91 |
| libdeflate 0.7 -6 | 64 MB/s | 609 MB/s | 67928189 | 32.05 |
| brotli 2017-03-10 -2 | 98 MB/s | 289 MB/s | 68085200 | 32.12 |
| zstd 1.1.4 -2 | 185 MB/s | 587 MB/s | 70164775 | 33.10 |
| tornado 0.6a -4 | 91 MB/s | 197 MB/s | 70513617 | 33.27 |
| libdeflate 0.7 -3 | 96 MB/s | 602 MB/s | 70668968 | 33.34 |
| xpack 2016-06-02 -1 | 98 MB/s | 506 MB/s | 71090065 | 33.54 |
| tornado 0.6a -3 | 119 MB/s | 188 MB/s | 72662044 | 34.28 |
| libdeflate 0.7 -1 | 117 MB/s | 570 MB/s | 73318371 | 34.59 |
| lizard 1.0 -42 | 90 MB/s | 938 MB/s | 73350988 | 34.61 |
| zstd 1.1.4 -1 | 242 MB/s | 636 MB/s | 73654014 | 34.75 |

There are multiple deflate implementations, but I'm unsure whether they are comparable.
xz is left in for reference.
zstd looks so promising. Too bad there's no Go implementation.

@viric zstd is not quite lz4 speed.

but if one would want to have just one compressor rather than multiple, zstd is more flexible.

Forgive my lateness. Some comments:

Compression speed: Is important because we are going to be doing this operation every time we add a block, so we really want something that can keep up with AES-NI and general I/O so as not to become a bottleneck. We do probably not want to pick an algorithm that compresses slower than it decompresses since we have the opposite use case of web browsers that these new algorithms (like zstd, lz4, brotli) are optimized for (we have "compress often, decompress seldom" as opposed to "compress once, decompress often").

No, it's not necessary to compress at hardware-accelerated AES speeds. Compression is about trading time for size. It's completely expected that compressed backups will take longer.

For example, rather than using restic on my personal backups, I'm still using Obnam, because on one of the small servers I store them on, if they were not compressed, they wouldn't fit. The backups already take hours, and they run in the background so I don't even notice.

I don't care if restic's compressed backups take longer. Indeed, I expect them to, and it's the tradeoff I need to make. Without this kind of compression, restic wouldn't be useful for me.

Decompression speed: The decompression speed is only relevant when we are restoring. If we are okay with a slow restore, we can accept a slow decompression speed. On the other hand we also have metadata that we do not want to decompress slowly, so that might even warrant two different algorithms.

Restores are done far less frequently than backups, so decompression speed is not as important. Someone mentioned that they are often done under time pressure: this is true, but it doesn't mean that restores need to be as fast as backups, or anywhere close to it.

We also have two (at least?) different kinds of data: The indexes; and the data chunks. The two are used differently, and I guess it could be discussed whether it would make sense to choose different algorithms for them.

It might not be necessary (or necessarily a good idea) to compress the indexes at all. Being indexes, it seems unlikely that they will compress well in the first place, as their whole purpose is to store unique data.

I personally think we should stick with one algorithm (whatever we choose) so that re-implementing Restic (in a new language or whatever) is not made unreasonably hard. And so that we don't expose ourselves to bugs from two exciting compression algorithms since those are hard to test for corner cases.

I understand these concerns, but I think that would be a mistake. At the least, the repo format must allow for multiple compression algorithms so that newer ones can be added in the future. There should probably be pluggable modules for compression so users can select the ones they want to use, e.g. I could imagine Debian packages like restic-xz, restic-zstd, etc. that users could install if they wanted to use those algorithms. The compression of backup data should be abstracted so that restic hands a compression function some data and gets it back compressed, and restic shouldn't care what happens in between; the same for decompression.

If the requirement for fast backup is lowered, let's say, >=30MB/sec, we could add

That seems reasonable to me. Remember that local backups are only one kind; network backups are less likely to be bottlenecked by compression speed.

But, again, this should be tunable by users so they can select the appropriate solution for their needs.

Added 10$ bounty for a :beer: :)

šŸŗ++

Here is a link to the BountySource if someone else would like to contribute: https://api.bountysource.com/badge/issue?issue_id=6096108

I wonder if this can be implemented in a user-configurable manner so that the choice of speed vs. size is left to the user. I would prefer higher compression as a default.

Let's decide when we get there. For the record: I'm fine with giving the user a bit of control in terms of speed vs. size.

+1 for restic needing a compression implementation. I'm using restic to back up VM images to Backblaze and would love to be able to compress those before uploading. In my use case I would trade an almost infinite amount of time/CPU to reduce the size of the transferred/stored data. I realize speed is more of a concern for some, though. Having a pluggable architecture where multiple algorithms can be selected is key.

I'm happy to help test as this is looked at further.

@fd0 It has been a while since I worked on the restic code base. Is it possible for you to give a quick direction on a good approach and where I should look?

@klauspost It's not so much adding compression on a technical level, that's rather easy to do, but how we handle upgrading the repo format in a backwards-compatible way. I'm currently busy rewriting the archiver part (so that ugly things like #549 go away), after that I'd like to add compression and then switch to repo v2.

What is your take on which compression algorithm we should use? I'm thinking about supporting three modes:
1) No compression
2) "Linear time" compression (does not add much CPU load)
3) "Max compression"

Maybe the first and second mode will be the same, I'm not sure yet.

It'd be most awesome to be able to use something like zstd, but as native Go code. Damian hinted that it may not be much work to port either the Java or the C version: https://twitter.com/dgryski/status/947259359628738560, is there anything I can do to get you interested in trying that? :)

I have looked at zstd format spec, and to me it is not trivial to implement (well). The Java sources are only decompression.

For fast compression, LZ4 should do very fine. The Go port is excellent. zstd would be better, but I would go with a tried-and-tested package, unless you want to use the cgo implementation.

For middle-of-the-road, deflate compression is still good speed/compression-wise. Well tested, etc.

High compression is a bit more tricky. It does, however, seem like there is a native LZMA(2) Go implementation in the github.com/ulikunitz/xz package. There are some caveats about stability and performance on the README. There is no need for the xz wrapper, since you have uncompressed hash and size already. I can give it a whirl and see how it compares.

I took a look at the source to find the natural place to insert the compression step. What made sense to me was to have compression here and decompression here. But I see the challenge in identifying and keeping track of compression.

You could also take a look at a "compressibility estimator" I made. It will give a quick estimate of how compressible a blob of data is. It typically operates at >500MB/s, so it could be used to quickly reject hard-to-compress data.

You could also take a look at a "compressibility estimator" I made. It will give a quick estimate of how compressible a blob of data is. It typically operates at >500MB/s, so it could be used to quickly reject hard-to-compress data.

Love the compressibility estimator! Avoiding attempts to compress uncompressible data would gain a lot of speed.

Zstd has something like that built-in: [1]

Zstd pass faster over incompressible data. Expect something > 1 GB/s

Although I haven't found any explicit benchmarks of that.

The xz package looks like a good deal for lzma. Did some quick tests with the default settings:

| algorithm | level | insize | outsize | millis | mb/s | ratio |
|-----------|-------|------------|-----------|--------|--------|--------|
| lz4 | - | 1000000000 | 625968314 | 5454 | 174.85 | 62.60% |
| flatekp | 1 | 1000000000 | 391051805 | 12367 | 77.11 | 39.11% |
| flatekp | 5 | 1000000000 | 342561367 | 20164 | 47.3 | 34.26% |
| flatekp | 9 | 1000000000 | 324191728 | 43351 | 22 | 32.42% |
| lzma2 | | 1000000000 | 291731178 | 149437 | 6.38 | 29.17% |
| lzma | | 1000000000 | 291688775 | 161125 | 5.92 | 29.17% |

Very reasonable speed/compression tradeoff. All are single core performance on enwik9 - medium compressible text body. Obviously, I didn't have the time for testing a full VM image or something like the 10GB corpus with more mixed content.

It doesn't appear that lzma2 offers a great deal in its current implementation over standard lzma. Since you are dealing with small blocks the difference should be quite small.

Zstd has something like that built-in

Yes, as do lz4 and deflate, however, I have not seen it as fast as a dedicated function.

zstd is really impressive, no doubt. Benchmarks using the cgo implementation:

| level | insize | outsize | millis | mb/s | ratio |
|-------|------------|-----------|--------|--------|--------|
| 1 | 1000000000 | 358512492 | 5100 | 186.96 | 35.85% |
| 2 | 1000000000 | 332265582 | 6264 | 152.24 | 33.23% |
| 3 | 1000000000 | 314403327 | 8099 | 117.75 | 31.44% |
| 4 | 1000000000 | 310346439 | 8588 | 111.04 | 31.03% |
| 5 | 1000000000 | 305644452 | 12739 | 74.86 | 30.56% |
| 6 | 1000000000 | 292551252 | 18531 | 51.46 | 29.26% |
| 7 | 1000000000 | 287414827 | 23212 | 41.08 | 28.74% |
| 8 | 1000000000 | 282783804 | 27811 | 34.29 | 28.28% |
| 9 | 1000000000 | 280432907 | 31752 | 30.03 | 28.04% |

Forgive me if I missed something, but I didn't see quite these questions answered earlier.

  1. We seem to be talking about compression on the chunk level, not on the file level, correct?
  2. If so, that obviously limits the effectiveness since duplicated data in multiple chunks of a single file will be stored and compressed for each chunk.
  3. However, that obviously depends on the chunk size as well.
  4. So, what is the average chunk size? Seems like this is an important factor in how useful compression is.
  5. If the chunk size is fairly small, perhaps we should consider full-file, pre-chunking compression for highly compressible files (e.g. using @klauspost's estimator). For example, a 50 MB text file (e.g. log files, large Org-mode files, etc) is likely to be highly compressible as a single file. But if it's chunked first, and then each chunk is compressed individually, not sharing an index, that will greatly limit the effectiveness of the compression (IIUC).

Thanks.

If we would compress whole files, that could tamper with the de-duplication algorithm possibly making it less efficient.

Other than that, let's not forget that any compression, while offering tremendous advantages space-wise, opens us up to a side-channel attack. From the size of compressed data one can make an educated guess about its contents. I think this was mentioned before, but still.

@alphapapa

We seem to be talking about compression on the chunk level, not on the file level, correct?

Yep, on the chunk level.

If so, that obviously limits the effectiveness since duplicated data in multiple chunks of a single file will be stored and compressed for each chunk. However, that obviously depends on the chunk size as well. So, what is the average chunk size? Seems like this is an important factor in how useful compression is.

We're aiming for 1MiB, but it can be as large as 8MiB.

If the chunk size is fairly small, perhaps we should consider full-file, pre-chunking compression for highly compressible files (e.g. using @klauspost's estimator). For example, a 50 MB text file (e.g. log files, large Org-mode files, etc) is likely to be highly compressible as a single file. But if it's chunked first, and then each chunk is compressed individually, not sharing an index, that will greatly limit the effectiveness of the compression (IIUC).

At first I'd like to integrate compression on the chunk level, and see how well that performs in real-life scenarios. We can revisit this idea later.

@klauspost Thanks a lot for taking the time to benchmark some algorithms/implementations and your recommendations, I appreciate it! While it would be nice to have zstd, I think not depending on cgo is much more important for the project as a whole. And using a compressibility estimator is a great idea, I love that.

The places you mentioned for adding compression/decompression sound good, but we need to track the metadata for that somewhere else. I think we'll probably add meaning to bits in the byte in the pack header, see http://restic.readthedocs.io/en/latest/100_references.html#pack-format. This is the part that needs to be done very carefully.

So, let me finish with #1494, then we'll see that this gets resolved.

@sanmai re: side-channels: I brought that up originally.
Various solutions were suggested, I personally would be content with:

  • having configuration options for whitelisting/blacklisting use of compression (similar to what we have for file inclusion)

Another idea was to try to hide the chunk boundaries in the pack files, which would theoretically make this harder. But I feel you would still be able to time network writes, and side channels like which filesystem extent a chunk was written to, and so on, could be used to infer the boundaries. So I feel the safest/easiest approach would be to just advise against compressing sensitive data.

This would be awesome! :beer: +$10

Just throwing it out there, but setting aside lzma or any of the more general compression algos, what about just run-length encoding or zero-squashing? Or would this not be sufficiently useful to enough people?

(I have a dog in this hunt, I often am backing up huge WAV files with lots of silence.)

+$15

Just throwing it out there, but setting aside lzma or any of the more general compression algos, what about just run-length encoding or zero-squashing? Or would this not be sufficiently useful to enough people?

Also useful for backing up VM drives with mostly empty space / sparse files (not sure if restic already supports backup/restoring sparse files)

@bherila restic does not support archiving/restoring sparse files yet, the files will be stored in the repo as if they just contain many zeroes. These large blocks of zeroes are deduplicated, so it'll not take up much space in the repo. For restore though, you will end up with a regular (non-sparse) file without "holes".
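For the curious, restoring holes is mostly a matter of seeking over all-zero blocks instead of writing them. A minimal sketch of the idea (not restic's restorer; the 64 KiB block size is an arbitrary choice):

```go
package main

import (
	"bytes"
	"io"
	"log"
	"os"
)

// writeSparse writes data to path, seeking over all-zero blocks so the
// filesystem can leave holes. Illustrative only.
func writeSparse(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	const block = 64 * 1024
	zero := make([]byte, block)

	for off := 0; off < len(data); off += block {
		end := off + block
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		if bytes.Equal(chunk, zero[:len(chunk)]) {
			// Skip the zero block; the seek leaves a hole behind.
			if _, err := f.Seek(int64(len(chunk)), io.SeekCurrent); err != nil {
				return err
			}
			continue
		}
		if _, err := f.Write(chunk); err != nil {
			return err
		}
	}
	// Extend to the final size so a trailing hole is preserved.
	return f.Truncate(int64(len(data)))
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB, mostly zeroes -> mostly holes
	copy(data, []byte("header"))
	if err := writeSparse("/tmp/sparse-demo", data); err != nil {
		log.Fatal(err)
	}
}
```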

I just wanted to check, is there already some kind of compression? I've backed up several computers, including one with 50GB of data, and I get a much lower number on the server:

# du -shc /home/restic/
40G     /home/restic/
40G     total

@Alwaysin It's probably the deduplication, unless some files were excluded of course.

@rawtaz thank you, I wasn't aware about deduplication, must be that!

@iluvcapra squashing of large repeated blocks is already implemented through deduplication, as mentioned by @rawtaz.

@klauspost did you see this? https://github.com/mvdan/zstd

Yes, but honestly a stream decoder is the easy part. I have finished FSE en/decoding and have a Huffman encoder ready. Once the Huffman decoding is done a zstd stream decoder is pretty straightforward, with the full encoder being the final part.

LZ4 is perfectly sufficient and would also be a quick win.

Why not add lz4 and create another PR to support zstd?

Why not add lz4 and create another PR to support zstd?

@dave-fl because we need to be very careful when we modify the repository format. It must be done in a backwards-compatible way. The most important part of the whole project is the repo format, not the implementation. People depend on us to not screw up the format so they can restore their data :)

I think we shouldn't wait too long for compression. I just did some tests on a few repositories of server backups, and I gain exactly nothing when I gzip the repository! Like @Alwaysin I already save about 30% through deduplication.

About backwards compatibility: do you mean restic should read both formats, or that a tool should migrate from the old format to the new one? While restic is not yet at v1.0.0, I believe it's OK to just migrate.

I just did some tests on a few repositories of server backups, and I gain exactly nothing when I gzip the repository!

Uhm, that's expected: All data in the repo is encrypted, so it is hardly compressible at all. If compression is used, it must be done on data before encrypting it.
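A quick way to convince yourself of this, independent of restic: gzip some repetitive text and the same amount of random bytes (standing in for ciphertext) and compare the sizes.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/rand"
	"fmt"
)

// gzipSize returns the gzipped size of data in bytes.
func gzipSize(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Len()
}

func main() {
	text := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog\n"), 1000)

	random := make([]byte, len(text)) // stands in for encrypted data
	rand.Read(random)

	fmt.Printf("plaintext:  %d -> %d bytes\n", len(text), gzipSize(text))
	fmt.Printf("ciphertext: %d -> %d bytes\n", len(random), gzipSize(random))
	// The random buffer typically comes out slightly *larger* after gzip.
}
```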

I don't see how using LZ4 makes things non backwards compatible. Compression is compression. Why not support multiple formats?

You're right, I didn't think about that.
However, when I gzip the source I don't gain more than 30%; deduplication is already very efficient on a big directory with many duplicates. But of course with both it could be impressive.
With zpaq, which does compression and deduplication, I gain just a little more, not much.
I'm very open to testing a branch with compression, it doesn't matter if it's not compatible!

I don't see how using LZ4 makes things non backwards compatible. Compression is compression. Why not support multiple formats?

What happens if 2 clients are using the same repo but 1 of them uses an older version of restic which doesn't support compression? This feature needs to be designed carefully with all possible corner cases in mind.

I'd prefer no compression over a half-working solution that possibly breaks previous backups.

I think there has been enough discussion on the issue of adding compression. I can see it's a highly anticipated feature. I will tackle this next after finishing the new archiver code (see #1494).

Please don't add any further comments, thanks!

@dimejo What you are stating has nothing to do with what I've proposed. Whether you choose to implement zstd or lz4 would impact both cases.

I dare say that the CGO version of zstd looks somewhat portable :)

I looked, very briefly, at how feasible it would be to write a Go implementation of zstd, based on the specification.

zstd uses mostly in-house algorithms but (optionally) relies on the xxHash-64 checksum for error checking, and there's a Go port of that. Since the optional bits are, well, optional, you wouldn't have to implement those parts to get zstd support for a reader/writer in restic. zstd supports the concept of "dictionaries" to optimize compression - I am not sure how that would interact with restic, but it would be an interesting area of research to compress specific parts of the archive, e.g. JSON or metadata streams. Otherwise that part could also be skipped, as it is optional.

Where it gets trickier, of course, is where entropy coding kicks in. zstd uses a novel approach there called Finite State Entropy (FSE, a variation of [ANS](https://en.wikipedia.org/wiki/Asymmetric_numeral_systems)), of which only a C implementation exists as well. Other entropy coding bits are implemented with Huffman coding, of which there are several implementations, including two in the standard library: one in compress/flate and another in http2's hpack package, which is rather odd.

From what I can tell, everything else is glue on top of that: some Huffman trees, sequences, frames and blocks. There are interesting properties in the way blocks and frames are built that might map well onto restic blobs, which might make it possible to compress the repository as a whole while keeping blobs separate inside, although I haven't looked into that in detail. It might also make the coupling between the repository format and compression unacceptable.

zstd is considerably more complicated than gzip or xz, with about 70k lines of code (according to cloc) compared to 36k and 12k, respectively. That includes tests, however, which are numerous: when those are ignored, the implementation itself is roughly comparable with gzip (~34k).

So, in summary, it's just a matter of time before this is implemented in Go. I believe such an engine could also leverage Go's parallelism because zstd "frames" are independent from each other. It's unclear to me, however, how frames are used in practice: most streams I tested had only one (zstd /etc/motd) or two (zstd isos/Fedora-Workstation-Live-x86_64-27-1.6.iso) frames (as found by binwalk -R "\x28\xb5\x2f\xfd"), so maybe not such a gain there, because blocks are interrelated and less parallelizable...

Anyway, all moot unless someone here wants to actually sit down and port it, but I figured I would share what I found while reading the spec... Considering that zstd, like LZMA, is part of the LZ77 family of compressors, it shouldn't be unfeasible to port.

Any update about compression? I know that a lot of people want to wait for zstd, but what would be wrong with implementing lz4 or lzo or lzma?

If there was an update, this issue would be updated.

Let's try to respect the author's request in the meantime, though:

Please don't add any further comments, thanks!

@fd0 , just wanted to point out that there seems to be a pure Go implementation of zstd algorithm https://github.com/klauspost/compress/tree/master/zstd . I have not tried this myself. But this got me excited about the possibility of compression support in restic.
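For anyone who wants to experiment with that package: assuming its documented EncodeAll/DecodeAll API, per-chunk compression could look roughly like the sketch below (illustrative only, not restic code).

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// One encoder/decoder pair can be reused for many chunks.
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		log.Fatal(err)
	}
	dec, err := zstd.NewReader(nil)
	if err != nil {
		log.Fatal(err)
	}

	chunk := []byte("SELECT * FROM users; -- imagine a highly repetitive SQL dump here")

	compressed := enc.EncodeAll(chunk, nil)
	restored, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("original %d bytes, compressed %d bytes, roundtrip ok: %v\n",
		len(chunk), len(compressed), string(restored) == string(chunk))
}
```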

I don't know the Go zstd stuff (speed? code quality? maintenance?), but the C zstd stuff is about all a backup tool needs as it supports a wide range from fast/little to slower/high compression.

If we didn't already have all the other compression algorithms (lz4, zlib, lzma) in borgbackup and were adding compression now, I guess we could live with just zstd and none.

As a matter of taste/preference, the default could be none (as it was before) or a very fast zstd level (that overall still makes most backups faster as there is less data to transfer).

Hello,
in my opinion compression is not a must-have feature for restic. I have compared a backup of my data done with Duplicati (with compression) and restic (without compression), and the overall used space was really similar.
I need restic only to get fast and reliable incremental backups. No need to squeeze out every last bit...
Restore matters too, and restic is well suited for disaster recovery. Duplicati is a nightmare because if you lose the local DB, the repair task takes days...

Thank you @fd0 and thanks to all contributors!

@filippobottega if you did not see a big difference in your experiment, that means either:

  • that your data was not (much) compressible (but this is not the case in general), or
  • that duplicati had some compression-unrelated worse storage efficiency (e.g. due to different storage format, granularity, algorithms, whatever...), so the compression savings were compensated by losses in other areas.

Neither means that compression is pointless.

@ThomasWaldmann I don't see a big difference for the first reason.
Data today is already compressed in many ways: docx, xlsx, pptx, zip, 7z, jpeg, tif and so on are all compressed formats. ISO images also contain compressed files. For this reason I think compression is pointless in restic.

@filippobottega Your view is a bit narrow-minded on what data people are using restic to backup. What about SQL dumps, source code, data sets, raw images and so on? De-duplication is doing a great job at reducing the delta size between backups, however it does nothing to reduce the original size of the data set. In case of uncompressed formats this could mean many gigabytes. Not to mention that storing an uncompressed format and then compressing + deduplicating could yield better results than deduplicating the already compressed files.

SQL dumps were my first thought, but restic also backs up my mail server and it seemingly gets better overall compression based on some RAR snapshots I took when moving from Duplicati to restic.

I can see the use-case for making compression optional and having a default list of file types, but compression would save me a reasonable amount of money.

@mrschyte

Your view is a bit narrow-minded on what data people are using restic to backup.

Now now, no need to get personal. His perspective is just as valid as yours and it's one worth considering. I've found that most data I back up is already compressed due to the file formats.

What about SQL dumps

Do you really store your SQL dumps uncompressed? I gz all mine before backing them up, because I have no need to store them raw.

source code, data sets, raw images and so on

I think the only valid use case for compressed backup is with large, uncompressed files with lots of repetition _which are actively being used and thus are not stored compressed already_. In my experience (which includes years of managing other people's data), very little data falls in this category. At least, not enough to make a huge difference in those cases.

however it does nothing to reduce the original size of the data set.

Arguably, that is not a backup program's job. It shouldn't touch the original data.

Not to mention that storing an uncompressed format and then compressing + deduplicating could yield better results than deduplicating the already compressed files.

Many compression algorithms rely on duplication to do their work (see flate's dictionaries), so I'm not convinced by this _in general_ though I agree this is right at least some times.

(I'm not saying that compression in restic is _bad_ when done correctly, I'm just arguing that it needn't be a priority -- especially compared to lingering performance issues -- and we should respect @fd0's time constraints and wishes with regard to vision.)

@mholt I would agree in general, however doing a root backup (via some dump or even iterating on the contents of /), yields a nice compression ratio for me. Not essential, as the total used is already small, but I get around 50% savings, and that's always nice to have for "free" as far as the end user is concerned.

Try this test.

  1. Take an SQL dump or some other uncompressed file. Compress it, then use restic to back it up.
  2. Remove a table from the SQL database, take a second dump, compress it, then use restic to back it up.

I believe you will find that because the compression is done BEFORE deduplication, you will almost entirely defeat restic's dedup algorithm. However, if restic could handle the compression AFTER deduplicating, you should get a much smaller overall output.

In the enterprise storage industry, with tools like DataDomain, it is always recommended to feed the data to the storage device in an uncompressed format and let the device do deduplication and then compression. The general order in which these techniques should be applied is deduplication, compression, then encryption. Think about it for a second: do you really want to spend all the extra CPU compressing the same data multiple times, only for it to be deduped and essentially discarded anyway? It's generally accepted that it's best to reduce the dataset through dedup first before expending the potentially heavy effort of compression.


Do you really store your SQL dumps uncompressed? I gz all mine before backing them up, because I have no need to store them raw.
Arguably, that is not a backup program's job. It shouldn't touch the original data.

It feels like this kind of view hinders efficient storage of data? The idea that every program and export operation should implement its own adhoc compression format is something I'm trying to move away from, because it prevents deduplication/compression/etc from working on anything but a predefined per-file (or directory tarball) scope. Compressing files individually loses the ability to find commonality across different files/dumps/etc and subsequently you lose all benefits of deduplication. Keeping things uncompressed allows a filesystem (zfs, btrfs, etc) to do this all for you, and better since it can both compress and dedupe across folders, snapshots, etc. and abstract this all away while retaining compatibility with tools that need to work with the uncompressed data.

Compression can be viewed as just an additional optimization on top restic's deduplication, but they seem incompatible with each other if done separately... Suggesting that one should compress and preprocess files before backing them up takes everything back to a workflow where you may as well just use rsync/rclone instead, so why use restic in the first place?

It feels like this kind of view hinders efficient storage of data? The idea that every program and export operation should implement its own adhoc compression format is something I'm trying to move away from, because it prevents deduplication/compression/etc from working on anything but a predefined per-file (or directory tarball) scope. Compressing files individually loses the ability to find commonality across different files/dumps/etc and subsequently you lose all benefits of deduplication. Keeping things uncompressed allows a filesystem (zfs, btrfs, etc) to do this all for you, and better since it can both compress and dedupe across folders, snapshots, etc. and abstract this all away while retaining compatibility with tools that need to work with the uncompressed data.

It isn't just efficient storage of data, it's also the existing workflows. I want a backup product to back up my data reliably and efficiently, not to enforce workflows across other aspects of the system. It is far more important for backups (which are retained, potentially indefinitely) to be stored efficiently, whereas live working data should be stored in the best format for how it is actively used.

Now there are cases where it makes sense to compress before storing, especially with very compressible data on slow storage systems, but this is the exception more than the rule in my experience.

+1 compression would really help me! At work as a software engineer I back up my entire home folder, which has a lot of uncompressed source code (and in dynamic languages, like ruby or python, it's almost always source code -- even most dependencies).

At home I back up my entire / , again including many things that benefit from compression, such as binaries, man files, resource files, etc.

Of course I could compress all of them, and do many transformations before I back them up, but that would defeat a lot of the convenience of just running a very simple command, and being able to get a back-up, as well as easily restore things.

Now of course there are many classes of files that don't compress well, but nobody's saying they should be forced to be compressed. There are many approaches out there to solve this -- whitelist which file types should be compressed, blacklist which should not be, or even the simplest one: try to compress, and if the resulting size does not improve, store uncompressed (I believe ZFS uses this approach when on-disk compression is enabled).
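The "try it and keep whichever is smaller" idea mentioned above fits in a few lines; a sketch using only the standard library (the helper name is made up for illustration):

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// maybeCompress returns the deflate-compressed form of chunk if that is
// actually smaller, otherwise the original bytes, plus a flag saying which
// one was kept. Sketch only; a real implementation would also need to
// record the flag next to the stored chunk.
func maybeCompress(chunk []byte) ([]byte, bool) {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
	w.Write(chunk)
	w.Close()
	if buf.Len() < len(chunk) {
		return buf.Bytes(), true
	}
	return chunk, false
}

func main() {
	text := bytes.Repeat([]byte("log line\n"), 500)
	stored, compressed := maybeCompress(text)
	fmt.Printf("stored %d of %d bytes, compressed=%v\n", len(stored), len(text), compressed)
}
```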

In the end, compression is an example of the classic space vs time trade-off: do you want to pay more cpu, or more storage? In my case, storage dominates my costs, so I think it'd be great if my quad-core heated up a bit more, and then my file hosting bill was smaller.

Finally, I back up a little over 4tb to a cloud provider and my upload speed is weak anyway, so as a bonus compression would make my back up process faster, not slower -- my cpu can more than keep up with my poor VDSL connection.

Yep, I can only agree with all the others here. Compression is pretty important and I really see no argument why restic shouldn't have it.

@mholt I completely agree with you. Every word.
In my tool chain, compression comes before restic's deduplication because, for example, I use TFS as source control and all sources are already compressed in SQL backups, and application images are compressed in MSI setup files or 7z archives. I only need a fast and simple way to get the everyday delta and send it to the cloud to implement a secure disaster recovery plan.
I think @fd0 needs to focus his time on solving issues rather than adding more complexity to the product.

Just chiming in with a little comparison I did between borg using auto,zstd compression and restic (no compression), first on /, then on /home, excluding things like VM images and docker images (since I don't back them up in a real world backup either). The test machine was my daily software development machine, which contains many binary files, some compressed pictures and audio files, but also a fair amount of plaintext source code, which should compress quite nicely:

/: 1053136 files, 92.9 GiB

  • borg, none: 17:27 min, 64.1 GiB
  • borg, zstd: 19:29 min, 40.6 GiB
  • restic: 09:45 min, 62.4 GiB

/home: 221338 files, 58.3 GiB

  • borg, zstd: 09:06 min, 30.7 GiB
  • restic: 04:36 min, 39.4 GiB
I omitted borg without compression here, since it is roughly the same as restic as far as storage space is concerned.

Firstly, I want to applaud restic for being almost exactly twice as fast on that test case. Apart from borg being slower, it might be of interest that the compression only adds ~2 minutes to the overall backup duration (+11%), but significantly reduces the data to be stored for the / case (-35%). In case of my home directory, the storage savings are roughly 20%.

(The test was performed against an external disk in that case. When backing up to remote storage, the time the backup takes mostly depends on the upload bandwidth, at least when the CPU and IO speed are much higher than the network. I tested this, and borg with compression is in fact faster than restic then, because the compression results in less data being transferred.) All in all, I'm much in favor of restic gaining compression support, ideally using auto-detection to check whether a chunk benefits from compression.

@nioncode If my calculations are correct, you back up at about 100-150 MB/s. That is well below the speed at which zstd can compress. Since compression is asynchronous, there shouldn't really be any slowdown. It may even be a bit faster, since there is less data to write.
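The reason compression can be effectively "free" when the uplink or disk is the bottleneck is easiest to see with a worker-pool shape. A hypothetical sketch, not restic's archiver; in a real pipeline the compressed chunks would be handed to the uploader as they finish:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"runtime"
	"sync"
)

// compressAll compresses chunks concurrently, one worker per CPU,
// so compression work overlaps across chunks instead of serializing.
func compressAll(chunks [][]byte) [][]byte {
	out := make([][]byte, len(chunks))
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				var buf bytes.Buffer
				zw := gzip.NewWriter(&buf)
				zw.Write(chunks[i])
				zw.Close()
				out[i] = buf.Bytes()
			}
		}()
	}
	for i := range chunks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	chunks := [][]byte{
		bytes.Repeat([]byte("a"), 1<<20),
		bytes.Repeat([]byte("b"), 1<<20),
	}
	for i, c := range compressAll(chunks) {
		fmt.Printf("chunk %d: %d -> %d bytes\n", i, len(chunks[i]), len(c))
	}
}
```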

I know that archiving VMs may be a use case, but I'm trying to avoid the need to do it.
I'm trying to automate the entire VM build starting from the ISO and setup files.
In case of disaster recovery I want to be able to restore the entire VM using the backup of setup files, documents and database backups. And I'm trying to do it without user interaction.
This way I can avoid compressing and backing up a lot of junk files contained in a VM, like temp files, uncompressed files such as exe and dll, and so on.
I know it's not simple, but this lets me avoid compressing and deduplicating the same useless gigabytes of files, saving disk space and bandwidth.

Let's not clutter this thread with debates about who does things how; there's been enough of that.

Compression is a feature that many people want (myself included) and it can save both backup storage and upload time in case of slow-medium internet connectivity, for some people even 30% or more.

However, not everybody needs it and some people have adapted their workflow in a smart way to deal with it - or simply have the money or bandwidth or both to simply not care.

In any case, both sides have spoken.

@bjoe2k4 or are concerned about the negative security implications of compressing data before encrypting it, which gives away information about the plaintext data, as has also been brought up as an argument several times in this thread over the past few years. :)

Unless compression becomes mandatory then the security concerns of compression are simply a tradeoff a user can make. I'll take faster backups and reduced monthly and one-off costs over this theoretical risk (a risk which likely could not be exploited anyway as my data sets are large and the changes unpredictable, so the noise would drown out any attempt at generating a signal from compression).

I don't believe anyone is talking about making compression mandatory though.

My special use case is backing up LARGE sets of CSV files and SQL dumps. These files would be VERY compressible... and I don't want to / can't precompress them.

I really would like to have the compression feature since I pay for every GB of online storage.

Since this discussion is becoming a bit more active now I would like to share some findings I had with a patched restic version of some friends of mine. They added compression to restic (more or less quick and dirty as far as I know) and I will notify them about this post so they can comment on the implementation specifics if anyone is interested.
My use case is some really ugly banking software that has its own database format. We have to use this software for regulatory reasons, and the data we have are several TB of rather big files that can be compressed by roughly 90%. So, quite obviously, compression would save us a lot in backup storage, backup times and restore times.
My findings when comparing restic upstream, the patched restic with compression and our current backup solution with tar can be found here: https://gist.github.com/joerg/b88bf1de0ce824894ffc38f597cfef5f

| Tool | Backup Time (m:s) | Restore Time (m:s) | Backup space (G) | Backup space (%) | Backup (MB/s) | Restore (MB/s) |
| --------------------------- | ----------------- | ------------------ | ---------------- | ---------------- | ------------- | -------------- |
| Tar | 4:42 | 5:19 | 11 | 9.6% | 404 | 357 |
| Restic S3 local Upstream | 10:04 | 30:56 | 102 | 89.5% | 189 | 61 |
| Restic S3 local Compress | 5:43 | 19:28 | 8.6 | 7.5% | 332 | 98 |
| Restic Local Upstream | 8:33 | 26:06 | 102 | 89.5% | 222 | 73 |
| Restic Local Compress | 5:21 | 16:57 | 8.6 | 7.5% | 355 | 112 |
| Restic S3 Remote Upstream | 17:12 | 46:06 | 102 | 89.5% | 110 | 41 |
| Restic S3 Remote Compress | 5:27 | 21:42 | 8.6 | 7.5% | 349 | 88 |

I think restic would gain massively with optional compression of any kind because it reduces pretty much everything.

Not every file will compress well. Compressing a video file is probably worthless, but compressing an SQL dump most certainly is worthwhile. That's why filesystems like Btrfs first try to compress the first 128 KB of a file, and only if there's a significant compression ratio do they compress the whole file. It's definitely not perfect, but it's fast and should work for most use cases if it's decided to compress files individually.
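A probe-style heuristic along those lines is also cheap to express. In this sketch the 128 KiB probe size mirrors the Btrfs behaviour described above, while the 0.9 ratio threshold and the function name are arbitrary choices for illustration:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// probeCompressible gzips at most the first 128 KiB of data and reports
// whether the prefix shrank below the given ratio of its original size.
func probeCompressible(data []byte, ratio float64) bool {
	const probe = 128 * 1024
	if len(data) > probe {
		data = data[:probe]
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return float64(buf.Len()) < ratio*float64(len(data))
}

func main() {
	sqlDump := bytes.Repeat([]byte("INSERT INTO t VALUES (1, 'x');\n"), 10000)
	fmt.Println("compress whole file:", probeCompressible(sqlDump, 0.9))
}
```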

For those arguing against providing compression as an option: my use case is that I back up a mixture of mostly compressible file types whose content I have no control over, and it is unreasonable to expect me to compress the data on multiple machines (which either takes up more local disk space, in the case of compressing to a new archive, or makes the files unusable by their associated applications, if compressed in place) prior to performing a backup operation.

I'd prefer to be able to use restic as my DR backup tool, but I'm currently using borg (slow, massive RAM requirements, etc.) because the compression + dedupe it achieves saves me many gigabytes of network transfer per backup operation and easily over a terabyte of storage space (which I pay for by the month) in the cloud over the entirety of my backup set. I would be able to retain backups for longer or reduce my storage costs if restic supported compression.

Hello @joerg , thank you for sharing your tests.
Have you tried backing up the output of the tar compression task with restic?
I'm curious about comparing "Restic S3 Remote Compress" and "Tar" + "Restic S3 Remote Upstream".
Moreover, what you say seems to be not entirely true:

I think restic would gain massively with optional compression of any kind because it reduces pretty much everything

Seeing the test results, it seems that the CPU time needed by restic is 2x longer for local backup and 6x longer on restore. Not really good compared to Tar.

tar is not a compression algorithm, of course it is fast.
EDIT: oh, and by the way, if you tar a directory it will not use multiple threads per file, and it will also not work on two or more files at a time; instead it will scan the directory, add a file, and then go to the next. Quite slow. The problem is the archive format, which is not designed for adding files from multiple threads.

Seeing the test results, it seems that the CPU time used by restic is 2x slower in local backup and 6x slower on restore. Not really good compared to Tar.

I'm not completely sure of your point here. restic is slower than Tar, sure, but restic with compression is always faster than restic without, so restic would clearly benefit.

Tar is a useful comparison as a "best case on this hardware" baseline, but it lacks most of restic's other features (snapshots and data deduplication come to mind). Adding compression seems to only improve backup times, restore times and storage costs, all of which are important for a backup product.

@joerg Can your friends open a pull request and make their patches for restic with compression publicly available? Which compression algorithm do they use?

@joerg @thedaveCA
I apologize, I misunderstood the meaning of @joerg's assertion. Clearly restic with compression is better than restic without compression. Now my question is: is Tar + restic better or worse than restic with compression?

Please note that we are not using bare tar archives but gzipped ones, created with a parallel gzip (pigz) implementation; otherwise archiving terabytes of data would take days instead of the "just" hours it takes right now: https://gist.github.com/joerg/b88bf1de0ce824894ffc38f597cfef5f#tarpigz
@shibumi I informed them about this issue and my posting, so it is now up to them whether and to what extent they want to be involved. Personally, I hope they will open that pull request...

Compression is a no-go for encryption. It lets an attacker guess whether an encrypted repository contains a certain file, since a section (chunk) of a file compresses to the same size independent of the encryption key used. This is a very well known weakness of encryption protocols, and it is why compression was removed from TLS 1.3.

Let's not create a known problem where there is none, shall we?

(I think this problem was already mentioned, and probably more than once. Still, this issue is open, when I feel that for this reason alone it should be closed once and for all.)

Why are you spamming the issue? :( It has been discussed so many times that it's almost off-topic. You will not be FORCED to enable compression!!

Moreover, I think your attack idea requires the attacker to be able to control the data to be compressed and encrypted (I am not sure though!). https://en.m.wikipedia.org/wiki/CRIME

But in any case, even if it is a security concern, someone may want to use compression only with storage that is under their own control, simply to save storage space.

Having even an optional feature that weakens the encryption invites a false sense of security. Restic claims to be a _secure_ backup program. Adding optional compression would void this promise, as you can't be secure some of the time, only all of the time. And there will be CVE reports, for sure. Who wants that kind of "popularity" for their software?

But I think that adding compression in a way that ensures it will never be used together with encryption is a viable option.

FWIW, in 2017 I made a demo where I stripped encryption from restic and showed that compression can be very effective. A hundred times more effective. IIRC compression can be added as a kind of wrapper, just like encryption, but I haven't looked at the code for a long time, so things may be harder these days, or easier.

Actually, CRIME needs to know the length of the ciphertext, which is basically impossible to obtain in restic.
Also, there is no "secure" backup program: if the backup files are accessible to third parties, there is always a chance that somebody can tamper with or, worse, read the data.
So saying that compression makes it worse is just stupid.

actually CRIME needs to know the length of the ciphertext

CRIME does, but you don't. Imagine you're an investigative journalist given a set of top-secret files by your source. You back them up with encryption and no one will know you have these files.

Now imagine you were unwise enough to enable compression, and now everyone else who also happens to have these files, just by judging the sizes of compressed-then-encrypted chunks, will know that you have these top-secret files in this archive without even needing to know the encryption key. This is very far from being secure. People can go to jail because of this "feature", get tortured, or worse.

there is no "secure" backup program

This then needs an update.

Fast, secure, efficient backup program

Also note secure-by-default.

restic stores only packed chunks, so the size of chunks is not evident to someone not having the keys.


For those who want to know more about these security concerns, there is a nice paper describing them: http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
In my understanding, there could be a flaw if the file is chunked and then compressed and encrypted. But if the file is compressed before chunking, it's a binary file like any other and those plaintext attacks become useless.

But if the file is compressed before chunking, it's a binary file like any other and those plaintext attacks become useless.

That is correct. But that won't exactly help with efficient deduplication, if I understand correctly, since a compression algorithm may use a different vocabulary for each version of a file, resulting in a very different binary output, which obviously won't deduplicate. Put otherwise, it only makes sense to compress the resulting chunks.

restic stores only packed chunks, so the size of chunks is not evident to someone not having the keys

That's a relief.

My point still stands: there are many ways one can add a hidden weakness to a program when it implements compression together with encryption, so it is best not to add one at all. Even the encryption _experts_ deciding about TLS chose to remove compression. I guess they had similar reasoning.

btw.:

However, it is important to note that these attacks have little security impact on, say, a bulk encryption application which compresses data before encrypting

...
Also, CRIME only works if you have multiple different versions of the encrypted files,
i.e. multiple backup runs (to different repositories, where the attacker obtained all of them),
and it also only works with a small amount of data.

CRIME does, but you don't. Imagine you're an investigative journalist given a set of top-secret files by your source. You back them up with encryption and no one will know you have these files.

Now imagine you were unwise enough to enable compression, and now everyone else who also happens to have these files, just by judging the sizes of compressed-then-encrypted chunks, will know that you have these top-secret files in this archive without even needing to know the encryption key. This is very far from being secure. People can go to jail because of this "feature", get tortured, or worse.

That is bullshit, because it only works with a small sample size. It can also still be possible to go to jail without compression; at some point in the future, when an attacker has obtained your backup files, he might be able to brute-force them.
There might be other security problems that appear in the future, etc...
The discussion has just turned into meaningless fearmongering.

@sanmai, I do not get this example with

Imagine you're an investigative journalist ... Now imagine you were unwise enough to enable compression, and now everyone else who also happens to have these files, just by judging the sizes of compressed-then-encrypted chunks, will know that you have these top-secret files in this archive without even needing to know the encryption key.

What is meant? That someone can guess that an encrypted snapshot has these files just by looking at the size? This assumes that the files are compressed alone, or together with other known files. But then the same guess can be made with an unencrypted snapshot.

Actually, how about gzipping files before backing them up? Does this open a security vulnerability too?

I think this example is plain nonsense: if you claim you can determine whether a snapshot contains compressed versions of some (arbitrary) files known to you, you could as well determine whether it contains those files uncompressed.

I do not believe compression can make encryption significantly less secure.

Most compression side-channel attacks involve several factors:
1) Attacker can control input
2) Attacker can observe the size of the output
3) Small changes to the input data result in measurable changes to the output size
4) Attacker can change the input and retry hundreds of thousands of times

Unlike web-based systems, in the vast majority of scenarios involving restic backups, (1) and (2) will rarely hold at the same time. Furthermore, for block-based compression (3) is not really guaranteed, and for most backup regimes (4) certainly does not hold. Because backup frequency is usually once a day or so, it would take thousands of years to be able to manipulate data and monitor the compressed output size to notice any significant differences, and that's assuming that no other data is changing, which in most cases it would be.

If you were making backups where the output size were visible then you might want to consider disabling compression. Otherwise, there are really no practical attacks against it and it wouldn't make it less secure to have it enabled.

restic already does de-duplication which exposes it to the same theoretical attacks as compression side-channels anyway, and nobody has complained about this to my knowledge.

The fact is, there are hundreds or thousands of users who would benefit from a compression feature with no down-sides whatsoever. Can we please just leave this 5-year old issue to the developers who are working on it?

To be honest... I prefer the concept of restic... but I ran tests for my use case (lots of CSV files and SQL dumps) and had to switch to borg.

I tested with four generations of incremental backups: my files get a compression ratio of 7:1, and together with deduplication I achieve > 20:1. I can't ignore that since, as already said, I pay for my online backup storage per GB.

root@xxxx:~# borg list
2019-08-08_14:37                     Thu, 2019-08-08 14:37:10 [5e113a8102f2bd7e40d100343f849dc73843d145011c7214d5fa0895927eb6d1]
2019-08-08_22:28                     Thu, 2019-08-08 22:28:21 [17d815d000212a576610b2fd5688ab87cce00039bb89f63722c6a7819dec1821]
2019-08-09_02:00                     Fri, 2019-08-09 02:00:23 [217c53b07f30dfbca584c49468cfa624a2445a005890220509c97715f7007e81]
2019-08-10_02:00                     Sat, 2019-08-10 02:00:10 [5dd45b8ccf0aa382bf00d5b08e1d5d88daae014f0a1a42b3e2b0fc368623bba0]
root@xxxx:~# borg info
Repository ID: xxxx
Location: ssh://xxxx
Encrypted: Yes (repokey)
Cache: /var/lib/borg/cache/xxxx
Security dir: /var/lib/borg/security/xxxx
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
All archives:               69.02 GB             11.24 GB              2.80 GB

                       Unique chunks         Total chunks
Chunk index:                    9227                41812

What is meant? That someone can _guess_ that an encrypted snapshot has these files just by looking at the size? This assumes that the files are compressed alone, or together with other known files. But then the same guess can be made with an unencrypted snapshot.

Exactly. Slice a plain text file into equal pieces, compress them, then encrypt. Slice again, compress and encrypt. Since the sizes of encrypted files do not change AES-wise, in both cases you would see a range of sizes that match each other like a fingerprint. They (and by they I mean mainly the administrations of oppressive regimes like Iran or Russia) can make a reasonable assumption that these files are present, which therefore gives them a reason, say, to continue to torture the suspect. I don't understand why y'all get so offended by these ideas, aren't they simple to understand? This ain't CRIME per se, is it?

But as noted before by @viric, technically restic is not affected by these vulnerabilities, since the sizes of chunks are not visible without an encryption key. Yet if compression is added at some point, restic may not be affected now but could become affected later.

Does adding compression expose Restic to any additional vulnerability, given that it already does deduplication?

If your concern is an attacker guessing at compressed block sizes to infer the uncompressed size, okay, but does compression make this worse? Wouldn't an attacker have the same basic information?

If an attacker could see the uncompressed AND compressed sizes of each file then identification might become more realistic, but this wouldn't be possible in restic.

Ultimately the de-duplication already exposes you to every theoretical attack that I can see compression having an impact on, plus of course compression can be disabled to maintain the current state of affairs if this fits your situation better.

I simply don't understand why you are discussing hypothetical security concerns about guessing a file's presence from the size of an encrypted chunk...

You Guys use ZIP or GZ? Then you should be fine.

You think that Iranian authorities can guess my content from sizes? Then just don't use compression(!). That simply does not mean that compression should not be available.

I think we've covered all relevant angles of adding compression to restic, thank you very much for all your input.

I think we should add compression and enable it by default, but allow users to disable compression. Please have patience until I have some more time to work on it.

It feels to me this discussion gets out of hand, so I'm locking this issue for now. If you like to continue this discussion, please head over to the forum. Thanks!
