Zstd: Better compression ratio if compression context periodically discarded

Created on 8 Jul 2019  ·  3 Comments  ·  Source: facebook/zstd

I am consistently getting a slightly better compression ratio when the compression context is not re-used.
I create a ZSTD compression context, then call ZSTD_compressCCtx in a loop, each time passing a 1MB buffer of data. At the end of the process, the compression context is freed.
If instead I free the compression context and create a new one before compressing the next 1MB buffer, the output file size is consistently around 1% smaller.
Another interesting fact: the compression ratio is around 1 - 1.5% better when I use 2MB input buffers instead of 1MB input buffers.
In my use case I am not limited by memory resources.
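Roughly, the two variants look like this (simplified sketch, error handling omitted; the function names are just for illustration):

```c
#include <stddef.h>
#include <zstd.h>

/* Variant A: one long-lived context, re-used for every chunk. */
size_t compress_chunk_reuse(ZSTD_CCtx *cctx, void *dst, size_t dstCapacity,
                            const void *src, size_t srcSize, int level)
{
    return ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
}

/* Variant B: fresh context per chunk (the variant that produces ~1% smaller output). */
size_t compress_chunk_fresh(void *dst, size_t dstCapacity,
                            const void *src, size_t srcSize, int level)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t const csize = ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
    ZSTD_freeCCtx(cctx);
    return csize;
}
```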
Question(s)

  • Is it better practice to discard the compression context between compressing large chunks of data?

  • What is the recommended optimal input buffer size (i.e. the point where decreasing the buffer size degrades the compression ratio, while increasing it no longer improves it)?

  • Is there any way to tell zstd "use as much memory as you want, but give me a better compression ratio and/or speed"?

  • Is streaming compression with a context really only good for memory-constrained use cases? If I have plenty of memory, am I better off independently compressing large (>1MB) buffers?

question

All 3 comments

Hi @scherepanov ,

this outcome is surprising.
Using ZSTD_compressCCtx() with the same input and the same compression level, it should not matter (from a compression ratio perspective) whether the context is re-used or not. The only impact of re-using the context is to save allocation and initialization time, that's it. If it impacts compression ratio, that's strange, and probably wrong.

I would like to reproduce this scenario if that's possible. Which version are you using?

Is it better practice to discard compression context between compressing large chunks of data?

You should never need to discard a context.
The only "good" reason to do so is to simplify the code.
From a performance perspective, re-using a context is only beneficial; there is no downside.

What is recommended optimal input buffer size

This is highly situational. There is no "universal" threshold.
Generally speaking, beyond 8x window size, increasing chunk size is less and less valuable.
Window size, though, is a dynamic value, depending on compression level.
It varies from 512 KB (level 1) to 8 MB (level 19).
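As an illustration (assuming zstd >= 1.4.0, where the ZSTD_CCtx_setParameter() API is stable), the window size can also be pinned explicitly rather than derived from the level:

```c
#include <stddef.h>
#include <zstd.h>

/* Sketch: request a specific window size instead of relying on the level's default.
   2^23 = 8 MB window. Error checks omitted for brevity. */
size_t compress_with_8mb_window(void *dst, size_t dstCapacity,
                                const void *src, size_t srcSize)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 23);
    size_t const csize = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return csize;
}
```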

Any way to tell zstd "use as much memory as you want but give me better compression ratio"

Level 19 is supposed to be of this kind.

and/or speed"

Level 4 is generally of this kind: it compresses pretty fast, but uses an outsized amount of memory. That's the closest I can think of.

Is it really streaming compression with context good only for memory-constrained use cases? If I have plenty of memory, I am better off with independently compressing large (>1MB) buffers?

Compressing / decompressing independent chunks in a single pass (ZSTD_compressCCtx() and ZSTD_decompressDCtx()) is just plain simpler, and likely as efficient as it can be. If you can do it, it's preferable. The streaming mode adds a lot of complexity on top of that. The complexity is mostly internal and hidden, but the main point is that it can't be better or faster than straightforward one-pass compression or decompression.
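A minimal sketch of that one-pass pattern (error handling reduced to asserts, and the whole chunk is assumed to fit in memory):

```c
#include <assert.h>
#include <stdlib.h>
#include <zstd.h>

/* One-pass round trip for a single independent chunk. */
void roundtrip_one_pass(const void *src, size_t srcSize, int level)
{
    size_t const dstCapacity = ZSTD_compressBound(srcSize);
    void *dst = malloc(dstCapacity);
    void *rec = malloc(srcSize);

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_DCtx *dctx = ZSTD_createDCtx();

    size_t const csize = ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, level);
    assert(!ZSTD_isError(csize));

    size_t const rsize = ZSTD_decompressDCtx(dctx, rec, srcSize, dst, csize);
    assert(rsize == srcSize);

    ZSTD_freeCCtx(cctx); ZSTD_freeDCtx(dctx);
    free(dst); free(rec);
}
```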

Thanks for a very clarifying answer.
I traced the different compression ratio to a different ordering of my data. Yes, re-using the context vs discarding it makes no difference, exactly as you said. Sorry, I should have been more careful and investigated more before filing questions.
Your comments are very clear and explain a lot. I think they really need to be added to the docs, especially the part about the difference between streaming and non-streaming. I had always thought that streaming is more efficient, as you can build the dictionary better (though it is not clear how you would modify the dictionary when the data changes further down in a file). A clear understanding that streaming is pretty much the same as "block-based" compression is very important.
On the other hand, streaming may be more efficient, as it handles the chunk size automatically. I am using a 1MB chunk size with the default compression level 3, and that seems to be insufficient to get better compression. From this point of view, streaming can be more efficient on compression ratio, as it will determine the chunk size more optimally. (Is this correct???)

streaming is pretty much same as "block-based" compression

It's not exactly "the same".

If you cut input data into chunks, and pass them independently to ZSTD_compressCCtx(), you end up with multiple independent compressed chunks. Each compressed chunk is an independent _frame_. They can be decompressed in any order, because each frame is independent.

If you send the same data into a single stream, with ZSTD_compressStream(), without chunking, you end up with a _single frame_. Internally, the frame is cut into blocks, yes, but that doesn't matter, because blocks are not independent. To decode any part of the frame, it's necessary to decode everything from the beginning.
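For comparison, a sketch of the streaming path producing a single frame (this uses the newer ZSTD_compressStream2() API, zstd >= 1.4.0; error handling omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Sketch: compress an entire FILE* into one frame with the streaming API. */
void compress_file_single_frame(FILE *fin, FILE *fout, int level)
{
    size_t const inCap  = ZSTD_CStreamInSize();   /* recommended buffer sizes */
    size_t const outCap = ZSTD_CStreamOutSize();
    void *inBuf  = malloc(inCap);
    void *outBuf = malloc(outCap);

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);

    for (;;) {
        size_t const readSize = fread(inBuf, 1, inCap, fin);
        int const lastChunk = (readSize < inCap);
        ZSTD_EndDirective const mode = lastChunk ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = { inBuf, readSize, 0 };
        int finished;
        do {
            ZSTD_outBuffer output = { outBuf, outCap, 0 };
            size_t const remaining = ZSTD_compressStream2(cctx, &output, &input, mode);
            fwrite(outBuf, 1, output.pos, fout);
            /* keep going until all input is consumed, and the frame is fully flushed at the end */
            finished = lastChunk ? (remaining == 0) : (input.pos == input.size);
        } while (!finished);
        if (lastChunk) break;
    }

    ZSTD_freeCCtx(cctx);
    free(inBuf); free(outBuf);
}
```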

In theory, a single frame should compress better than multiple independent frames. That's because cutting the data into multiple independent chunks loses some compression opportunities at the beginning of each chunk.
However, fast modes are merely "probabilistic" compressors, which make hasty bets in order to run fast. Not all opportunities are equal, and sometimes, selecting one opportunity just masks a later better opportunity. This is very data specific.
Therefore, in some rare cases, it may happen that cutting data into independent chunks ends up being competitive with a single stream.
But I wouldn't bet on that. In most cases, a single stream should win, if only by very little.
