Re: Add LZ4 compression in pg_dump

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: gkokolatos(at)pm(dot)me
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, shiy(dot)fnst(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Rachel Heaton <rachelmheaton(at)gmail(dot)com>
Subject: Re: Add LZ4 compression in pg_dump
Date: 2023-02-27 04:49:10
Message-ID: 20230227044910.GO1653@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression. The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.
>
> One more - WriteDataToArchiveGzip() says:

One more again.

The LZ4 path is using non-streaming mode, which compresses each block
without persistent state, giving poor compression for -Fc compared with
-Fp. If the data is highly compressible, the difference can be orders
of magnitude.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
12351763
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
21890708

That's not true for gzip:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
2118869
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
2115832

The function ought to at least use streaming mode, so each block/row
isn't compressioned in isolation. 003 is a simple patch to use
streaming mode, which improves the -Fc case:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
15178283

However, that still flushes the compression buffer, writing a block
header, for every row. With a single-column table, pg_dump -Fc -Z lz4
still outputs ~10% *more* data than with no compression at all. And
that's for compressible data.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
12890296
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
11890296

I think this should use the LZ4F API with frames, which are buffered to
avoid outputting a header for every single row. The LZ4F format isn't
compatible with the LZ4 format, so (unlike changing to the streaming
API) that's not something we can change in a bugfix release. I consider
this an Opened Item.

With the LZ4F API in 004, -Fp and -Fc are essentially the same size
(like gzip). (Oh, and the output is three times smaller, too.)

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
4155448
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
4156548

--
Justin

Attachment Content-Type Size
0001-f-fixes-for-LZ4.patch text/x-diff 3.2 KB
0002-f-fixes-for-LZ4-which-also-conflict-with-the-ZSTD-pa.patch text/x-diff 2.5 KB
0003-pg_dump-lz4-use-lz4-streaming-compression.patch text/x-diff 1.7 KB
0004-WIP-change-to-use-LZ4-frame-API.patch text/x-diff 5.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2023-02-27 04:55:45 Re: psql \watch 2nd argument: iteration count
Previous Message Pavel Stehule 2023-02-27 04:45:04 Re: Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)