Re: Add LZ4 compression in pg_dump

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, gkokolatos(at)pm(dot)me
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, shiy(dot)fnst(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Rachel Heaton <rachelmheaton(at)gmail(dot)com>
Subject: Re: Add LZ4 compression in pg_dump
Date: 2023-03-01 16:39:54
Message-ID: 09b37949-cfe5-29bb-cbb1-498ee5700b61@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/27/23 05:49, Justin Pryzby wrote:
> On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
>> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
>>> I have some fixes (attached) and questions while polishing the patch for
>>> zstd compression. The fixes are small and could be integrated with the
>>> patch for zstd, but could be applied independently.
>>
>> One more - WriteDataToArchiveGzip() says:
>
> One more again.
>
> The LZ4 path is using non-streaming mode, which compresses each block
> without persistent state, giving poor compression for -Fc compared with
> -Fp. If the data is highly compressible, the difference can be orders
> of magnitude.
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
> 12351763
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 21890708
>
> That's not true for gzip:
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
> 2118869
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
> 2115832
>
> The function ought to at least use streaming mode, so each block/row
> isn't compressioned in isolation. 003 is a simple patch to use
> streaming mode, which improves the -Fc case:
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
> 15178283
>
> However, that still flushes the compression buffer, writing a block
> header, for every row. With a single-column table, pg_dump -Fc -Z lz4
> still outputs ~10% *more* data than with no compression at all. And
> that's for compressible data.
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
> 12890296
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
> 11890296
>
> I think this should use the LZ4F API with frames, which are buffered to
> avoid outputting a header for every single row. The LZ4F format isn't
> compatible with the LZ4 format, so (unlike changing to the streaming
> API) that's not something we can change in a bugfix release. I consider
> this an Opened Item.
>
> With the LZ4F API in 004, -Fp and -Fc are essentially the same size
> (like gzip). (Oh, and the output is three times smaller, too.)
>
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
> 4155448
> $ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
> 4156548
>

Thanks. Those are definitely interesting improvements/optimizations!

I suggest we track them as a separate patch series - please add them to
the CF app (I guess you'll have to add them to 2023-07 at this point,
but we can get them in, I think).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Jones 2023-03-01 16:55:48 Re: Proposal: %T Prompt parameter for psql for current time (like Oracle has)
Previous Message vignesh C 2023-03-01 16:29:13 Re: Time delayed LR (WAS Re: logical replication restrictions)