Re: zstd compression for pg_dump

From: Jacob Champion <jchampion(at)timescale(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: zstd compression for pg_dump
Date: 2023-03-03 18:32:53
Message-ID: CAAWbhmgEpYPn2sjxr0Kar_XKS4dzyEnussd0emwe2YHxb_tk6g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflict
> with/depend on.

This'll need a rebase (cfbot took a while to catch up). The patchset
includes basebackup modifications, which are part of a different CF
entry; was that intended?

I tried this on a local, 3.5GB, mostly-text table (from the UK Price
Paid dataset [1]) and the comparison against the other methods was
impressive. (I'm no good at constructing compression benchmarks, so
this is a super naive setup. Client's on the same laptop as the
server.)

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump
real 1m17.632s
user 0m35.521s
sys 0m2.683s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump
real 1m13.125s
user 0m19.795s
sys 0m3.370s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump
real 2m24.523s
user 2m22.114s
sys 0m1.848s

$ ls -l /tmp/*.dump
-rw-rw-r-- 1 jacob jacob 1331493925 Mar 3 09:45 /tmp/gzip.dump
-rw-rw-r-- 1 jacob jacob 2125998939 Mar 3 09:42 /tmp/lz4.dump
-rw-rw-r-- 1 jacob jacob 1215834718 Mar 3 09:40 /tmp/zstd.dump

Default gzip was the only method that bottlenecked on pg_dump rather
than the server, and default zstd outcompressed it at a fraction of
the CPU time. So, naively, this looks really good.

With this particular dataset, I don't see much improvement with
zstd:long. (At nearly double the CPU time, I get a <1% improvement in
compressed size.) I assume it's heavily data-dependent, but from the
notes on --long [2] it seems like they expect you to play around with
the window size to further tailor it to your data. Does it make sense
to provide the long option without the windowLog parameter?
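
For context, here's a rough sketch (mine, not from the patch) of how the
window size could be passed through libzstd's advanced API if we did want
to expose it. ZSTD_c_enableLongDistanceMatching and ZSTD_c_windowLog are
the real parameter names (libzstd >= 1.4.0); the helper and its wiring into
pg_dump are hypothetical, and error checks are elided:

#include <zstd.h>

/* Hypothetical helper: level from -Z, window_log <= 0 means "use default". */
static ZSTD_CCtx *
make_long_mode_cctx(int level, int window_log)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();

    if (cctx == NULL)
        return NULL;

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);

    /* Long-distance matching, i.e. what the zstd CLI calls --long. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);

    /*
     * If the user gave an explicit windowLog, pass it through; otherwise
     * let zstd pick its default window for long mode.
     */
    if (window_log > 0)
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, window_log);

    return cctx;
}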

Thanks,
--Jacob

[1] https://landregistry.data.gov.uk/
[2] https://github.com/facebook/zstd/releases/tag/v1.3.2
