Re: zstd compression for pg_dump

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Jacob Champion <jchampion(at)timescale(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: zstd compression for pg_dump
Date: 2023-03-03 18:55:46
Message-ID: 20230303185546.GE12850@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote:
> On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > This resolves cfbot warnings: windows and cppcheck.
> > And refactors zstd routines.
> > And updates docs.
> > And includes some fixes for earlier patches that these patches conflicts
> > with/depends on.
>
> This'll need a rebase (cfbot took a while to catch up).

Soon.

> The patchset includes basebackup modifications, which are part of a
> different CF entry; was that intended?

Yes, it's intentional - if zstd:long mode were to be merged first, then
this patch should include long mode from the start.
Or, if pgdump+zstd were merged first, then long mode could be added to
both places.

> I tried this on a local, 3.5GB, mostly-text table (from the UK Price

Thanks for looking. If your zstd library is compiled with thread
support, could you also try with :workers=N ? I believe this is working
correctly, but I'm going to ask for help verifying that...

It'd be especially useful to test under windows, where pgdump/restore
use threads instead of forking... If you have a windows environment but
not set up for development, I think it's possible to get cirrusci to
compile a patch for you and then retrieve the binaries provided as an
"artifact" (credit/blame for this idea should be directed to Thomas
Munro).

> With this particular dataset, I don't see much improvement with
> zstd:long.

Yeah. I this could be because either 1) you already got very good
comprssion without looking at more data; and/or 2) the neighboring data
is already very similar, maybe equally or more similar, than the further
data, from which there's nothing to gain.

> (At nearly double the CPU time, I get a <1% improvement in
> compression size.) I assume it's heavily data dependent, but from the
> notes on --long [2] it seems like they expect you to play around with
> the window size to further tailor it to your data. Does it make sense
> to provide the long option without the windowLog parameter?

I don't want to start exposing lots of fine-granined parameters at this
point. In the immediate case, it looks like it may require more than
just adding another parameter:

Note: If windowLog is set to larger than 27,
--long=windowLog or --memory=windowSize needs to be passed to the
decompressor.

--
Justin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeroen Vermeulen 2023-03-03 19:30:37 Re: libpq: PQgetCopyData() and allocation overhead
Previous Message Tom Lane 2023-03-03 18:53:30 Re: libpq: PQgetCopyData() and allocation overhead