Quick Links

Re: zstd compression for pg_dump

From:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	Jacob Champion <jchampion(at)timescale(dot)com>, pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject:	Re: zstd compression for pg_dump
Date:	2023-03-27 17:28:42
Message-ID:	ZCHSSv3DgxUa0f3L@telsasoft.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion(at)timescale(dot)com> wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> >
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it. Could you check that what I've
> > added works for your case ?
> >
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> >
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd. Nothing zstd does ought to affect the pg_dump layer. But that
> > doesn't apply to pg_dump under windows. This is an opened question. If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
>
> I may be missing something, but why would the patch affect this? Why
> would it even affect safety of the parallel dump? And I don't see any
> changes to the clone stuff ...

zstd supports using threads during compression, with -Z zstd:workers=N.
When unix forks, the child processes can't do anything to mess up the
state of the parent processes.

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread. I suspect there's no issue,
but we still ought to verify that before declaring it safe.

> > The function is first checking if it was passed a filename which already
> > has a suffix. And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix. The search with stat()
> > doesn't happen if it has a suffix. I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code. Another opened question.
>
> AFAICS it's done this way because of this comment in pg_backup_directory
>
> * ...
> * ".gz" suffix is added to the filenames. The TOC files are never
> * compressed by pg_dump, however they are accepted with the .gz suffix
> * too, in case the user has manually compressed them with 'gzip'.
>
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names. Maybe it worked differently
in an older release. Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".

--
Justin

Attachment	Content-Type	Size
0001-pg_dump-zstd-compression.patch	text/x-diff	29.9 KB
0002-TMP-pg_dump-use-Zstd-by-default-for-CI-only.patch	text/x-diff	2.2 KB
0003-zstd-support-long-distance-mode-in-pg_dump-basebacku.patch	text/x-diff	12.7 KB

In response to

Re: zstd compression for pg_dump at 2023-03-17 02:43:31 from Tomas Vondra

Responses

Re: zstd compression for pg_dump at 2023-03-28 16:23:26 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2023-03-27 17:43:07	Re: HOT chain validation in verify_heapam()
Previous Message	Andres Freund	2023-03-27 17:18:50	Re: meson/msys2 fails with plperl/Strawberry