Re: zstd compression for pg_dump

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Jacob Champion <jchampion(at)timescale(dot)com>, pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: zstd compression for pg_dump
Date: 2023-03-28 16:23:26
Message-ID: 0022dd9e-a439-d72f-d09b-ec512c23949b@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion(at)timescale(dot)com> wrote:
>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>> build against it, I had to construct an import library, and put that
>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>
>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>> (meson couldn't find the headers in the subproject without them).
>>>
>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>> I didn't run into an issue without it. Could you check that what I've
>>> added works for your case ?
>>>
>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>> yet, as you mentioned.
>>>>
>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>> sure I understand what the comments in the format-specific _Clone()
>>>> callbacks are saying yet.
>>>
>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that
>>> doesn't apply to pg_dump under windows. This is an opened question. If
>>> there's no solid answer, I could disable/ignore the option (maybe only
>>> under windows).
>>
>> I may be missing something, but why would the patch affect this? Why
>> would it even affect safety of the parallel dump? And I don't see any
>> changes to the clone stuff ...
>
> zstd supports using threads during compression, with -Z zstd:workers=N.
> When unix forks, the child processes can't do anything to mess up the
> state of the parent processes.
>
> But windows pg_dump uses threads instead of forking, so it seems
> possible that the pg_dump -j threads that then spawn zstd threads could
> "leak threads" and break the main thread. I suspect there's no issue,
> but we still ought to verify that before declaring it safe.
>

OK. I don't have access to a Windows machine so I can't test that. Is it
possible to disable the zstd threading, until we figure this out?

>>> The function is first checking if it was passed a filename which already
>>> has a suffix. And if not, it searches through a list of suffixes,
>>> testing for an existing file with each suffix. The search with stat()
>>> doesn't happen if it has a suffix. I'm having trouble seeing how the
>>> hasSuffix() branch isn't dead code. Another opened question.
>>
>> AFAICS it's done this way because of this comment in pg_backup_directory
>>
>> * ...
>> * ".gz" suffix is added to the filenames. The TOC files are never
>> * compressed by pg_dump, however they are accepted with the .gz suffix
>> * too, in case the user has manually compressed them with 'gzip'.
>>
>> I haven't tried, but I believe that if you manually compress the
>> directory, it may hit this branch.
>
> That would make sense, but when I tried, it didn't work like that.
> The filenames were all uncompressed names. Maybe it worked differently
> in an older release. Or maybe it changed during development of the
> parallel-directory-dump patch and it's actually dead code.
>

Interesting. Would be good to find out. I wonder if a little bit of
git-log digging could tell us more. Anyway, until we confirm it's dead
code, we should probably do what .gz does and have the same check for
.lz4 and .zst files.

> This is rebased over the updated compression API.
>
> It seems like I misunderstood something you said before, so now I put
> back "supports_compression()".
>

Thanks! I need to do a bit more testing and review, but it seems pretty
much RFC to me, assuming we can figure out what to do about threading.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-03-28 16:34:39 Re: logical decoding and replication of sequences, take 2
Previous Message Robert Haas 2023-03-28 16:13:27 Re: running logical replication as the subscription owner