Re: zstd compression for pg_dump

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Jacob Champion <jchampion(at)timescale(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: zstd compression for pg_dump
Date: 2023-03-17 02:43:31
Message-ID: 0115b682-58f0-ea43-9ecd-c37a4495812e@enterprisedb.com
Lists: pgsql-hackers

On 3/16/23 05:50, Justin Pryzby wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion(at)timescale(dot)com> wrote:
>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>> build against it, I had to construct an import library, and put that
>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>> which makes me wonder if I've chosen a harder way than necessary?
>>
>> It looks like pg_dump's meson.build is missing dependencies on zstd
>> (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it. Could you check that what I've
> added works for your case ?
>
>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>> pg_restore output is identical to uncompressed dumps and nothing
>>> explodes. I haven't inspected the threading implementation for safety
>>> yet, as you mentioned.
>>
>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>> handling safety for this, by isolating each thread's state. I don't feel
>> comfortable pronouncing this new addition safe or not, because I'm not
>> sure I understand what the comments in the format-specific _Clone()
>> callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd. Nothing zstd does ought to affect the pg_dump layer. But that
> doesn't apply to pg_dump under windows. This is an open question. If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).
>

I may be missing something, but why would the patch affect this? Why
would it even affect safety of the parallel dump? And I don't see any
changes to the clone stuff ...

>> On to code (not a complete review):
>>
>>> if (hasSuffix(fname, ".gz"))
>>> compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> else
>>> {
>>> bool exists;
>>>
>>> exists = (stat(path, &st) == 0);
>>> /* avoid unused warning if it is not built with compression */
>>> if (exists)
>>> compression_spec.algorithm = PG_COMPRESSION_NONE;
>>> -#ifdef HAVE_LIBZ
>>> - if (!exists)
>>> - {
>>> - free_keep_errno(fname);
>>> - fname = psprintf("%s.gz", path);
>>> - exists = (stat(fname, &st) == 0);
>>> -
>>> - if (exists)
>>> - compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> - }
>>> -#endif
>>> -#ifdef USE_LZ4
>>> - if (!exists)
>>> - {
>>> - free_keep_errno(fname);
>>> - fname = psprintf("%s.lz4", path);
>>> - exists = (stat(fname, &st) == 0);
>>> -
>>> - if (exists)
>>> - compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> - }
>>> -#endif
>>> + else if (check_compressed_file(path, &fname, "gz"))
>>> + compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> + else if (check_compressed_file(path, &fname, "lz4"))
>>> + compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> + else if (check_compressed_file(path, &fname, "zst"))
>>> + compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>>> }
>>
>> This function lost some coherence, I think. Should there be a hasSuffix
>> check at the top for ".zstd" (and, for that matter, ".lz4")?
>

This was discussed in the lz4 thread a couple of days ago, and I think
there should be hasSuffix() cases for lz4/zstd too, not just for .gz.
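As a standalone sketch (not the actual pg_dump code; the scaffolding
around hasSuffix() is assumed), symmetric suffix handling for all three
algorithms could look like this:

```c
#include <stdbool.h>
#include <string.h>

/*
 * hasSuffix() mirrors pg_dump's helper of the same name; the enum and
 * dispatch function are illustrative scaffolding only.
 */
static bool
hasSuffix(const char *filename, const char *suffix)
{
	size_t		filenamelen = strlen(filename);
	size_t		suffixlen = strlen(suffix);

	if (filenamelen < suffixlen)
		return false;
	return memcmp(filename + filenamelen - suffixlen,
				  suffix, suffixlen) == 0;
}

typedef enum
{
	COMPR_NONE,
	COMPR_GZIP,
	COMPR_LZ4,
	COMPR_ZSTD
} compr_algorithm;

/* Map a filename's suffix to an algorithm, covering all three cases. */
static compr_algorithm
algorithm_from_suffix(const char *fname)
{
	if (hasSuffix(fname, ".gz"))
		return COMPR_GZIP;
	if (hasSuffix(fname, ".lz4"))
		return COMPR_LZ4;
	if (hasSuffix(fname, ".zst"))
		return COMPR_ZSTD;
	return COMPR_NONE;
}
```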

> The function is first checking if it was passed a filename which already
> has a suffix. And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix. The search with stat()
> doesn't happen if it has a suffix. I'm having trouble seeing how the
> hasSuffix() branch isn't dead code. Another open question.
>

AFAICS it's done this way because of this comment in pg_backup_directory

* ...
* ".gz" suffix is added to the filenames. The TOC files are never
* compressed by pg_dump, however they are accepted with the .gz suffix
* too, in case the user has manually compressed them with 'gzip'.

I haven't tried, but I believe that if you manually compress the
directory, it may hit this branch. And IMO if we support that for gzip,
the other compression methods should do that too for consistency.

In any case, it's a tiny amount of code and I don't feel like ripping
that out when it might break some currently supported use case.

>> I'm a little suspicious of the replacement of supports_compression()
>> with parse_compress_specification(). For example:
>>
>>> - errmsg = supports_compression(AH->compression_spec);
>>> - if (errmsg)
>>> + parse_compress_specification(AH->compression_spec.algorithm,
>>> + NULL, &compress_spec);
>>> + if (compress_spec.parse_error != NULL)
>>> {
>>> pg_log_warning("archive is compressed, but this installation does not support compression (%s
>>> - errmsg);
>>> - pg_free(errmsg);
>>> + compress_spec.parse_error);
>>> + pg_free(compress_spec.parse_error);
>>> }
>>
>> The top-level error here is "does not support compression", but wouldn't
>> a bad specification option with a supported compression method trip this
>> path too?
>
> No - since the 2nd argument is passed as NULL, it just checks whether
> the compression is supported. Maybe there ought to be a more
> direct/clean way to do it. But up to now evidently nobody needed to do
> that.
>

I don't think the patch can use parse_compress_specification() in place
of supports_compression(). The parsing simply determines whether the
build has the library; it doesn't say whether a particular tool has been
modified to support the algorithm. I might build --with-zstd and yet
pg_dump might not support that algorithm yet.

Even after we add zstd to pg_dump, it's quite likely other compression
algorithms may not be supported by pg_dump from day 1.
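The distinction argued for above would be a tool-level capability check,
separate from the build-level parse. A hedged sketch (names invented
here, not pg_dump's actual API):

```c
#include <stddef.h>

/*
 * Hypothetical tool-level check: an algorithm can be compiled into the
 * build (so parse_compress_specification() accepts it) and still be
 * unsupported by a particular tool such as pg_dump.
 */
typedef enum
{
	PG_COMPRESSION_NONE,
	PG_COMPRESSION_GZIP,
	PG_COMPRESSION_LZ4,
	PG_COMPRESSION_ZSTD,
	PG_COMPRESSION_FUTURE		/* stand-in for a not-yet-wired algorithm */
} pg_compress_algorithm;

/* Returns NULL if supported, otherwise an error string. */
static const char *
pg_dump_supports_compression(pg_compress_algorithm algo)
{
	switch (algo)
	{
		case PG_COMPRESSION_NONE:
		case PG_COMPRESSION_GZIP:
		case PG_COMPRESSION_LZ4:
		case PG_COMPRESSION_ZSTD:
			return NULL;
		default:
			return "compression algorithm not supported by pg_dump";
	}
}
```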

I haven't looked at / tested the patch yet, but I wonder if you have any
thoughts regarding the size_t / int tweaks. I don't know what types the
zstd library uses, how it reports errors, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
