Re: zstd compression for pg_dump

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Jacob Champion <jchampion(at)timescale(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, gkokolatos(at)pm(dot)me, Michael Paquier <michael(at)paquier(dot)xyz>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: zstd compression for pg_dump
Date: 2023-03-16 04:50:15
Message-ID: ZBKgBz+4blcKru6x@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion(at)timescale(dot)com> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
>
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it. Could you check that what I've
added works for your case ?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
>
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd. Nothing zstd does ought to affect the pg_dump layer. But that
doesn't apply to pg_dump under windows. This is an opened question. If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).

> On to code (not a complete review):
>
> > if (hasSuffix(fname, ".gz"))
> > compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > else
> > {
> > bool exists;
> >
> > exists = (stat(path, &st) == 0);
> > /* avoid unused warning if it is not built with compression */
> > if (exists)
> > compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > - if (!exists)
> > - {
> > - free_keep_errno(fname);
> > - fname = psprintf("%s.gz", path);
> > - exists = (stat(fname, &st) == 0);
> > -
> > - if (exists)
> > - compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > - }
> > -#endif
> > -#ifdef USE_LZ4
> > - if (!exists)
> > - {
> > - free_keep_errno(fname);
> > - fname = psprintf("%s.lz4", path);
> > - exists = (stat(fname, &st) == 0);
> > -
> > - if (exists)
> > - compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > - }
> > -#endif
> > + else if (check_compressed_file(path, &fname, "gz"))
> > + compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > + else if (check_compressed_file(path, &fname, "lz4"))
> > + compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > + else if (check_compressed_file(path, &fname, "zst"))
> > + compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> > }
>
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix. And if not, it searches through a list of suffixes,
testing for an existing file with each suffix. The search with stat()
doesn't happen if it has a suffix. I'm having trouble seeing how the
hasSuffix() branch isn't dead code. Another opened question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
>
> > - errmsg = supports_compression(AH->compression_spec);
> > - if (errmsg)
> > + parse_compress_specification(AH->compression_spec.algorithm,
> > + NULL, &compress_spec);
> > + if (compress_spec.parse_error != NULL)
> > {
> > pg_log_warning("archive is compressed, but this installation does not support compression (%s
> > - errmsg);
> > - pg_free(errmsg);
> > + compress_spec.parse_error);
> > + pg_free(compress_spec.parse_error);
> > }
>
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported. Maybe there ought to be a more
direct/clean way to do it. But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > + ZSTD_cParameter param, int value, char *paramname)
>
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > + if (cs->readF != NULL)
> > +
> > + if (cs->writeF != NULL)
>
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets reallocated.

I put back an assertion that exactly one of them was set, since that's
true of how it currently works.

> I was curious about the extra byte allocated in the decompression case.
> I see that ReadDataFromArchiveZstd() is null-terminating the buffer
> before handing it to ahwrite(), but why does it need to do that?

I was trying to figure that out, too. I think the unterminated case
might be for ExecuteSqlCommandBuf(), and that may only (have) been
needed to allow pg_restore to handle ancient/development versions of
pg_dump... It's not currently hit.
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_db.c.gcov.html#470

I found that the terminator was added for the uncompressed case was
added at e8f69be05 and removed in bf9aa490d.

> > +Zstd_get_error(CompressFileHandle *CFH)
>
> Seems like this should be using the zstderror stored in the handle?

Yes - I'd already addressed that locally.

> In ReadDataFromArchiveZstd():
>
> > + * Read compressed data. Note that readF can resize the buffer; the
> > + * new size is tracked and used for future loops.
> This is pretty complex for what it's doing. I'm a little worried that we
> let the reallocated buffer escape to the caller while losing track of
> how big it is. I think that works today, since there's only ever one
> call per handle, but any future refactoring that allowed cs->readData()
> to be called more than once would subtly break this code.

Note that nothing bad happens if we lose track of how big it is (well,
assuming that readF doesn't *shrink* the buffer).

The previous patch version didn't keep track of its new size, and the only
consequence is that readF() might re-resize it again on a future iteration,
even if it was already sufficiently large.

When I originally wrote it (and up until that patch version), I left
this as an XXX comment about reusing the resized buffer. But it seemed
easy enough to fix so I did.

> In ZstdWriteCommon():
>
> Elsewhere, output->pos is set to zero before compressing, but here we do
> it after, which I think leads to subtle differences in the function
> preconditions. If that's an intentional difference, can the reason be
> called out in a comment?

It's not deliberate. I think it had no effect, but changed - thanks.

--
Justin

Attachment Content-Type Size
0001-pg_dump-zstd-compression.patch text/x-diff 30.9 KB
0002-TMP-pg_dump-use-Zstd-by-default-for-CI-only.patch text/x-diff 2.2 KB
0003-zstd-support-long-distance-mode-in-pg_dump-basebacku.patch text/x-diff 13.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-03-16 05:10:05 Re: Add macros for ReorderBufferTXN toptxn
Previous Message Amit Kapila 2023-03-16 04:30:24 Re: Simplify some codes in pgoutput