From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitrios Apostolou <jimis(at)gmx(dot)net>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date: 2025-10-10 20:26:57
Message-ID: 3515357.1760128017@sss.pgh.pa.us
Lists: pgsql-hackers

Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
> Question that remains: where is pg_dump setting this ~35B length block?

I poked into that question, and found that the answer is some
exceedingly brain-dead buffering logic in compress_zstd.c.
It will dump its buffer during every loop iteration within
_ZstdWriteCommon, no matter how much buffer space it has left;
and each call to cs->writeF() produces a new "data block"
in the output file. The amount of data fed to _ZstdWriteCommon
per call is whatever the backend sends per "copy data" message,
which is generally one table row. So if the table rows aren't
too wide, or if they're highly compressible, you get these
tiny data blocks.
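
To put some numbers on that, here is a throwaway standalone program
(purely illustrative, not pg_dump code: the function names, the
35-byte row, and the 4096-byte buffer standing in for
DEFAULT_IO_BUFFER_SIZE are all made up).  It does no compression at
all; it only counts how many data blocks a flush-per-row writer and a
flush-when-full writer would emit for the same input:

#include <stdio.h>
#include <string.h>

#define BUFSZ 4096                  /* stand-in for DEFAULT_IO_BUFFER_SIZE */

static char buf[BUFSZ];
static size_t used = 0;
static long blocks_per_row = 0;     /* one block per write call */
static long blocks_buffered = 0;    /* one block per full buffer */

/* Broken pattern: every call emits its own (possibly tiny) data block */
static void
write_per_row(const char *row, size_t len)
{
    (void) row;
    (void) len;
    blocks_per_row++;
}

/* Fixed pattern: accumulate rows, emit a block only when the buffer fills */
static void
write_buffered(const char *row, size_t len)
{
    if (used + len > BUFSZ)
    {
        blocks_buffered++;
        used = 0;
    }
    memcpy(buf + used, row, len);
    used += len;
}

int
main(void)
{
    const char row[35] = "some short, compressible row data";

    for (long i = 0; i < 1000000; i++)
    {
        write_per_row(row, sizeof(row));
        write_buffered(row, sizeof(row));
    }
    if (used > 0)
        blocks_buffered++;          /* final partial block */

    printf("per-row blocks: %ld, buffered blocks: %ld\n",
           blocks_per_row, blocks_buffered);
    return 0;
}

That reports a million blocks for the per-row strategy versus well
under ten thousand for the buffered one: one block per ~35-byte row
versus one per ~4K of data.
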
compress_lz4.c is equally broken, again buffering no bytes across
calls; although liblz4 seems to do some buffering internally.
I got blocks of around 300 bytes on the test case I was using.
That's still ridiculous.

compress_gzip.c is actually sanely implemented, and consistently
produces blocks of 4096 bytes, which traces to DEFAULT_IO_BUFFER_SIZE
in compress_io.h.

If you choose --compress=none, you get data blocks that correspond
exactly to table rows. We could imagine doing some internal
buffering to amalgamate short rows into larger blocks, but I'm
not entirely convinced it's worth messing with that case.

The attached patch fixes the buffering logic in compress_zstd.c
and compress_lz4.c. For zstd, most blocks are now 131591 bytes,
which seems to be determined by ZSTD_CStreamOutSize() not by
our code. For lz4, I see a range of block sizes but they're
almost all around 64K. That's apparently emergent from the
behavior of LZ4F_compressBound(): when told we want to supply
it up to 4K at a time, it says it needs a buffer around 64K.
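
In outline, the fixed zstd write path looks about like this (a
simplified sketch rather than the patch itself: the names are
placeholders, the output buffer is assumed to have been allocated
with ZSTD_CStreamOutSize(), and error checking plus the final
ZSTD_e_end flush at end-of-data are omitted):

#include <zstd.h>

/*
 * Compress one input chunk, handing a data block to writeF() only when
 * the ZSTD_CStreamOutSize()-sized output buffer is actually full, so
 * compressed output is carried across calls instead of being dumped
 * per row.
 */
static void
zstd_write_chunk(ZSTD_CCtx *cctx, ZSTD_outBuffer *out,
                 const void *data, size_t len,
                 void (*writeF) (const void *buf, size_t buflen))
{
    ZSTD_inBuffer in = {data, len, 0};

    while (in.pos < in.size)
    {
        (void) ZSTD_compressStream2(cctx, out, &in, ZSTD_e_continue);

        if (out->pos == out->size)      /* buffer full: emit one block */
        {
            writeF(out->dst, out->pos);
            out->pos = 0;
        }
    }
}
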
I'm tempted to increase DEFAULT_IO_BUFFER_SIZE so that gzip
also produces blocks in the vicinity of 64K, but we'd have
to decouple the behavior of compress_lz4.c somehow, or it
would want to produce blocks around a megabyte which might
be excessive. (Or if it's not, we'd still want all these
compression methods to choose similar block sizes, I'd think.)
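
If you want to see what LZ4F_compressBound() asks for at various
chunk sizes, a throwaway check like this will do (illustrative only;
NULL selects the library's default frame preferences):

#include <stdio.h>
#include <lz4frame.h>

int
main(void)
{
    /*
     * For small inputs the bound appears to be dominated by the frame's
     * block size rather than by the input size, which would account for
     * the ~64K figure mentioned above.
     */
    printf("bound for 4K chunks:  %zu\n", LZ4F_compressBound(4096, NULL));
    printf("bound for 64K chunks: %zu\n", LZ4F_compressBound(65536, NULL));
    return 0;
}
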
Anyway, these fixes should remove the need for pg_restore to look
at quite so many places in the archive file. There may still be
a need for altering the seek-versus-read behavior as you suggest,
but I think we need to re-measure that tradeoff after fixing the
pg_dump side.

regards, tom lane

Attachment: v1-improve-pg_dump-compress-buffering.patch (text/x-diff, 9.8 KB)