Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date: 2025-07-28 13:34:20
Message-ID: 1po8os49-r63o-2923-p37n-12698o1qn7p0@tzk.arg
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

> Out of curiosity I've tried the same with an uncompressed dump
> (--compress=none). Surprisingly it seems the blocksize is even smaller.
>
> With my patched pg_restore I only get 4K reads and nothing else on the strace
> output.
>
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096

To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on at this point since I have
better insight now.

The code in question, in pg_backup_custom.c:

/*
* Skip data from current file position.
* Data blocks are formatted as an integer length, followed by data.
* A zero length indicates the end of the block.
*/
static void
_skipData(ArchiveHandle *AH)
{
lclContext *ctx = (lclContext *) AH->formatData;
size_t blkLen;
char *buf = NULL;
int buflen = 0;

blkLen = ReadInt(AH);
while (blkLen != 0)
{
/* Sequential access is usually faster, so avoid seeking if the
* jump forward is shorter than 1MB. */
if (ctx->hasSeek && blkLen > 1024 * 1024)
{
if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
pg_fatal("error during file seek: %m");
}
else
{
if (blkLen > buflen)
{
free(buf);
buf = (char *) pg_malloc(blkLen);
buflen = blkLen;
}
if (fread(buf, 1, blkLen, AH->FH) != blkLen)
{
if (feof(AH->FH))
pg_fatal("could not read from input file: end of file");
else
pg_fatal("could not read from input file: %m");
}
}

blkLen = ReadInt(AH);
}

free(buf);
}

blkLen is almost always a number around 35 to 38.
So fread() is called all the time doing reads of about ~35 bytes.
Then ReadInt() is actually doing getc() a few times.
And it loops over.

Libc is doing buffering of 4k, and that's how we end up seeing so many
4k reads. This also explains the ~80 lseek() between each 4k read() on
the unpatched version, mentioned in previous email.

I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.

Question that remains: where is pg_dump setting this ~35B length block?
Is that easy to change without breaking old versions?

Thanks in advance,
Dimitris

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-07-28 13:58:08 Re: [PATCH] avoid double scanning in function byteain
Previous Message Andrei Lepikhov 2025-07-28 13:29:06 Re: Pushing down a subquery relation's ppi_clauses, and more ...