[PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date: 2025-10-20 18:40:27
Message-ID: 9opr64ps-625r-667n-q19o-op35rs414n59@tzk.arg
Lists: pgsql-hackers

On Thursday 2025-10-16 19:01, Tom Lane wrote:

>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done. Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?

Sorry for the delay; I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.
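
For reference, this is the kind of setup I mean (commands are
illustrative; the paths, mount options and database name are
assumptions, not what I actually typed):

    # btrfs with transparent compression, e.g. mounted with:
    #   mount -o compress=zstd /dev/sdX /mnt/backup
    pg_dump --format=custom --compress=none \
            -f /mnt/backup/huge.pg_dump testdb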

I notice that things have improved immensely!
Using the test you suggested (see NOTE1):

pg_restore -t last_table -f /dev/null huge.pg_dump

1. The strace output is much more reasonable now; basically it
repeats the pattern (a sketch of the loop behind these syscalls
follows this list):

read(4k)
lseek(~128k forward)

As a reminder, with old archives it was repeating the pattern:

read(4k)
lseek(4k forward)
lseek(same offset as above) x ~80 times

2. The I/O speed is better than before:

On my 20TB HDD I get a 30-50 MB/s read rate.

With old archives I get a 10-20 MB/s read rate.

3. Time to complete: ~25 min

4. CPU usage is low. With old archives the pg_restore process shows
high *system* CPU (because of the sheer number of syscalls).
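
For context, here is a hedged, standalone sketch of where that
read/seek pattern comes from. It mimics the block-skipping loop in
pg_backup_custom.c's _skipData(), but it is not the actual source, and
read_block_length() is a hypothetical stand-in for the archive's
integer reader:

    #include <stdio.h>

    /* Hypothetical stand-in for pg_dump's integer reader: each data
     * block is a length word followed by payload, terminated by a
     * zero length.  The fread() here is what shows up as read(4k) in
     * strace, via stdio's buffer refill. */
    static long
    read_block_length(FILE *fh)
    {
        unsigned char b[4];

        if (fread(b, 1, 4, fh) != 4)
            return 0;
        return (long) b[0] | ((long) b[1] << 8) |
               ((long) b[2] << 16) | ((long) b[3] << 24);
    }

    /* Skip a table's data without parsing it: one fseeko() per
     * block, which shows up as the lseek(~128k forward) in strace. */
    static void
    skip_data(FILE *fh)
    {
        long    blkLen;

        while ((blkLen = read_block_length(fh)) != 0)
            fseeko(fh, blkLen, SEEK_CUR);
    }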

I can't really compare the actual runtime between the old and new
dumps, because the two are very different. But I have no doubt the new
dump is several times faster to seek through.

NOTE1: My original test case was

pg_restore -t last_table -j $NCPU -d testdb

This test case does not show as big an improvement,
because every one of the parallel workers is
concurrently seeking through the dump file.

*** All of the above was measured on master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
"Don't rely on zlib's gzgetc() macro."

*** Below I have applied the attached patch ***

Regarding the attached patch (rebased, with an edited commit message):
it basically replaces seek(up to 1MB forward) with read(). The 1MB
number is somewhat off the top of my head, but tweaking it anywhere
between 128KB and 1MB wouldn't really change anything, given that the
block size is now 128KB: the read() will always be chosen over the
seek(). Do you know of a real-world case with block sizes >128KB?
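
To make that concrete, here is a hedged sketch of the decision,
continuing the earlier sketch (my paraphrase of the patch's intent,
not its actual code; the threshold name and helper are made up for
illustration). It would replace the bare fseeko() in the skip loop
sketched above:

    #define SEEK_THRESHOLD  (1024 * 1024)   /* the 1MB knob from above */

    /* Consume short forward jumps with buffered fread() so the kernel
     * sees sequential reads; fall back to a real fseeko() only for
     * jumps longer than the threshold. */
    static int
    skip_forward(FILE *fh, long blkLen)
    {
        char    buf[64 * 1024];

        if (blkLen > SEEK_THRESHOLD)
            return fseeko(fh, blkLen, SEEK_CUR);

        while (blkLen > 0)
        {
            size_t  chunk = (blkLen < (long) sizeof(buf))
                            ? (size_t) blkLen : sizeof(buf);

            if (fread(buf, 1, chunk, fh) != chunk)
                return -1;      /* read error or unexpected EOF */
            blkLen -= (long) chunk;
        }
        return 0;
    }

Since every 128KB block falls under the threshold, the file ends up
being read strictly sequentially, which is presumably why the HDD can
stream at full speed in the numbers below.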

Anyway I tried it with the new archive from above.

1. The strace output is a loop of the following:

read(4k)
read(~128k)

2. Read rate is 150-250 MB/s, basically the max the HDD can deliver.

3. Time to complete: ~5 min

4. CPU usage: HIGH (63%), most likely because of the sheer amount
of data it's parsing.

Regards,
Dimitris

Attachment Content-Type Size
v4-0001-parallel-pg_restore-avoid-disk-seeks-when-moving-.patch text/x-patch 2.6 KB
