Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dimitrios Apostolou <jimis(at)gmx(dot)net>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date: 2025-10-20 21:21:16
Message-ID: 375562.1760995276@sss.pgh.pa.us
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
> Regarding the attached patch (rebased and edited commit message), it
> basically replaces seek(up to 1MB forward) with read(). The 1MB number
> comes a bit out of the top of my head. But tweaking it between 128KB and
> 1MB wouldn't really change anything, given that the block size is now
> 128KB: The read() will always be chosen against the seek(). Do you know
> of a real-world case with block sizes >128KB?

Yeah, with the recent changes I'd expect table data to pretty much
always consist of blocks around 128K, unless the table is smaller
than that of course.

I experimented with this patch locally and came away not too
impressed; it seems the results may be highly platform-dependent.

In the interests of having a common benchmark case that's easy to
replicate, here's precisely what I did:

Use non-assert build of current HEAD (4bea91f21 at the moment).

$ createdb bench
$ time pgbench -i -s 10000 bench

real 14m40.474s
user 1m26.717s
sys 0m5.045s
$ psql bench
...
bench=# create table zedtable(f1 int);
CREATE TABLE
bench=# insert into zedtable values(42);
INSERT 0 1
bench=# \q
$ time pg_dump -Fc --compress=none bench | cat >bench10000.dump

real 7m48.969s
user 0m36.334s
sys 1m35.209s

(At this -s value, the database occupies about 147G and the dump
file about 95G. It's important the dump file not fit in RAM.)

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real 1m12.646s
user 0m0.355s
sys 0m5.083s

This compares rather favorably to "cat":

$ time cat bench10000.dump >/dev/null

real 3m6.627s
user 0m0.167s
sys 0m30.999s

I then applied your patch and repeated the restore run:

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real 2m39.138s
user 0m0.386s
sys 0m28.493s

So for me, the proposed patch actually makes it 2X slower.

Watching it with "iostat 1", I'm seeing about 40MB/s disk read
rate with HEAD, and 500MB/s with the patch; "cat" also shows
read rate around 500MB/s. So yeah, we can saturate the disk
interface by doing all reads and no seeks, but that doesn't
net out faster.

I did this on a few-years-old Dell Precision 5820 workstation.
The specs for it are a bit vague about the disk subsystem:
Storage Drive Controllers
Integrated Intel AHCI SATA chipset controller (8x 6.0Gb/s), SW RAID 0,1,5,10
Storage Drive
2.5 1.92TB SATA AG Enterprise Solid State Drive
and hdparm isn't enormously helpful either:

ATA device, with non-removable media
Model Number: SSDSC2KB019T8R
Serial Number: PHYF1291017A1P9DGN
Firmware Revision: XCV1DL69
Media Serial Num:
Media Manufacturer:
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x006d)
Supported: 10 9 8 7 6 5
Likely used: 10

I'm running RHEL 8.10, file system is xfs.

So I find this a bit discouraging. It's not clear why you're seeing a
win and I'm not, and it's even less clear whether there'd be enough
of a win across enough platforms to make it worth trying to engineer
a solution that helps many more people than it hurts.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2025-10-20 21:22:29 Re: abi-compliance-check failure due to recent changes to pg_{clear,restore}_{attribute,relation}_stats()
Previous Message David E. Wheeler 2025-10-20 21:18:33 Re: abi-compliance-check failure due to recent changes to pg_{clear,restore}_{attribute,relation}_stats()