| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Xuneng Zhou <xunengzhou(at)gmail(dot)com>, Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> |
| Cc: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [WIP] Pipelined Recovery |
| Date: | 2026-04-03 06:58:39 |
| Message-ID: | CAAAe_zCxg2NTG_i1erLQQr8Wn+6SQ3EMOmp+N4J58Xxb21g2BQ@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Xuneng, Imran, and everyone,
> I’m curious how this approach differs from those previous efforts, and
> why those attempts ultimately did not land.
There is directly relevant prior art that may be worth looking at.
Koichi Suzuki presented parallel recovery at PGCon 2023 [1] and
published a detailed design on the PostgreSQL wiki [2] with a working
prototype on GitHub.
Koichi's approach is quite different from the current patch: instead of
pipelining decode, he parallelizes redo itself by dispatching WAL
records to block workers based on page identity. The key rule is that
for a given page, WAL records are applied in written order, but
different pages can be replayed in parallel by different workers.
His design uses a dispatcher to route records to workers, with
synchronization needed for multi-block WAL records. One thing I
wondered is whether the dispatcher could be avoided entirely: if each
child simply reads the whole WAL stream on its own and skips blocks
that are not assigned to it, there would be no IPC and no need to
coordinate multi-block records across workers.
The hard problem he ran into was Hot Standby visibility: when index and
heap pages are replayed by different workers at different speeds,
concurrent queries can see inconsistent state. The wiki itself notes
the idea is to "use this when hot standby is disabled." As far as I
know, this was never submitted as a patch to hackers.
> It also raises an implicit question: what makes the current approach
> more promising—whether due to a simpler design or improved
> performance.
>
The two approaches target different bottlenecks. The current patch
parallelizes WAL decoding, which keeps the redo path single-threaded
and avoids the Hot Standby visibility problem entirely.
One thing I am curious about in the current patch: WAL records are
already in a serialized format on disk. The producer decodes them and
then re-serializes into a different custom format for shm_mq. What is
the advantage of this second serialization format over simply passing
the raw WAL bytes after CRC validation and letting the consumer decode
directly? Offloading CRC to a separate core could still improve
throughput at the cost of higher total CPU usage, without needing the
custom format.
Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
a larger cost — Jakub's flamegraphs show BufferAlloc ->
GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 — but at
the expense of much harder concurrency problems.
Whether the decode pipelining ceiling is high enough, or whether the
redo parallelization complexity is tractable, seems like the central
design question for this area.
[1]
https://www.pgcon.org/2023/schedule/session/392-parallel-recovery-in-postgresql/
[2] https://wiki.postgresql.org/wiki/Parallel_Recovery
Best regards,
Henson