| From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
|---|---|
| To: | Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [WIP] Pipelined Recovery |
| Date: | 2026-02-04 09:48:05 |
| Message-ID: | CAKZiRmyFjGnscqrSS4+61RHxzEDXE3c08v48GAEXFkvsJx-yUw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Jan 30, 2026 at 3:56 PM Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> wrote:
>
> Hi,
>
> Based on a suggestion by my colleague Ants Aasma, I worked on this
> idea of adding parallelism to the WAL recovery process.
>
> The crux of this idea is to decode the WAL using parallel workers. Now
> the replay process can get the records from the shared memory queue
> directly. This way, we can decrease some CPU load on the recovery process.
>
> Implementing this idea yielded an improvement of around 20% in the
> recovery times, but results may differ based on workloads. I have
> attached some benchmarks for different workloads.
>
> Following are some recovery tests with the default configs. Here p1
> shows pipeline enabled. (db size) is the backup database size on
> which the recovery happens. You can see more detail related to the
> benchmarks in the attached file `recoveries-benchmark-v01`.
>
> workload          elapsed (p0)   elapsed (p1)   % perf   db size
>
> inserts.sql       272s 10ms      197s 570ms     27.37%   480 MB
> updates.sql       177s 420ms     117s 80ms      34.01%   480 MB
> hot-updates.sql   36s 940ms      29s 240ms      20.84%   480 MB
> nonhot.sql        36s 570ms      28s 980ms      20.75%   480 MB
> simple-update     20s 160ms      11s 580ms      42.56%   4913 MB
> tpcb-like         20s 590ms      13s 640ms      33.75%   4913 MB
>
> A similar approach was also suggested by Matthias van de Meent earlier in a
> separate thread [1]. Right now I am using one bgw for decoding and filling
> up the shared message queue, and the redo apply loop simply receives the decoded record
> from the queue. After the redo is finished, the consumer (startup
> process) can request a shutdown from the producer (pipeline bgw)
> before exiting recovery.
>
> This idea can be coupled with another idea of pinning the buffers in
> parallel before the recovery process needs them. This will try to
> parallelize most of the work being done in
> `XLogReadBufferForRedoExtended`. The Redo can simply receive
> the already pinned buffers from a queue. Implementing this still needs
> some R&D, though, as IPC and pinning/unpinning of buffers across two
> processes can be tricky.
>
> If someone wants to reproduce the benchmark, they can do so using
> these scripts [2].
>
> Looking forward to your reviews, comments, etc.
Hi Imran,
It's great that you are undertaking such a cool project (I also think that
recovery performance is one of the two biggest performance limitations today).
I've taken a quick look at the attached benchmarks and to my surprise they
have flamegraphs!
Thoughts on results:
- on the flamegraphs BufferAlloc->GetVictimBuffer->FlushBuffer() shows up
often in both scenarios p0/p1. I recall from my old WAL IO prefetching stress
test experiments [1] that tuning the bgwriter played an important role (most
of the time the bgwriter values are left at their defaults). It's connected
to the pipelining idea too for sure, and somehow we are not effective there,
based on your results, mine, and some other field reports. Basically we have
the wrong bgwriter defaults for standbys, and no one is going to notice
unless they measure it in depth. We could also tweak the bgwriter in the
standby role to make the pipelining more effective that way too, but maybe
I am missing something? Maybe the bgwriter on a standby should run full-steam
as long as there is work to be done (pressure), instead of being rate-limited?
This is not a critique of the patch, but maybe you guys have investigated
that road too, as it is related to the pipelining concept?
- there's plenty of ValidXLogRecord->pg_comp_crc32c_sb8(), probably even too
much, so I think either an old CPU was used or something happened so that
SSE4.2 was not available (pg_comp_crc32c_sse42()). I'm speculating, but you
would probably get even better results from the patch (which are impressive
anyway) with pg_comp_crc32c_sse42(), because the CRC computation would not
be such a bottleneck anymore.
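A quick standalone sanity check (gcc/clang only) that the CPU even has SSE4.2
could look like the snippet below (whether the build actually picked
pg_comp_crc32c_sse42() is a separate question); not part of the patch, just
for checking the benchmark machine:

#include <stdio.h>

int
main(void)
{
    /* gcc/clang builtins: report whether the CPU advertises SSE4.2 */
    __builtin_cpu_init();
    printf("sse4.2: %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
    return 0;
}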
- The flamegraphs themselves are for the whole PostgreSQL instance, right,
or am I misunderstanding? In the long run it would probably be better to
profile just the startup/recovery process (but that's way harder to script
for sure, since you need the isolated PID: perf record -p <PID>)
- You need to rebase due to 1eb09ed63a8 from a couple of days ago:
CONFLICT (content): Merge conflict in src/backend/postmaster/bgworker.c
- naming: I don't have anything against wal_pipeline.c; it's just that the
suite of C files there is named src/backend/access/transam/xlog*.c (so
perhaps xlogrecoverypipeline.c to stay consistent?)
- I was getting lots of problems during linking (e.g. undefined reference to
`WalPipeline_IsActive'), and it appears that the patch is missing
meson support:
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -25,6 +25,7 @@ backend_sources += files(
   'xlogstats.c',
   'xlogutils.c',
   'xlogwait.c',
+  'wal_pipeline.c',
- nitpicking, but you need to add wal_pipeline to postgresql.conf.sample,
otherwise the 003_check_guc.pl test fails.
- good news is that the test with PG_TEST_EXTRA="wal_consistency_checking" passes
- but bad news is that PG_TEST_EXTRA="wal_consistency_checking" with
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on", which activates all of this,
does not pass multiple tests (I'll possibly look into those in the following
days, unless you are faster)
- at least here, on a standby freshly built with pg_basebackup -R and
restarted with wal_pipeline=on, I've got a bug:
2026-02-04 10:29:40.146 CET [335197] LOG: [walpipeline] started.
2026-02-04 10:29:40.146 CET [335197] LOG: redo starts at 0/02000028
2026-02-04 10:29:40.147 CET [335198] LOG: invalid record length
at 0/03000060: expected at least 24, got 0
2026-02-04 10:29:40.147 CET [335197] LOG: consistent recovery
state reached at 0/03000060
2026-02-04 10:29:40.147 CET [335197] LOG: [walpipeline] consumer:
received shutdown message from the producer
2026-02-04 10:29:40.147 CET [335191] LOG: database system is
ready to accept read-only connections
2026-02-04 10:29:40.157 CET [335198] LOG: [walpipeline] producer:
exiting: sent=5 received=5
2026-02-04 10:29:40.167 CET [335197] LOG: WAL pipeline stopped
2026-02-04 10:29:40.167 CET [335197] LOG: redo done at 0/03000028
system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
2026-02-04 10:29:40.169 CET [335197] LOG: selected new timeline ID: 2
2026-02-04 10:29:40.184 CET [335197] LOG: archive recovery complete
- even when starting from scratch with wal_pipeline = on it's similar:
2026-02-04 10:34:14.386 CET [335833] LOG: database system was
interrupted; last known up at 2026-02-04 10:34:02 CET
2026-02-04 10:34:14.453 CET [335833] LOG: starting backup recovery
with redo LSN 0/0B0000E0, checkpoint LSN 0/0B000138, on timeline ID 1
2026-02-04 10:34:14.453 CET [335833] LOG: entering standby mode
2026-02-04 10:34:14.482 CET [335833] LOG: [walpipeline] started.
2026-02-04 10:34:14.482 CET [335833] LOG: redo starts at 0/0B0000E0
2026-02-04 10:34:14.484 CET [335833] LOG: completed backup
recovery with redo LSN 0/0B0000E0 and end LSN 0/0B0001E0
2026-02-04 10:34:14.484 CET [335833] LOG: consistent recovery
state reached at 0/0B0001E0
2026-02-04 10:34:14.484 CET [335833] LOG: [walpipeline] consumer:
received shutdown message from the producer
2026-02-04 10:34:14.484 CET [335827] LOG: database system is ready
to accept read-only connections
2026-02-04 10:34:14.493 CET [335834] LOG: [walpipeline] producer:
exiting: sent=4 received=4
2026-02-04 10:34:14.504 CET [335833] LOG: WAL pipeline stopped
2026-02-04 10:34:14.504 CET [335833] LOG: redo done at 0/0B0001E0
system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
2026-02-04 10:34:14.506 CET [335833] LOG: selected new timeline ID: 2
It appears that the "Cannot read more records, shut it down" ->
WalPipeline_SendShutdown() path is taken, but I haven't pursued it further.
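To spell out my suspicion: in standby mode, reaching the end of the currently
available WAL is not the end of recovery, so the producer should presumably
wait for more WAL instead of sending the shutdown message. Very roughly (I
haven't read the producer code closely, so the shape of the loop below is
guessed; only the WalPipeline_* names come from the patch):

/*
 * Hypothetical producer loop; only the WalPipeline_* names are from the
 * patch, everything else is guessed for illustration.
 */
for (;;)
{
    char       *errormsg;
    XLogRecord *record = XLogReadRecord(xlogreader, &errormsg);

    if (record == NULL)
    {
        /*
         * In standby mode "no record" usually just means we caught up with
         * the end of the available WAL (cf. the "invalid record length ...
         * got 0" message above), not that recovery is over, so shutting
         * down here would explain what I'm seeing.
         */
        if (StandbyMode)
        {
            /* wait for more WAL to arrive, then retry (details elided) */
            continue;
        }

        WalPipeline_SendShutdown();
        break;
    }

    if (!WalPipeline_SendRecord(xlogreader))
        break;
}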
Also some very quick review comments:
+bool
+WalPipeline_SendRecord(XLogReaderState *record)
+{
[..]
+
+    if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
+    {
+        elog(WARNING, "[walpipeline] producer: wal record at %X/%X too large (%zu bytes), skipping",
+             LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
+        pfree(buffer);
+        return true;
+    }
When/why could this happen, and if it did happen, shouldn't this be more
like a PANIC instead?
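If it really is a can't-happen case, I'd expect something more like this
(just sketching the suggestion, not tested):

    if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
        elog(PANIC, "[walpipeline] producer: wal record at %X/%X too large (%zu bytes)",
             LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);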
+/* Size of the shared memory queue (can be made configurable) */
+#define WAL_PIPELINE_QUEUE_SIZE (128 * 1024 * 1024) /* 8 MB */
+
+/* Maximum size of a single message */
+#define WAL_PIPELINE_MAX_MSG_SIZE (2 * 1024 * 1024) /* 1 MB */
Maybe those should take into account (XLOG_)BLCKSZ too?
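Just to illustrate what I mean, keeping the current numeric sizes and assuming
the usual 8 kB XLOG_BLCKSZ (sketch only, not tested):

/* same sizes as above, just expressed in WAL blocks */
#define WAL_PIPELINE_QUEUE_SIZE    (XLOG_BLCKSZ * 16384)  /* 128 MB at 8 kB blocks */
#define WAL_PIPELINE_MAX_MSG_SIZE  (XLOG_BLCKSZ * 256)    /* 2 MB at 8 kB blocks */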
-J.