| From: | Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> |
|---|---|
| To: | Xuneng Zhou <xunengzhou(at)gmail(dot)com> |
| Cc: | assam258(at)gmail(dot)com, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [WIP] Pipelined Recovery |
| Date: | 2026-06-23 13:27:10 |
| Message-ID: | CA+UBfamzeXcdEbmhdOdjWn5X_YVt9n8xUpH5ZwmA7S8VWvaoXw@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi
I am attaching the new series of patches.
What has changed?
* Rebased
* The patch set is now split into two new patches. This will make the
code easier to understand and review.
* The v4-0003 patch contains code mostly related to keeping the
recovery states synced between the startup process and the pipeline
process. Most of these changes were required to make streaming
replication work.
* The v4-0002 patch now only contains the consumer code that handles
receiving the decoded records from the shmem queue and moving the redo
loop forward.
* The v4-0004 patch contains some basic tests to check that the pipeline
worker is functioning as expected. More testing was done by passing
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on" before running the
recovery test suite.
* Other than that, the CPU overhead during deserialization is
reduced by avoiding multiple copies of the decoded record and
passing the pointer directly to the shmem queue. There is still some
overhead visible during serialization that could be improved at the
producer end.
* Signal handling for the pipeline worker is improved so that
promotion signals are sent to both the startup process and the
producer worker by the postmaster.
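The split described in the list above (a producer worker that decodes
records and hands them over through a queue, and a consumer that only
receives and applies them) can be modelled roughly as follows. This is
a toy Python sketch, not code from the patches: threads and
queue.Queue stand in for the separate processes and the shmem queue,
and decode_record(), producer() and replay() are hypothetical names
made up for the illustration.

```python
import queue
import threading

SENTINEL = None  # marks "end of WAL" in this toy model


def decode_record(raw: bytes) -> dict:
    """Hypothetical stand-in for WAL decoding: split b'rmgr:payload'."""
    rmgr, payload = raw.split(b":", 1)
    return {"rmgr": rmgr.decode(), "payload": payload}


def producer(raw_records, q: queue.Queue) -> None:
    """Decode records ahead of the consumer and push them to the queue."""
    for raw in raw_records:
        # put() blocks while the bounded queue is full, so the producer
        # cannot run arbitrarily far ahead of the redo loop.
        q.put(decode_record(raw))
    q.put(SENTINEL)


def replay(raw_records, queue_size: int = 4) -> list:
    """Consumer side: receive decoded records in order and 'apply' them."""
    q = queue.Queue(maxsize=queue_size)
    worker = threading.Thread(target=producer, args=(raw_records, q))
    worker.start()
    applied = []
    while True:
        rec = q.get()
        if rec is SENTINEL:
            break
        # Applying is modelled as simply remembering what was replayed.
        applied.append((rec["rmgr"], rec["payload"]))
    worker.join()
    return applied


if __name__ == "__main__":
    wal = [b"heap:insert-1", b"btree:split-2", b"xact:commit-3"]
    for rmgr, payload in replay(wal):
        print(rmgr, payload)
```

The point of the model is only the ordering guarantee: a single
producer and a single consumer on a FIFO queue keep the redo loop
single-threaded and in record order while decoding happens elsewhere.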
The new benchmarks are linked at [1], and a PDF report with an
overview is attached. Simple CPU profiling of the pipelined startup
process shows that the CPU overhead of reading records has been
removed from the startup process and offloaded to the producer worker.
Before pipelining:
Around 50% of the CPU time is spent fetching WAL records. Note that
the pipeline is off in this workload, so the new function
ReceiveRecord() is just a wrapper around ReadRecord().
Children  Self   Command   Shared O  Symbol
- 98.85%  0.21%  postgres  postgres  [.] PerformWalRecovery
   - 98.64% PerformWalRecovery
      - 51.00% ReceiveRecord
         - 50.78% ReadRecord
            - 50.52% XLogPrefetcherReadRecord
               - 49.61% XLogPrefetcherNextBlock
                  + 25.33% XLogReadAhead
                  + 22.32% PrefetchSharedBuffer
                  + 0.76% smgropen
      - 46.68% ApplyWalRecord
         + 29.23% heap_redo
         + 9.51% heap2_redo
         + 4.74% btree_redo
         + 1.11% xlog_redo
         + 0.80% xact_redo
After pipelining:
Here the only work the startup process does to obtain a record is to
fetch the decoded record from the queue. Most of the CPU time (89.13%)
is now spent applying WAL records.
Children  Self   Command   Shared O  Symbol
- 98.74%  0.37%  postgres  postgres  [.] PerformWalRecovery
   - 98.37% PerformWalRecovery
      - 89.13% ApplyWalRecord
         + 56.89% heap_redo
         + 18.28% heap2_redo
         + 8.01% btree_redo
         + 2.02% xlog_redo
         + 1.15% xact_redo
      - 7.80% ReceiveRecord
         + 7.63% WalPipeline_ReceiveRecord
This CPU optimization can only be measured when the recovery process
is not I/O bound. Running pgbench on a workload that fits entirely in
memory shows around 30% performance gains. You can see more
benchmarking details in the drive link [1].
Some comments on the attached PDF and the benchmarking: the results
show that the pipeline gives a larger performance advantage when
most of the workload runs in memory, i.e. when enough shared
buffers are configured.
If you want to do some experiments, please be my guest; I would be
happy to see more testing. You can share what performance advantage
you are getting from this. You can also refer to the benchmarking
script that I have been using [2].
Looking forward to your review, comments, etc.
Thanks,
Imran Zaheer
[1]: https://drive.google.com/file/d/13FATRT3kjh_y1wWETpYQh4ZXVLNaYU4A/view?usp=sharing
[2]: https://github.com/imranzaheer612/pg-recovery-testing
On Wed, Apr 22, 2026 at 2:44 PM Xuneng Zhou <xunengzhou(at)gmail(dot)com> wrote:
>
> Hi Henson, Imran,
>
> On Wed, Apr 8, 2026 at 7:14 PM Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> wrote:
> >
> > Hi
> >
> > I am uploading the new version with the following fixes
> >
> > * Rebased version.
> > * Skip serialization of decoded records. As pointed out by Henson,
> > there was no need to serialize the records again
> > for the shm_mq. We can simply pass the contiguous bytes with minor
> > pointer fixing to the shm_mq
> >
> > This time I am uploading the benchmarking results to drive and
> > attaching the link here. Otherwise my mail will get held for
> > moderation (my guess is that the overall attachment size is greater than 1MB).
> >
> > I am still not sure whether my testing approach is good enough.
> > Because sometimes I am not able to get the same performance
> > improvement
> > with the pgbench builtin scripts as I got with the custom sql scripts.
> > Maybe pgbench is not creating enough WAL to test on
> > or maybe I am missing something.
> >
> > Benchmarks: https://drive.google.com/file/d/1Y4SYVnrFEQRE5T2r87rrTr7SWC9m19Si/view?usp=sharing
> >
> > Thanks & Regards
> > Imran Zaheer
> >
> > On Wed, Apr 8, 2026 at 1:46 PM Imran Zaheer <imran(dot)zhir(at)gmail(dot)com> wrote:
> > >
> > > >
> > > > Hi Xuneng, Imran, and everyone,
> > > >
> > >
> > > Hi Henson and Xuneng.
> > >
> > > Thanks for explaining the approaches to Xuneng.
> > >
> > > >
> > > > The two approaches target different bottlenecks. The current patch
> > > > parallelizes WAL decoding, which keeps the redo path single-threaded
> > > > and avoids the Hot Standby visibility problem entirely.
> > > >
> > >
> > > You are right, both approaches target different bottlenecks. The
> > > pipeline patch aims to improve overall CPU throughput and to save
> > > CPU time by offloading the steps we can safely do in parallel
> > > without causing synchronization problems.
> > >
> > > > One thing I am curious about in the current patch: WAL records are
> > > > already in a serialized format on disk. The producer decodes them and
> > > > then re-serializes into a different custom format for shm_mq. What is
> > > > the advantage of this second serialization format over simply passing
> > > > the raw WAL bytes after CRC validation and letting the consumer decode
> > > > directly? Offloading CRC to a separate core could still improve
> > > > throughput at the cost of higher total CPU usage, without needing the
> > > > custom format.
> > > >
> > >
> > > Thanks. You are right, there was no need to serialize the decoded record again.
> > > I was not aware that we already have contiguous bytes in memory. In my
> > > next patch I will remove this extra serialization step.
> > >
> > > > Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
> > > > a larger cost — Jakub's flamegraphs show BufferAlloc ->
> > > > GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 — but at
> > > > the expense of much harder concurrency problems.
> > > >
> > > > Whether the decode pipelining ceiling is high enough, or whether the
> > > > redo parallelization complexity is tractable, seems like the central
> > > > design question for this area.
> > >
> > > I still have to investigate the problem related to `GetVictimBuffer` that
> > > Jakub mentioned. Meanwhile, I have been looking at how we can safely
> > > offload the work done by `XLogReadBufferForRedoExtended` to a separate
> > > pipeline worker, or whether we can prefetch the buffer header so the
> > > main redo loop doesn't have to spend time getting the buffer.
>
> Thanks for your clarification! I'll try to review this patch later.
>
> --
> Best,
> Xuneng
| Attachment | Content-Type | Size |
|---|---|---|
| v4-0004-Pipelined-Recovery-Add-Tap-test.patch | text/x-patch | 7.3 KB |
| v4-0001-Pipelined-Recovery-Producer-Related-Code.patch | text/x-patch | 28.4 KB |
| v4-0002-Pipelined-Recovery-Consumer-Related-Code.patch | text/x-patch | 19.2 KB |
| v4-0003-Pipelined-Recovery-Decoupling-startup-and-produce.patch | text/x-patch | 38.0 KB |
| recoveries-becnhmark-v04.pdf | application/pdf | 35.7 KB |