
Sending unflushed WAL in physical replication

From:	Rahila Syed <rahilasyed90(at)gmail(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject:	Sending unflushed WAL in physical replication
Date:	2025-09-25 19:02:28
Message-ID:	CAH2L28tHzvZgtL7MHDK86Rzz56f+74mgZo-uKQNJHob7_JDb-w@mail.gmail.com
Lists:	pgsql-hackers

Hi,

Please find attached a POC patch that changes the WAL sender and receiver so
that, during physical replication, WAL records can be sent to standbys before
they are flushed to disk on the primary. This is intended to improve
replication latency by reducing the amount of WAL read from disk.
For large transactions, this approach ensures that the bulk of the
transaction's WAL records has already been sent to the standby before the
flush occurs on the primary. As a result, the flushes on the primary and the
standby happen closer together, reducing replication lag.

Observations from the benchmark:
1. The patch improves TPS by ~13% in the synchronous replication setup. In
repeated runs, the TPS increase ranges from 5% to 13%.
2. The WAL sender reads significantly less WAL from disk, indicating more
efficient use of WAL buffers and reduced disk I/O.

Following are some of the details of the implementation:

1. The primary does not wait for a flush before starting to send data, so it
is likely to send smaller chunks of data. To prevent network overload,
changes are made to avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication protocol
messages, so the standby knows up to which point WAL has been safely flushed
on the primary.
3. The logic ensures that standbys do not apply transactions that have not
been flushed on the primary, by updating the flushedUpto position on the
standby only up to the flushPtr received from the primary.
4. WAL records received from the primary are written, and can be flushed to
disk, on the standby, but are only marked as flushed up to the flushPtr
reported by the primary.

Benchmark details are as follows:
Synchronous replication with remote write enabled.
Two Azure VMs: Central India (primary), Central US (standby).
OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).

With the patch
TPS: 115
WAL read from disk by WAL sender: ~40MB (read bytes from pg_stat_io)
WAL generated during the test: 772705760 bytes

Without the patch
TPS: 102
WAL read from disk by WAL sender: ~79MB (read bytes from pg_stat_io)
WAL generated during the test: 760060792 bytes

Commit hash: b1187266e0

pgbench -c 32 -j 4 -T 300 -f wal_test.sql postgres

wal_test.sql (each transaction generates ~36KB of WAL):
\set delta random(1, 500)
BEGIN;
INSERT INTO wal_bloat_:delta (data)
SELECT repeat('x', 8000)
FROM generate_series(1, 80);
COMMIT;

TODO:
1. Ensure there is a robust mechanism on the receiver that prevents WAL
records not yet flushed on the primary from being applied on the standby,
under any circumstances.
2. Receiving WAL in smaller chunks on the standby can lead to more frequent
disk writes. To mitigate this, using WAL buffers on the standby could be a
more effective approach. Evaluate the performance impact of using WAL
buffers on the standby.

A similar idea was proposed here:
Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
<https://www.postgresql.org/message-id/flat/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com>
This idea was also discussed recently here:
https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com

Kindly let me know your thoughts.

Thank you,
Rahila Syed

Attachment	Content-Type	Size
0001-Changes-for-sending-of-WAL-records-before-flush.txt	text/plain	15.4 KB

Responses

Re: Sending unflushed WAL in physical replication at 2025-09-27 08:23:47 from SATYANARAYANA NARLAPURAM
Re: Sending unflushed WAL in physical replication at 2025-09-27 11:46:02 from Andrey Borodin
