Sending unflushed WAL in physical replication

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Sending unflushed WAL in physical replication
Date: 2025-09-25 19:02:28
Message-ID: CAH2L28tHzvZgtL7MHDK86Rzz56f+74mgZo-uKQNJHob7_JDb-w@mail.gmail.com
Lists: pgsql-hackers

Hi,

Please find attached a POC patch that introduces changes to the WAL sender
and receiver, allowing WAL records to be sent to standbys before they are
flushed to disk on the primary during physical replication. This is
intended to improve replication latency by reducing the amount of WAL read
from disk.

For large transactions, this approach ensures that the bulk of the
transaction’s WAL records are already sent to the standby before the flush
occurs on the primary. As a result, the flushes on the primary and standby
happen closer together, reducing replication lag.

Observations from the benchmark:
1. The patch improves TPS by ~13% in the sync replication setup. In
repeated runs, the TPS increase ranges from 5% to 13%.
2. The WAL sender reads significantly less WAL from disk, indicating more
efficient use of WAL buffers and reduced disk I/O.

Following are some of the details of the implementation:

1. The primary does not wait for a flush before starting to send data, so
it is likely to send smaller chunks of data. To prevent network overload,
changes are made to avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication
protocol messages, so the standby knows up to which point WAL has been
safely flushed on the primary.
3. The logic ensures that standbys do not apply transactions that have not
been flushed on the primary, by updating the flushedUpto position on the
standby only up to the flushPtr received from the primary.
4. WAL records received from the primary are written and can be flushed to
disk on the standby, but are only marked as flushed up to the flushPtr
reported by the primary.
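
The points above can be modeled with a short Python sketch. It is only an
illustration of the described behavior, not code from the patch: the names
bytes_to_send, MIN_SEND_BYTES and StandbyWalState, the 8 KB threshold, and
the exact rule for when to send are all assumptions, and LSNs are modeled
as plain integers.

```python
from dataclasses import dataclass

# Hypothetical threshold below which unflushed WAL is held back (assumption).
MIN_SEND_BYTES = 8192


def bytes_to_send(sent_ptr, available_ptr, flush_ptr, min_send=MIN_SEND_BYTES):
    """Sender side: how many bytes of WAL to send right now.

    available_ptr is the end of WAL that can be read (flushed or not).
    Small unflushed chunks are held back to avoid tiny packets, but WAL
    that is already flushed on the primary is never delayed.
    """
    pending = available_ptr - sent_ptr
    if pending <= 0:
        return 0
    if pending >= min_send or flush_ptr > sent_ptr:
        return pending
    return 0


@dataclass
class StandbyWalState:
    written_upto: int = 0  # end of WAL received and written locally
    flushed_upto: int = 0  # position exposed as flushed (safe to apply)

    def on_wal_message(self, end_ptr, primary_flush_ptr):
        """Record received WAL and advance flushed_upto.

        flushed_upto never passes the primary's flush pointer, never passes
        what was actually received, and never moves backwards.
        """
        self.written_upto = max(self.written_upto, end_ptr)
        candidate = min(self.written_upto, primary_flush_ptr)
        self.flushed_upto = max(self.flushed_upto, candidate)
        return self.flushed_upto
```

In this model, a standby that has received WAL up to 1000 while the primary
reports a flush pointer of 400 exposes only 400 as flushed; the remaining
600 bytes are on the standby but not yet eligible for apply.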

Benchmark details are as follows:
Synchronous replication with remote write enabled.
Two Azure VMs: Central India (primary), Central US (standby).
OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).

With the patch
TPS: 115
WAL read from disk by WAL sender: ~40MB (read bytes from pg_stat_io)
WAL generated during the test: 772705760 bytes

Without the patch
TPS: 102
WAL read from disk by WAL sender: ~79MB (read bytes from pg_stat_io)
WAL generated during the test: 760060792 bytes
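
For reference, the relative differences implied by the figures above can be
worked out as follows. This is plain arithmetic on the reported numbers;
the helper name pct_change is only illustrative.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures as reported in the benchmark above.
tps_gain = pct_change(102, 115)                     # TPS without -> with patch
read_change = pct_change(79, 40)                    # MB read from disk by WAL sender
wal_change = pct_change(760060792, 772705760)       # WAL bytes generated

print(f"TPS change:            {tps_gain:+.1f}%")    # +12.7%
print(f"WAL read from disk:    {read_change:+.1f}%") # -49.4%
print(f"WAL generated:         {wal_change:+.1f}%")  # +1.7%
```

The TPS gain of about 12.7% matches the ~13% quoted earlier. The WAL sender
read roughly half as much from disk, while the amount of WAL generated
differed by under 2% between the two runs.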

Commit hash: b1187266e0

pgbench -c 32 -j 4 postgres -T 300 -f wal_test.sql

wal_test.sql (each transaction generates ~36KB of WAL):
\set delta random(1, 500)
BEGIN;
INSERT INTO wal_bloat_:delta (data)
SELECT repeat('x', 8000)
FROM generate_series(1, 80);
COMMIT;
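
The script expects tables wal_bloat_1 through wal_bloat_500 to exist, but
their definition is not shown. A minimal sketch for generating them,
assuming a single text column named data plus a surrogate key (the actual
schema used in the benchmark is not given, so this is a guess):

```python
def setup_statements(n_tables=500):
    """Return one CREATE TABLE statement per wal_bloat_N table.

    The column layout is an assumption; only the column name "data" is
    implied by the INSERT in wal_test.sql.
    """
    return [
        f"CREATE TABLE IF NOT EXISTS wal_bloat_{i} "
        "(id bigserial PRIMARY KEY, data text);"
        for i in range(1, n_tables + 1)
    ]


if __name__ == "__main__":
    # Print the DDL so it can be piped into psql.
    print("\n".join(setup_statements()))
```

If saved as gen_tables.py (a hypothetical file name), the output could be
loaded with: python gen_tables.py | psql postgres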

TODO:
1. Ensure there is a robust mechanism on the receiver to prevent WAL
records that are not flushed on the primary from being applied on the
standby, under any circumstances.
2. When smaller chunks of WAL are received on the standby, it can lead to
more frequent disk write operations. To mitigate this issue, employing WAL
buffers on the standby could be a more effective approach. Evaluate the
performance impact of using WAL buffers on the standby.
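
The buffering idea in item 2 can be illustrated with a small Python model
that coalesces many small received chunks into fewer, larger writes. This
is only a sketch of the general technique, not a design for standby WAL
buffers: the class name CoalescingWalBuffer, the write_fn callback, and the
64 KB threshold are all assumptions.

```python
class CoalescingWalBuffer:
    """Accumulate small WAL chunks and write them out in larger batches."""

    def __init__(self, write_fn, threshold=64 * 1024):
        self._write_fn = write_fn    # called with one bytes object per write
        self._threshold = threshold  # flush once this many bytes are pending
        self._chunks = []
        self._size = 0

    def append(self, chunk):
        """Buffer a received chunk; write out if the threshold is reached."""
        self._chunks.append(chunk)
        self._size += len(chunk)
        if self._size >= self._threshold:
            self.flush()

    def flush(self):
        """Write all pending bytes as a single write, if there are any."""
        if self._size == 0:
            return
        self._write_fn(b"".join(self._chunks))
        self._chunks.clear()
        self._size = 0


# Usage: three 4-byte chunks with a 10-byte threshold produce one write.
writes = []
buf = CoalescingWalBuffer(writes.append, threshold=10)
for piece in (b"abcd", b"efgh", b"ijkl"):
    buf.append(piece)
```

A real implementation would also need to flush explicitly whenever the
primary's flush pointer advances, so that buffering never delays the
acknowledgement a synchronous commit is waiting for.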

A similar idea was proposed here:
Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
<https://www.postgresql.org/message-id/flat/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com>
This idea was also discussed recently here:
https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com

Kindly let me know your thoughts.

Thank you,
Rahila Syed

Attachment: 0001-Changes-for-sending-of-WAL-records-before-flush.txt (text/plain, 15.4 KB)
