Re: Sending unflushed WAL in physical replication

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Sending unflushed WAL in physical replication
Date: 2025-09-27 08:23:47
Message-ID: CAHg+QDetZC7eqc70Fsw3XYpdfPG70kHOyML2nCu9KnPDKtUjpg@mail.gmail.com
Lists: pgsql-hackers

Hi Rahila,

On Thu, Sep 25, 2025 at 12:02 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:

> Hi,
>
> Please find attached a POC patch that introduces changes to the WAL sender
> and receiver, allowing WAL records to be sent to standbys before they are
> flushed to disk on the primary during physical replication. This is intended
> to improve replication latency by reducing the amount of WAL read from disk.
> For large transactions, this approach ensures that the bulk of the
> transaction’s WAL records are already sent to the standby before the flush
> occurs on the primary. As a result, the flush on the primary and standby
> happen closer together, reducing replication lag.
>

At a high level, the idea LGTM.

>
> Observations from the benchmark:
> 1. The patch improves TPS by ~13% in the sync replication setup. In repeated
> runs, I see that the TPS increase is anywhere between 5% and 13%.
> 2. WAL sender reads significantly less WAL from disk, indicating more
> efficient use of WAL buffers and reduced disk I/O.
>

Can you please measure the improvement in transaction commit latency as well?
Commit latency = Primary_disk_flush_time + Standby_disk_flush_time +
network_roundtrip_time
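
To make the request concrete, here is a rough Python sketch of the
decomposition above and of how I would report the improvement. The function
names are hypothetical and all numbers passed in are placeholders, not
measurements; the real values would have to come from the benchmark runs.

```python
def commit_latency_ms(primary_flush_ms, standby_flush_ms, network_rtt_ms):
    """Sum of the three components named above (all in milliseconds)."""
    return primary_flush_ms + standby_flush_ms + network_rtt_ms


def improvement_pct(baseline_ms, patched_ms):
    """Relative commit latency improvement of the patched run, in percent."""
    return 100.0 * (baseline_ms - patched_ms) / baseline_ms


if __name__ == "__main__":
    # Placeholder inputs only, chosen to show the arithmetic.
    baseline = commit_latency_ms(2.0, 2.0, 0.5)
    print("baseline commit latency (ms):", baseline)
    print("improvement (%):", improvement_pct(4.0, 3.0))
```

Reporting the three components separately, and not only the sum, would also
show which part of the commit path the patch actually shortens.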

>
> Following are some of the details of the implementation:
>
> 1. Primary does not wait for flush before starting to send data, so it is
> likely to send smaller chunks of data. To prevent network overload, changes
> are made to avoid sending excessively small packets.
> 2. The sender includes the current flush pointer in the replication protocol
> messages, so the standby knows up to which point WAL has been safely flushed
> on the primary.
> 3. The logic ensures that standbys do not apply transactions that have not
> been flushed on the primary, by updating the flushedUpto position on the
> standby only up to the flushPtr received from the primary.
> 4. WAL records received from the primary are written and can be flushed to
> disk on the standby, but are only marked as flushed up to the flushPtr
> reported by the primary.
>

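As I read points 3 and 4, the standby's reported flush position is capped by
the primary's flush pointer. A minimal Python sketch of that rule follows; it
is my reading of the description, not code from the patch. LSNs are modelled
as plain integers, and the function name and the "never move backwards" guard
are my own assumptions.

```python
def advance_flushed_upto(current_flushed_upto, local_flush_lsn, primary_flush_ptr):
    """Return the standby's new flushedUpto.

    current_flushed_upto: value reported so far
    local_flush_lsn:      how far the standby has flushed WAL to its own disk
    primary_flush_ptr:    flushPtr received from the primary
    """
    # Only WAL that is durable on BOTH sides counts as flushed.
    candidate = min(local_flush_lsn, primary_flush_ptr)
    # Assumption: the reported position never moves backwards.
    return max(current_flushed_upto, candidate)
```

For example, with a local flush at 500 and a primary flushPtr of 300, the
standby would report 300 even though 500 bytes are on its disk.
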
What happens in crash recovery scenarios? For example, when a standby
restarts after a crash, it replays until the end of WAL. In that case, it may
end up replaying WAL that was never flushed on the primary (if the primary
also goes through crash recovery).
Shouldn't the archiver on the standby hold off uploading WAL until it has
been flushed on the primary? The same applies to pg_receivewal.
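
To illustrate the scenario I am worried about, here is a small Python sketch.
LSNs are plain integers, both function names are hypothetical, and the
archiving rule is only the check I am asking about, not something the patch
is known to implement.

```python
def unflushed_wal_replayed(standby_end_of_wal, primary_flush_ptr):
    """Bytes of WAL a restarting standby would replay past the primary's flush.

    A crash-restarted standby replays to the end of its local WAL, so anything
    beyond the primary's flush pointer is WAL the primary may have lost.
    """
    return max(0, standby_end_of_wal - primary_flush_ptr)


def safe_to_archive(segment_end_lsn, primary_flush_ptr):
    """Hypothetical check: archive a segment only once the primary has flushed all of it."""
    return segment_end_lsn <= primary_flush_ptr


if __name__ == "__main__":
    # Standby has WAL on disk up to 500, primary had flushed only up to 300.
    print(unflushed_wal_replayed(500, 300))
    print(safe_to_archive(400, 300))
```

In this example the standby would replay 200 bytes that the primary never
made durable, and a segment ending at 400 would not yet be safe to archive.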

Thanks,
Satya
