Re: Unnecessary delay in streaming replication due to replay lag

From: sunil s <sunilfeb26(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Huansong Fu <huansong(dot)fu(dot)info(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Unnecessary delay in streaming replication due to replay lag
Date: 2025-11-05 14:05:27
Message-ID: CAOG6S4_fFCU6iV4uvrdC8oDRmQqbjn8cBcpTXQjAE0W8sCQrAg@mail.gmail.com
Lists: pgsql-hackers

> When this parameter is set to 'startup' or 'consistency', what happens
> if replication begins early and the startup process fails to replay
> a WAL record—say, due to corruption—before reaching the replication
> start point? In that case, the standby might fail to recover correctly
> because of missing WAL records,

Let’s compare the behavior with and without this patch:

Without the patch:

*Scenario 1:* With a large recovery_min_apply_delay (e.g., 2 hours)

Even in this case, the flush acknowledgment for the streamed WAL is sent, and
the primary has already recycled those WAL files. If a corrupted record is
encountered later during replay, re-streaming those records is not possible.

*Scenario 2:* With recovery_min_apply_delay = 0, or in normal standby
operation

In this case the restart_lsn is advanced based on flushPtr, allowing the
primary to recycle the corresponding WAL files.

If a corrupt record is encountered while replaying locally available WAL
records, streaming would fail here as well, right?
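
To illustrate that point, here is a minimal standalone sketch (not PostgreSQL
source code; the numbers are made up): the slot's restart_lsn follows the
standby's flush position, so segments that are flushed but not yet replayed
are already recyclable on the primary.

/*
 * Conceptual sketch only: restart_lsn tracking follows the standby's
 * *flush* position, not its *apply* position, so segments the standby
 * has flushed but not yet replayed can be recycled on the primary.
 */
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define SEG_SIZE (16 * 1024 * 1024)   /* default 16 MB WAL segment */

int
main(void)
{
    XLogRecPtr flushPtr = 10 * (XLogRecPtr) SEG_SIZE; /* 10 segments flushed */
    XLogRecPtr applyPtr = 2 * (XLogRecPtr) SEG_SIZE;  /* only 2 replayed (apply delay) */

    /* The physical slot's restart_lsn advances with the flush feedback... */
    XLogRecPtr restart_lsn = flushPtr;

    /* ...so everything older than restart_lsn is recyclable on the primary, */
    /* including the 8 segments the standby has not replayed yet.            */
    printf("restart_lsn follows flushPtr: segment %lu\n",
           (unsigned long) (restart_lsn / SEG_SIZE));
    printf("standby replay is still at:   segment %lu\n",
           (unsigned long) (applyPtr / SEG_SIZE));
    printf("segments recyclable but unreplayed: %lu\n",
           (unsigned long) ((restart_lsn - applyPtr) / SEG_SIZE));
    return 0;
}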

With this patch:

Starting the WAL receiver early (say, at the consistency point) allows us to
prefetch records earlier in the redo loop instead of waiting until the
locally available WAL is exhausted.

Even if the WAL receiver had not started early, those WAL segments would have
been recycled anyway, since the restart_lsn would have advanced. Therefore,
the behaviour on record corruption is unchanged; the benefit of this patch is
reduced replay lag.
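
For clarity, here is a minimal standalone sketch of the intended start
condition (function and enum names are hypothetical, not taken from the
patch): instead of waiting until locally available WAL is exhausted, the
walreceiver can be launched once the chosen trigger (startup or consistency)
is reached.

/*
 * Conceptual sketch only (hypothetical names): with early start, the
 * walreceiver is launched at the configured trigger point while the
 * startup process is still replaying local WAL.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum { START_EXHAUST, START_STARTUP, START_CONSISTENCY } WalRcvStartPoint;

static bool walreceiver_running = false;

static void
maybe_start_walreceiver(WalRcvStartPoint mode,
                        bool reached_consistency,
                        bool local_wal_exhausted)
{
    if (walreceiver_running)
        return;

    switch (mode)
    {
        case START_STARTUP:
            walreceiver_running = true;        /* start immediately */
            break;
        case START_CONSISTENCY:
            if (reached_consistency)
                walreceiver_running = true;    /* start once consistent */
            break;
        case START_EXHAUST:
        default:
            if (local_wal_exhausted)
                walreceiver_running = true;    /* current behaviour */
            break;
    }
}

int
main(void)
{
    /* Simulated redo loop: consistency is reached long before local WAL runs out. */
    for (int rec = 0; rec < 100; rec++)
    {
        bool reached_consistency = (rec >= 10);
        bool local_wal_exhausted = (rec >= 99);

        maybe_start_walreceiver(START_CONSISTENCY,
                                reached_consistency, local_wal_exhausted);
        if (walreceiver_running)
        {
            printf("walreceiver started at record %d, before local WAL was exhausted\n",
                   rec);
            break;
        }
    }
    return 0;
}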

- Reduces replay lag when recovery_min_apply_delay is large, as reported in
  https://www.postgresql.org/message-id/201901301432.p3utg64hum27%40alvherre.pgsql
  [2].
- Mitigates delay for standbys lagging due to network bandwidth or latency,
  or slow disk writes (HDD).
- Faster recovery.
- Currently, until the WAL receiver is started, the commit acknowledgement
  for a waiting transaction is not sent, since the WAL receiver is not
  running. With this change, the waiting transaction gets unblocked as soon
  as we apply the record.

Under normal conditions as well, the slot is advanced based on flushPtr, even
if the mode is remote_apply. We also fixed a corruption scenario for a
continuation record at the end of the last locally available segment:
previously we started streaming at the last stage/corrupt record (such as a
continuation record [1]), but now we start much earlier.

If the WAL records are still retained on the primary, we can restart the WAL
receiver from the older LSN when a corrupt record is hit, which would be an
older LSN than the one where early streaming started. The same mechanism is
already used on the standby when switching between WAL sources. I don't see
any scenario where the new workflow would break existing behavior. Could you
point out the specific case you're concerned about? Understanding that will
help us refine the implementation.
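
A minimal standalone sketch of that fallback idea (names are hypothetical,
not the actual recovery code): when a record from one source fails, recovery
switches to the other source and re-requests streaming from the failing
record's LSN.

/*
 * Conceptual sketch only (hypothetical names): on a failure to read or
 * validate a record from one WAL source, fall back to the other source
 * and restart streaming from the LSN of the failing record.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef enum { SOURCE_ARCHIVE_OR_PG_WAL, SOURCE_STREAM } WalSource;

static WalSource
switch_source_on_failure(WalSource failed_source, XLogRecPtr failed_lsn,
                         XLogRecPtr *restart_stream_at)
{
    if (failed_source == SOURCE_STREAM)
        return SOURCE_ARCHIVE_OR_PG_WAL;      /* try the local/archive copy */

    /* Local record was bad: restart streaming from the failing record. */
    *restart_stream_at = failed_lsn;
    return SOURCE_STREAM;
}

int
main(void)
{
    XLogRecPtr restart_at = 0;
    XLogRecPtr bad_record_lsn = 4200;

    WalSource next = switch_source_on_failure(SOURCE_ARCHIVE_OR_PG_WAL,
                                              bad_record_lsn, &restart_at);
    printf("next source: %s, restart streaming at LSN %lu\n",
           next == SOURCE_STREAM ? "stream" : "archive/pg_wal",
           (unsigned long) restart_at);
    return 0;
}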

> while a transaction waiting for synchronous replication may have already
> been acknowledged as committed.
> Wouldn't that lead to a serious problem?

Without the patch:

If the synchronous replication mode is flush ('on'), then even with
recovery_min_apply_delay set to a large value (e.g., 2 hours), the
transaction is acknowledged as committed before the record is actually
applied on the standby.

If the mode is remote_apply, the primary waits until the record is applied
on the standby, which includes waiting for the configured recovery delay.

With the patch:

The behavior with respect to synchronous_commit remains the same: it still
depends on whether the mode is flush or remote_apply.

So a similar situation can already be seen when recovery_min_apply_delay is
set to a large value (e.g., 2 hours), or in a slow-apply situation where all
the WAL files are streamed but not yet replayed.
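
To make the comparison concrete, here is a minimal standalone sketch (not the
actual syncrep code) of which LSN the commit acknowledgement waits on in each
mode; the LSN values are made up.

/*
 * Conceptual sketch only: the commit acknowledgement waits on the
 * standby's flush LSN when synchronous_commit is 'on', but on the apply
 * LSN when it is 'remote_apply'; only the latter is held back by
 * recovery_min_apply_delay or slow replay.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef enum { SYNC_FLUSH, SYNC_REMOTE_APPLY } SyncMode;

static bool
commit_can_be_acknowledged(SyncMode mode, XLogRecPtr commit_lsn,
                           XLogRecPtr standby_flush, XLogRecPtr standby_apply)
{
    if (mode == SYNC_FLUSH)
        return standby_flush >= commit_lsn;   /* replay lag does not matter */
    return standby_apply >= commit_lsn;       /* held back by apply delay */
}

int
main(void)
{
    XLogRecPtr commit_lsn = 5000;
    XLogRecPtr standby_flush = 9000;   /* already streamed and flushed */
    XLogRecPtr standby_apply = 1000;   /* replay delayed, e.g. 2h apply delay */

    printf("flush mode ack:        %d\n",
           commit_can_be_acknowledged(SYNC_FLUSH, commit_lsn,
                                      standby_flush, standby_apply));
    printf("remote_apply mode ack: %d\n",
           commit_can_be_acknowledged(SYNC_REMOTE_APPLY, commit_lsn,
                                      standby_flush, standby_apply));
    return 0;
}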

AFAIU this patch doesn't introduce any new behavior. In a normal situation
where the WAL receiver is continuously streaming, we would have received
those WAL segments anyway, without waiting for replay to finish, right?

The only difference is that we are initiating the WAL receiver earlier in the
recovery loop, which benefits us in several ways. On systems where replay is
slow due to low-powered hardware, limited system resources, low network
bandwidth, or slow disk writes (HDD), the standby lags behind the primary.

Prefetching the WAL records early avoids building up more WAL on the primary,
which reduces the risk of running out of disk space and also gives us faster
standby recovery. Faster recovery means faster application availability and
lower downtime when synchronous commit is enabled.

> src/test/recovery/t/050_archive_enabled_standby.pl is missing the
> ending newline. Is that intentional?
Thanks for reporting. Fixed in the new rebased patch.

Reference:
[1]
https://github.com/postgres/postgres/commit/0668719801838aa6a8bda330ff9b3d20097ea844
[2]
https://www.postgresql.org/message-id/201901301432.p3utg64hum27%40alvherre.pgsql

Thanks & Regards,
Sunil S

Attachment Content-Type Size
v9-0001-Introduce-feature-to-start-WAL-receiver-eagerly.patch application/octet-stream 15.5 KB
v9-0002-Test-WAL-receiver-early-start-upon-reaching-consi.patch application/octet-stream 4.7 KB
v9-0003-Test-archive-recovery-takes-precedence-over-strea.patch application/octet-stream 4.2 KB
