Unnecessary delay in streaming replication due to replay lag

From: Asim R P <apraveen(at)pivotal(dot)io>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Cc: Hao Wu <hawu(at)pivotal(dot)io>
Subject: Unnecessary delay in streaming replication due to replay lag
Date: 2020-01-17 04:04:05
Message-ID: CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ@mail.gmail.com
Lists: pgsql-hackers

Hi

A standby does not start the walreceiver process until the startup
process finishes WAL replay. The more WAL there is to replay, the
longer the delay in starting streaming replication. If the replication
connection is temporarily disconnected, this delay becomes a major
problem, and we are proposing a solution to avoid it.

WAL replay is likely to fall behind when the master is processing a
write-heavy workload, because WAL is generated by concurrently running
backends on the master while a single startup process on the standby
replays WAL records in sequence as new WAL is received from the master.
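
As a rough illustration (not part of the patchset), replay lag can be
expressed in bytes by comparing the last received and last replayed WAL
positions on the standby, for example as reported by
pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn(). The sketch below
assumes LSNs in their usual textual form of two hexadecimal numbers
separated by a slash; the helper names and sample values are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def replay_lag_bytes(receive_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has received but not yet replayed."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)


# Hypothetical sample values, not taken from a real system.
lag = replay_lag_bytes("1/10000000", "0/F0000000")
print(f"replay lag: {lag // (1024 * 1024)} MiB")
```

The larger this number is when the replication connection drops, the
longer the startup process needs before walreceiver is started again.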

The replication connection between walsender and walreceiver may break
for reasons such as a transient network issue, the standby going
through a restart, etc. The delay in resuming the replication
connection leads to a lack of high availability: only one copy of WAL
is available during this period.

The problem worsens when replication is configured to be
synchronous. Commits on the master must wait until WAL replay is
finished on the standby; only then is walreceiver started, and only
then can it confirm flush of WAL up to the commit LSN. If the
synchronous_commit GUC is set to remote_write, this behavior is
equivalent to tacitly changing it to remote_apply until the
replication connection is re-established!
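
A back-of-the-envelope model of the resulting commit stall, offered
only as a sketch: it assumes replay proceeds at a constant rate and
that reconnecting takes a fixed time. The function name, parameters and
numbers are all hypothetical.

```python
def commit_stall_seconds(backlog_bytes: int,
                         replay_bytes_per_sec: float,
                         reconnect_seconds: float,
                         wait_for_replay: bool) -> float:
    """Estimate how long a synchronous commit on the master is blocked.

    wait_for_replay=True models the behavior described above: walreceiver
    is started only after the startup process has replayed the backlog.
    wait_for_replay=False models starting walreceiver right away.
    """
    replay_time = backlog_bytes / replay_bytes_per_sec if wait_for_replay else 0.0
    return replay_time + reconnect_seconds


MiB = 1024 * 1024
GiB = 1024 * MiB

# Assumed figures: 8 GiB of unreplayed WAL, 50 MiB/s replay, 1 s to reconnect.
blocked_on_replay = commit_stall_seconds(8 * GiB, 50 * MiB, 1.0, wait_for_replay=True)
receiver_first = commit_stall_seconds(8 * GiB, 50 * MiB, 1.0, wait_for_replay=False)
print(f"waiting for replay: {blocked_on_replay:.1f} s, receiver first: {receiver_first:.1f} s")
```

In this simplified model the stall grows linearly with the replay
backlog when walreceiver has to wait for replay, and is independent of
the backlog when it does not.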

Has anyone encountered such a problem with streaming replication?

We propose to address this by starting walreceiver without waiting for
the startup process to finish replaying WAL. Please see the attached
patchset. It can be summarized as follows:

0001 - TAP test to demonstrate the problem.

0002 - The standby startup sequence is changed such that
walreceiver is started by the startup process before it begins
to replay WAL.

0003 - Postmaster starts walreceiver if it finds that a
walreceiver process is no longer running and the state
indicates that it is operating as a standby.
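
The change in ordering described by 0002, and the check described by
0003, can be pictured with a toy model. This is not the PostgreSQL code
and not the patch itself; the function names and event strings are
invented for illustration.

```python
from typing import List


def standby_startup(pending_records: List[str], start_receiver_first: bool) -> List[str]:
    """Return the order of events during standby startup in a toy model.

    start_receiver_first=False: walreceiver starts only after replay finishes.
    start_receiver_first=True:  walreceiver starts before replay begins (idea of 0002).
    """
    events = []
    if start_receiver_first:
        events.append("start walreceiver")
    for record in pending_records:
        events.append(f"replay {record}")
    if not start_receiver_first:
        events.append("start walreceiver")
    return events


def postmaster_should_start_walreceiver(is_standby: bool, walreceiver_running: bool) -> bool:
    """Toy version of the 0003 check: restart walreceiver when it is missing on a standby."""
    return is_standby and not walreceiver_running


backlog = ["rec1", "rec2", "rec3"]
print(standby_startup(backlog, start_receiver_first=False))
print(standby_startup(backlog, start_receiver_first=True))
```

With the receiver started first, the position of "start walreceiver" in
the event list no longer depends on how many records are pending.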

This is a POC; we are looking for early feedback on whether the
problem is worth solving and whether it makes sense to solve it along
this route.

Hao and Asim

Attachment                                                        Content-Type              Size
0001-Test-that-replay-of-WAL-logs-on-standby-does-not-aff.patch   application/octet-stream  9.2 KB
0003-Start-WAL-receiver-when-it-is-found-not-running.patch        application/octet-stream  6.2 KB
0002-Start-WAL-receiver-before-startup-process-replays-ex.patch   application/octet-stream  11.6 KB
