Re: Unnecessary delay in streaming replication due to replay lag

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Asim R P <apraveen(at)pivotal(dot)io>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Hao Wu <hawu(at)pivotal(dot)io>
Subject: Re: Unnecessary delay in streaming replication due to replay lag
Date: 2020-01-17 05:37:56
Message-ID: 20200117053756.GI2127@paquier.xyz
Lists: pgsql-hackers

On Fri, Jan 17, 2020 at 09:34:05AM +0530, Asim R P wrote:
> Standby does not start walreceiver process until startup process
> finishes WAL replay. The more WAL there is to replay, longer is the
> delay in starting streaming replication. If replication connection is
> temporarily disconnected, this delay becomes a major problem and we
> are proposing a solution to avoid the delay.

Yeah, that's documented:
https://www.postgresql.org/message-id/20190910062325.GD11737@paquier.xyz

> We propose to address this by starting walreceiver without waiting for
> startup process to finish replay of WAL. Please see attached
> patchset. It can be summarized as follows:
>
> 0001 - TAP test to demonstrate the problem.

There is no real need for debug_replay_delay because we already have
recovery_min_apply_delay, no? That would count only after consistency
has been reached, and only for COMMIT records, but that would be
enough for your test.

> 0002 - The standby startup sequence is changed such that
> walreceiver is started by startup process before it begins
> to replay WAL.

See below.

> 0003 - Postmaster starts walreceiver if it finds that a
> walreceiver process is no longer running and the state
> indicates that it is operating as a standby.

I have not checked in detail, but I smell some race conditions
between the postmaster and the startup process here.

> This is a POC, we are looking for early feedback on whether the
> problem is worth solving and if it makes sense to solve it along this
> route.

You are not the first person interested in this problem. There is a
patch registered in this CF to control when a WAL receiver is started
during recovery:
https://commitfest.postgresql.org/26/1995/
https://www.postgresql.org/message-id/b271715f-f945-35b0-d1f5-c9de3e56f65e@postgrespro.ru

I am pretty sure that we should not change the default behavior, which
starts the WAL receiver only after replaying everything from the
archives so as to avoid copying some WAL segments for nothing. Being
able to use a GUC switch should be the way to go, and Konstantin's
latest patch was using this approach. Your patch 0002 visibly adds a
third mode, start immediately, on top of the two already proposed:
- Start after replaying all WAL available locally and in the
archives.
- Start after reaching a consistent point.
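
To make the three modes concrete, here is a rough Python model of the
decision a switch like this would encode. It is only a sketch: the
enum, its member names and the function are hypothetical and do not
come from either patch, and the real logic would live in the C code of
the startup process.

```python
from enum import Enum


class WalReceiverStartMode(Enum):
    """Hypothetical values for a GUC controlling WAL receiver startup."""
    AFTER_REPLAY = "after_replay"      # default: local + archived WAL replayed first
    AT_CONSISTENCY = "at_consistency"  # once a consistent point is reached
    AT_STARTUP = "at_startup"          # immediately, as in patch 0002


def should_start_walreceiver(mode, reached_consistency, wal_exhausted):
    """Return True if the standby may start streaming in the given state.

    reached_consistency: recovery has reached a consistent point.
    wal_exhausted: all WAL available locally and in the archives is replayed.
    """
    if mode is WalReceiverStartMode.AT_STARTUP:
        return True
    if mode is WalReceiverStartMode.AT_CONSISTENCY:
        # Running out of WAL still triggers streaming, as with the default.
        return reached_consistency or wal_exhausted
    return wal_exhausted
```

With the default mode, a standby that is consistent but still has WAL
left to replay does not stream yet, which is the delay reported
upthread. The other two modes trade that delay for possibly copying
WAL segments that the archives could have provided.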
--
Michael
