Re: Measuring replay lag

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2016-12-20 03:51:41
Message-ID: CAEepm=340OrYmhLiLKNyiZxbiuzqkN5+ywYAMTA-Uhp4OECk6g@mail.gmail.com

On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 26 October 2016 at 11:34, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
>> It works by taking advantage of the { time, end-of-WAL } samples that
>> sending servers already include in message headers to standbys. That
>> seems to provide a pretty good proxy for when the WAL was written, if
>> you ignore messages where the LSN hasn't advanced. The patch
>> introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
>> control how often the standby should record these timestamped LSNs
>> into a small circular buffer. When its recovery process eventually
>> replays a timestamped LSN, the timestamp is sent back to the upstream
>> server in a new reply message field. The value visible in
>> pg_stat_replication.replay_lag can then be updated by comparing with
>> the current time.
>
> Why not just send back the lag as calculated by max_standby_streaming_delay?
> I.e. at the end of replay of each chunk record the current delay in
> shmem, then send it back periodically.
>
> If we have two methods of calculation it would be confusing.

Hmm. If I understand correctly, GetStandbyLimitTime is measuring
something a bit different: it computes how long it has been since the
recovery process received the chunk that it's currently trying to
apply, most interestingly in the case where we are waiting due to
conflicts. It doesn't include time in walsender, on the network, in
walreceiver, or writing and flushing and reading before it arrives in
the recovery process. If I'm reading it correctly, it only updates
XLogReceiptTime when it's completely caught up applying all earlier
chunks, so when it falls behind, its measure of lag has a growing-only
phase and a reset that can only be triggered by catching up to the
latest chunk. That seems OK for its intended purpose of putting a cap
on the delay introduced by conflicts. But that's not what I'm trying
to provide here.
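
To spell out the behaviour I'm describing, it boils down to roughly
this (a paraphrase only, not the actual xlog.c/standby.c source; names
and units are approximate):

/* Paraphrase only -- not the actual xlog.c/standby.c source. */
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;    /* microseconds */

static TimestampTz XLogReceiptTime;

/*
 * The receipt timestamp only advances when the record about to be
 * replayed lies in the most recently received chunk, i.e. when replay
 * has caught up.  While replay is behind, it stays put, so the implied
 * delay (now - XLogReceiptTime) can only grow until we catch up.
 */
static void
maybe_update_receipt_time(XLogRecPtr record, XLogRecPtr latest_chunk_start,
                          TimestampTz now)
{
    if (latest_chunk_start <= record)
        XLogReceiptTime = now;
}

/* Conflicting standby queries may be cancelled once now exceeds this. */
static TimestampTz
standby_limit_time(TimestampTz max_standby_streaming_delay_us)
{
    return XLogReceiptTime + max_standby_streaming_delay_us;
}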

The purpose of this proposal is to show the replay_lag as judged by
the sending server: in the case of a primary server, that is an
indication of how long commits done here will take to show up for
users over there, and how long a COMMIT with synchronous_commit =
remote_apply will take to come back. It measures the WAL's whole
journey, and does so in a
smooth way that shows accurate information even if the standby never
quite catches up during long periods.
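
Put another way, the value the sending server reports is just a
subtraction against its own clock, which is why it covers every leg of
the trip including the reply coming back (sketch only; these names are
invented, not the patch's):

/* Sketch only; these names are invented. */
#include <stdint.h>

typedef int64_t TimestampTz;    /* microseconds */

/*
 * Walsender side: the standby's reply echoes the timestamp that this
 * server originally attached to WAL the standby has now applied, so
 * subtracting it from our own clock captures the whole journey: send,
 * network, write/flush, replay, and the reply's trip back.
 */
static TimestampTz
compute_replay_lag(TimestampTz echoed_sample_time, TimestampTz now)
{
    return now - echoed_sample_time;
}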

Example 1: Suppose I have two servers right next to each other, and the
primary server has periods of high activity which exceed the standby's
replay rate, perhaps because of slower/busier hardware, or because of
conflicts with other queries, or because our single-core 'REDO' can't
always keep up with multi-core 'DO'. By the logic of
max_standby_streaming_delay, if it never catches up to the latest
chunk but remains a fluctuating number of chunks behind, then AIUI the
standby will compute a constantly increasing lag. By my logic, the
primary will tell you quite accurately how far behind the standby's
recovery is at regular intervals, showing replay_lag fluctuating up
and down as appropriate, even if it never quite catches up. It can do
that because it has a buffer full of regularly spaced out samples to
work through, and even if you exceed the buffer size (8192 seconds'
worth by default) it'll just increase the interval between samples.
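
For concreteness, the standby-side sampling is along these lines (a
sketch with invented names and a flat array instead of the patch's
circular buffer; the real overflow handling may differ in detail):

/*
 * Standby-side sketch only: invented names, and a flat array rather
 * than the patch's circular buffer.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LAG_SAMPLE_BUFFER_SIZE 8192

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;            /* microseconds */

typedef struct LagSample
{
    XLogRecPtr  lsn;                    /* end-of-WAL from a message header */
    TimestampTz sender_time;            /* timestamp from the same header */
} LagSample;

static LagSample samples[LAG_SAMPLE_BUFFER_SIZE];
static int      nsamples;
static TimestampTz sample_interval = 1000000;   /* cf. replay_lag_sample_interval */
static TimestampTz last_sample_time;
static XLogRecPtr last_sample_lsn;

/* Walreceiver side: consider recording the sample from one message header. */
static void
maybe_record_sample(XLogRecPtr end_of_wal, TimestampTz sender_time)
{
    if (end_of_wal <= last_sample_lsn)
        return;                 /* ignore messages where the LSN hasn't advanced */
    if (sender_time - last_sample_time < sample_interval)
        return;                 /* record at most one sample per interval */

    if (nsamples == LAG_SAMPLE_BUFFER_SIZE)
    {
        /* Buffer full: keep every second sample and sample half as often. */
        int         i;

        for (i = 0; i < nsamples / 2; i++)
            samples[i] = samples[i * 2];
        nsamples /= 2;
        sample_interval *= 2;
    }

    samples[nsamples].lsn = end_of_wal;
    samples[nsamples].sender_time = sender_time;
    nsamples++;
    last_sample_time = sender_time;
    last_sample_lsn = end_of_wal;
}

/*
 * Recovery side: once replay passes a sample's LSN, its sender_time is what
 * gets reported back to the upstream server in the next reply message, and
 * the upstream sets replay_lag = its current time minus that timestamp.
 */
static bool
consume_applied_samples(XLogRecPtr applied_lsn, TimestampTz *echo_time)
{
    int         consumed = 0;

    while (consumed < nsamples && samples[consumed].lsn <= applied_lsn)
        *echo_time = samples[consumed++].sender_time;

    if (consumed == 0)
        return false;

    memmove(samples, samples + consumed,
            (nsamples - consumed) * sizeof(LagSample));
    nsamples -= consumed;
    return true;
}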

Example 2: Suppose I have servers on opposite sides of the world with
a ping time of 300ms. By the logic used for
max_standby_streaming_delay, the lag computed by the standby would be
close to zero when there is no concurrent activity to conflict with.
I don't think that's what users other than the recovery-conflict
resolution code want to know. By my logic, replay_lag computed by the
primary would show 300ms + a tiny bit more, which is how long it takes
for committed transactions to be visible to user queries on the
standby and for us to know that that is the case. That's interesting
because it tells you how long synchronous_commit = remote_apply would
make you wait (if that server is waited for according to syncrep
config).

In summary, the max_standby_streaming_delay approach only measures
activity inside the recovery process on the standby, and only uses a
single variable for timestamp tracking, so although it's semi-related
it's not what I wanted to show.

(I suppose there might be an argument that max_standby_streaming_delay
should also track received-on-standby-time for each sampled LSN in a
circular buffer, and then use that information to implement
max_standby_streaming_delay more fairly. We would then only need to
cancel queries that conflict with WAL records that have truly been
waiting max_standby_streaming_delay since their receive time, instead
of doing what we do today: as soon as max_standby_streaming_delay is
exceeded while trying to apply *any* WAL record, cancel everything
that conflicts with recovery until we're caught up to the latest
chunk. This may not make
any sense or be worth doing, just an idea...)
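
In code, that idea might look something like this (purely
hypothetical; none of these names exist today):

/* Hypothetical sketch of that idea; none of these names exist today. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;    /* microseconds */

typedef struct ReceiptSample
{
    XLogRecPtr  lsn;            /* end of a received chunk */
    TimestampTz received_time;  /* when the standby received it */
} ReceiptSample;

/*
 * Instead of a single XLogReceiptTime, look up when the WAL record we're
 * blocked on actually arrived, and only cancel conflicting queries once
 * *that* has been waiting longer than max_standby_streaming_delay.
 */
static bool
should_cancel_conflicts(const ReceiptSample *samples, int nsamples,
                        XLogRecPtr blocked_record, TimestampTz now,
                        TimestampTz max_delay_us)
{
    int         i;

    /* Samples are ordered by LSN; find the one covering the record. */
    for (i = 0; i < nsamples; i++)
    {
        if (samples[i].lsn >= blocked_record)
            return (now - samples[i].received_time) > max_delay_us;
    }

    return false;               /* no covering sample; don't cancel on this basis */
}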

> Admittedly the approach here is the same one I advocated some years
> back when Robert and I were discussing time delayed standbys.

That is reassuring!

--
Thomas Munro
http://www.enterprisedb.com
