Re: Measuring replay lag

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-03-15 23:07:36
Message-ID: CANP8+jLfQw2E+r3jnDLvugupZSadKUZUoVC6MVXzAbRBXpF12Q@mail.gmail.com
Lists: pgsql-hackers

On 14 March 2017 at 07:39, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Hi,
>
> Please see separate replies to Simon and Craig below.
>
> On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 1 March 2017 at 10:47, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> I do see why a new user trying this feature for the first time might
>>> expect it to show a lag of 0 just as soon as sent LSN =
>>> write/flush/apply LSN or something like that, but after some
>>> reflection I suspect that it isn't useful information, and it would be
>>> smoke and mirrors rather than real data.
>>
>> Perhaps I am misunderstanding the way it works.
>>
>> If the last time WAL was generated the lag was 14 secs, then nothing
>> occurs for 2 hours afterwards AND all changes have been successfully
>> applied then it should not continue to show 14 secs for the next 2
>> hours.
>>
>> IMHO the lag time should drop to zero in a reasonable time and stay at
>> zero for those 2 hours because there is no current lag.
>>
>> If we want to show historical lag data, I'm supportive of the idea,
>> but we must report an accurate current value when the system is busy
>> and when the system is quiet.
>
> Ok, I thought about this for a bit and have a new idea that I hope
> will be more acceptable. Here are the approaches considered:
>
> 1. Show the last measured lag times on a completely idle system until
> such time as the standby eventually processes more WAL, as I had it in
> the v5 patch. You don't like that and I admit that it is not really
> satisfying (even though I know that real Postgres systems always
> generate more WAL fairly soon even without user sessions, it's not
> great that it depends on an unknown future event to clear the old
> data).
>
> 2. Recognise when the last reported write/flush/apply LSN from the
> standby == end of WAL on the sending server, and show lag times of
> 00:00:00 in all three columns. I consider this entirely bogus: it's
> not an actual measurement that was ever made, and on an active system
> it would flip-flop between real measurements and the artificial
> 00:00:00. I do not like this.

There are two ways of knowing the lag: 1) by measurement/sampling,
which is the main way this patch approaches this, and 2) by direct
observation that the LSNs match. Both are equally valid ways of
establishing knowledge. Strangely, (2) is the only one of those that is
actually precise, and yet you say it is bogus. It is actually the
measurements which are approximations of the actual state.

The reality is that the lag can change discontinuously between zero
and non-zero. I don't think we should hide that from people.

I suspect that your "entirely bogus" feeling comes from the point that
we actually have 3 states, one of which has unknown lag.

A) "Currently caught-up"
WALSender LSN == WALReceiver LSN (direct observation, info type (2))
At this point the current lag is known precisely to be zero.

B) "Work outstanding, no reply yet"
Immediately after the point where WALSender LSN > WALReceiver LSN, when
we haven't yet received a new reply.
We expect to stay in this state for however long it takes to receive a
reply, which could be wal_receiver_status_interval or longer if the
lag is greater. At this point we have no measurement of what the lag
is. We could reply NULL since we don't know. We could reply with the
last measured lag from when we were last in state C, but if the new
reply is delayed for longer than that, we'd need to report that the lag
is at least as high as the delay since we last left state A.

C) "Continuous flow"
WALSender LSN > WALReceiver LSN and we have received a reply
(measurement, info type (1))
This is the main case. Easy-ish!

So I think we need to first agree that states A and B exist and how to
report lag in each state.
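
To make the three states concrete, here is a rough sketch in Python. It
is only an illustration and not code from the patch: the names LagState,
Snapshot, classify and report_lag are hypothetical, LSNs are simplified
to plain integers, and the state B policy shown (last measured lag,
raised to the time elapsed since leaving state A, or NULL/None when
there is no earlier measurement) is just one of the options mentioned
above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LagState(Enum):
    CAUGHT_UP = "A"        # sender LSN == receiver LSN
    AWAITING_REPLY = "B"   # work outstanding, no reply yet
    CONTINUOUS_FLOW = "C"  # work outstanding, reply received


@dataclass
class Snapshot:
    """Hypothetical view of what the sender knows at one instant."""
    sender_lsn: int                     # end of WAL on the sending server
    receiver_lsn: int                   # last LSN reported by the standby
    reply_since_left_a: bool            # reply received since leaving state A?
    last_measured_lag: Optional[float]  # seconds, from the most recent measurement
    secs_since_left_a: float            # seconds since the sender moved ahead


def classify(s: Snapshot) -> LagState:
    if s.sender_lsn == s.receiver_lsn:
        return LagState.CAUGHT_UP
    if not s.reply_since_left_a:
        return LagState.AWAITING_REPLY
    return LagState.CONTINUOUS_FLOW


def report_lag(s: Snapshot) -> Optional[float]:
    """Return lag in seconds, or None when it is unknown."""
    state = classify(s)
    if state is LagState.CAUGHT_UP:
        # Direct observation: the LSNs match, so the lag is exactly zero.
        return 0.0
    if state is LagState.CONTINUOUS_FLOW:
        # Sampled measurement taken from the latest reply.
        return s.last_measured_lag
    # State B: no fresh measurement is available.
    if s.last_measured_lag is None:
        return None
    # Assumed policy: never report less than the time already spent waiting.
    return max(s.last_measured_lag, s.secs_since_left_a)
```

With this sketch, the idle case from earlier in the thread (last lag of
14 secs, then everything applied) is classified as state A and reports
zero rather than continuing to show 14 secs.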

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
