Re: Measuring replay lag

From: Ian Barwick <ian(dot)barwick(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-03-15 07:15:51
Message-ID: 19941ed0-c969-a119-0ed6-238f82a8341d@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

Just adding a couple of thoughts on this.

On 03/14/2017 08:39 AM, Thomas Munro wrote:
> Hi,
>
> Please see separate replies to Simon and Craig below.
>
> On Sun, Mar 5, 2017 at 8:38 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 1 March 2017 at 10:47, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> I do see why a new user trying this feature for the first time might
>>> expect it to show a lag of 0 just as soon as sent LSN =
>>> write/flush/apply LSN or something like that, but after some
>>> reflection I suspect that it isn't useful information, and it would be
>>> smoke and mirrors rather than real data.
>>
>> Perhaps I am misunderstanding the way it works.
>>
>> If the last time WAL was generated the lag was 14 secs, then nothing
>> occurs for 2 hours aftwards AND all changes have been successfully
>> applied then it should not continue to show 14 secs for the next 2
>> hours.
>>
>> IMHO the lag time should drop to zero in a reasonable time and stay at
>> zero for those 2 hours because there is no current lag.
>>
>> If we want to show historical lag data, I'm supportive of the idea,
>> but we must report an accurate current value when the system is busy
>> and when the system is quiet.
>
> Ok, I thought about this for a bit and have a new idea that I hope
> will be more acceptable. Here are the approaches considered:

(...)
> 2. Recognise when the last reported write/flush/apply LSN from the
> standby == end of WAL on the sending server, and show lag times of
> 00:00:00 in all three columns. I consider this entirely bogus: it's
> not an actual measurement that was ever made, and on an active system
> it would flip-flop between real measurements and the artificial
> 00:00:00. I do not like this.

I agree with this; while initially I was expecting to see 00:00:00,
SQL NULL is definitely correct here. Anyone writing tools etc. which need to
report an actual interval can convert this to 00:00:00 easily enough .

(...)

> 5. The new proposal: Show only true measured write/flush/apply data,
> as in 1, but with a time limit. To avoid the scenario where we show
> the same times during prolonged periods of idleness, clear the numbers
> like in option 3 after a period of idleness. This way we avoid the
> dreaded flickering/flip-flopping. A natural time to do that is when
> wal_receiver_status_interval expires on idle systems and defaults to
> 10 seconds.
>
> Done using approach 5 in the attached version. Do you think this is a
> good compromise? No bogus numbers, only true measured
> write/flush/apply times, but a time limit on 'stale' lag information.

This makes sense to me. I'd also add that while on production servers
it's likely there'll be enough activity to keep the columns updated,
on a quiescent test/development systems seeing a stale value looks plain
wrong (and will cause no end of questions from people asking why lag
is still showing when their system isn't doing anything).

I suggest the documentation of these columns needs to be extended to mention
that they will be NULL if no lag was measured recently, and to explain
the circumstances in which the numbers are cleared.

Regards

Ian Barwick

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Ian Barwick http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Massimo Fidanza 2017-03-15 07:20:48 Re: New procedural language
Previous Message Tsunakawa, Takayuki 2017-03-15 07:11:47 Re: Defaulting psql to ON_ERROR_ROLLBACK=interactive