Re: Measuring replay lag

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-03-01 10:47:49
Lists: pgsql-hackers

On Fri, Feb 24, 2017 at 9:05 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 21 February 2017 at 21:38, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> However, I think a call like LagTrackerWrite(SendRqstPtr,
>> GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
>> what happens in XLogSendPhysical. I'm not sure about that.
> Me neither, but I think we need this for both physical and logical.
> Same use cases graphs for both, I think. There might be issues with
> the way LSNs work for logical.

This seems to be problematic. Logical peers report LSN changes for
all three operations (write, flush, commit) only on commit. I suppose
that might work OK for synchronous replication, but it makes it a bit
difficult to get lag measurements that don't look really strange and
sawtoothy when you have long transactions, and overlapping
transactions might interfere with the measurements in odd ways. I
wonder if the way LSNs are reported by logical rep would need to be
changed first. I need to study this some more and would be grateful
for ideas from any of the logical rep people.
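For illustration only, the sender-side lag tracking under discussion can be
modeled roughly like this (the names and structure here are a toy sketch,
not the actual patch code): buffer (LSN, send timestamp) pairs as WAL is
sent, and when the standby replies with an LSN, take the elapsed time since
the newest buffered entry at or below it. With commit-only LSN reports from
logical rep, the buffer drains in big jumps, which is one way to picture the
sawtooth effect.

```python
from collections import deque

class LagTracker:
    """Toy model of sender-side lag tracking (illustrative, not the
    real PostgreSQL implementation)."""

    def __init__(self):
        self.buffer = deque()  # (lsn, send_timestamp), in LSN order

    def write(self, lsn, now):
        # Called when WAL up to 'lsn' has been sent to the standby.
        self.buffer.append((lsn, now))

    def read(self, reply_lsn, now):
        # Called when the standby reports progress up to 'reply_lsn'.
        # Lag is measured against the newest buffered entry whose LSN
        # is <= reply_lsn; returns None if there is nothing new to
        # measure (the last known lag would then be carried forward).
        lag = None
        while self.buffer and self.buffer[0][0] <= reply_lsn:
            _, sent_at = self.buffer.popleft()
            lag = now - sent_at
        return lag
```

If replies arrive only at commit time, a long transaction leaves many
entries queued and the first reply measures against an old send timestamp,
giving the sawtooth shape described above.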

>> I added a fourth case 'overwhelm.png' which you might find
>> interesting. It's essentially like one 'burst' followed by a 100% idle
>> primary. The primary stops sending new WAL around 50 seconds in and
>> then there is no autovacuum, nothing happening at all. The standby
>> is still replaying its backlog of WAL, but is sending back
>> replies only every 10 seconds (because no WAL is arriving, so there is
>> no reason to send replies except the status message timeout, which
>> could be lowered). So we see some big steps, and then we finally see
>> it flat-line around 60 seconds because there is still no new WAL so we
>> keep showing the last measured lag. If new WAL is flushed it will pop
>> back to 0ish, but until then its last known measurement is ~14
>> seconds, which I don't think is technically wrong.
> If I understand what you're saying, "14 secs" would not be seen as the
> correct answer by our users when the delay is now zero.
> Solving that is where the keepalives need to come into play. If no new
> WAL, send a keepalive and track the lag on that.

Hmm. Currently it works strictly with measurements of real WAL write,
flush and apply times. I rather like the simplicity of that
definition of the lag numbers, and the fact that they move only as a
result of genuine measured WAL activity. A keepalive message is never
written, flushed or applied, so if we had special cases here to show
constant 0 or measure keepalive round-trip time when we hit the end of
known WAL or something like that, the reported lag times for those
three operations wouldn't be true. In any real database cluster there
is real WAL being generated all the time, so after a big backlog is
finally processed by a standby the "14 secs" won't linger for very
long, and during the time when you see that, it really is the last
true measured lag time.

I do see why a new user trying this feature for the first time might
expect it to show a lag of 0 just as soon as sent LSN =
write/flush/apply LSN or something like that, but after some
reflection I suspect that it isn't useful information, and it would be
smoke and mirrors rather than real data.

Perhaps you are thinking about the implications for alarm/monitoring
systems. If you were worried about this phenomenon you could set your
alarm condition to sent_location != replay_location AND replay_lag >
INTERVAL 'x seconds', but I'm not actually convinced that's necessary:
the worst it could do is prolong an alarm that had been correctly
triggered until some new WAL is observed being processed fast enough.
There is an argument that until you've actually made such an
observation, you don't actually know that the alarm deserves to be
shut off yet: perhaps this way avoids some flip-flopping.

> So looks like we're almost there.

Thanks for the review and ideas!

Here is a new version that is rebased on top of the recent changes
ripping out floating point timestamps. Reading those commits made it
clear to me that I should be using TimeOffset for elapsed times, not
int64, so I changed that.

Thomas Munro

Attachment Content-Type Size
replication-lag-v5.patch application/octet-stream 19.1 KB
