Re: Measuring replay lag

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-02-23 20:05:53
Message-ID: CANP8+jJagXb4g8KGsFs5ajUh72uJx7_Q9JvOc7EHrAMLbu9bvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21 February 2017 at 21:38, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> And happier again, leading me to move to the next stage of review,
>> focusing on the behaviour emerging from the design.
>>
>> So my current understanding is that this doesn't rely upon LSN
>> arithmetic to measure lag, which is good. That means logical
>> replication should "just work" and future mechanisms to filter
>> physical WAL will also just work. This is important, so please comment
>> if you see that isn't the case.
>
> Yes, my understanding (based on
> https://www.postgresql.org/message-id/f453caad-0396-1bdd-c5c1-5094371f4776@2ndquadrant.com
> ) is that this should in principal work for logical replication, it
> just might show the same number in 2 or 3 of the lag columns because
> of the way it reports LSNs.
>
> However, I think a call like LagTrackerWrite(SendRqstPtr,
> GetCurrentTimestamp()) needs to go into XLogSendLogical, to mirror
> what happens in XLogSendPhysical. I'm not sure about that.

Me neither, but I think we need this for both physical and logical.

Same use cases graphs for both, I think. There might be issues with
the way LSNs work for logical.

>> I think what we need to show some test results with the graph of lag
>> over time for these cases:
>> 1. steady state - pgbench on master, so we can see how that responds
>> 2. blocked apply on standby - so we can see how the lag increases but
>> also how the accuracy goes down as the lag increases and whether the
>> reported value changes (depending upon algo)
>> 3. burst mode - where we go from not moving to moving at high speed
>> and then stop again quickly
>> +other use cases you or others add
>
> Good idea. Here are some graphs. This is from a primary/standby pair
> running on my local development machine, so the times are low in the
> good cases. For 1 and 2 I used pgbench TPCB-sort-of. For 3 I used a
> loop that repeatedly dropped and created a huge table, sleeping in
> between.

Thanks, very nice

>> Does the proposed algo work for these cases? What goes wrong with it?
>> It's the examination of these downsides, if any, are the things we
>> need to investigate now to allow this to get committed.
>
> The main problem I discovered was with 2. If replay is paused, then
> the reported LSN completely stops advancing, so replay_lag plateaus.
> When you resume replay, it starts reporting LSNs advancing again and
> suddenly discovers and reports a huge lag because it advances past the
> next sample in the buffer.
>
> I realised that you had suggested the solution to this problem
> already: interpolation. I have added simple linear interpolation that
> checks if there is a future LSN in the buffer, and if so it
> interpolates linearly to synthesise the local flush time of the
> reported LSN, which is somewhere between the last and next sample's
> recorded local flush time. This seems to work well for the
> apply-totally-stopped case.

Good

> I added a fourth case 'overwhelm.png' which you might find
> interesting. It's essentially like one 'burst' followed by a 100% ide
> primary. The primary stops sending new WAL around 50 seconds in and
> then there is no autovacuum, nothing happening at all. The standby
> start is still replaying its backlog of WAL, but is sending back
> replies only every 10 seconds (because no WAL arriving so no other
> reason to send replies except status message timeout, which could be
> lowered). So we see some big steps, and then we finally see it
> flat-line around 60 seconds because there is still now new WAL so we
> keep showing the last measured lag. If new WAL is flushed it will pop
> back to 0ish, but until then its last known measurement is ~14
> seconds, which I don't think is technically wrong.

If I understand what you're saying, "14 secs" would not be seen as the
correct answer by our users when the delay is now zero.

Solving that is where the keepalives need to come into play. If no new
WAL, send a keepalive and track the lag on that.

>> Some minor points on code...
>> Why are things defined in walsender.c and not in .h?
>
> Because they are module-private.

;-) It wasn't a C question.

So looks like we're almost there.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Nasby 2017-02-23 20:26:07 Re: A typo in mcxt.c
Previous Message Bruce Momjian 2017-02-23 20:05:16 bytea_output output of base64