Re: Measuring replay lag

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2016-11-08 01:35:23
Message-ID: CAD21AoAC2BB18ZZ45MA8V=CzqZhHkru71U0hirm8QVqOJL=z-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 26, 2016 at 7:34 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> Hi hackers,
>
> Here is a new version of my patch to add a replay_lag column to the
> pg_stat_replication view (originally proposed as part of a larger
> patch set for 9.6[1]), like this:

Thank you for working on this!

> postgres=# select application_name, replay_lag from pg_stat_replication;
> ┌──────────────────┬─────────────────┐
> │ application_name │ replay_lag │
> ├──────────────────┼─────────────────┤
> │ replica1 │ 00:00:00.595382 │
> │ replica2 │ 00:00:00.598448 │
> │ replica3 │ 00:00:00.541597 │
> │ replica4 │ 00:00:00.551227 │
> └──────────────────┴─────────────────┘
> (4 rows)
>
> It works by taking advantage of the { time, end-of-WAL } samples that
> sending servers already include in message headers to standbys. That
> seems to provide a pretty good proxy for when the WAL was written, if
> you ignore messages where the LSN hasn't advanced. The patch
> introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
> control how often the standby should record these timestamped LSNs
> into a small circular buffer. When its recovery process eventually
> replays a timestamped LSN, the timestamp is sent back to the upstream
> server in a new reply message field. The value visible in
> pg_stat_replication.replay_lag can then be updated by comparing with
> the current time.

replay_lag_sample_interval is 1s by default but I got 1000s by SHOW command.
postgres(1:36789)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
1000s
(1 row)

Also, I set replay_lag_sample_interval = 500ms, I got 0 by SHOW command.
postgres(1:99850)=# select name, setting, applied from
pg_file_settings where name = 'replay_lag_sample_interval';
name | setting | applied
----------------------------+---------+---------
replay_lag_sample_interval | 500ms | t
(1 row)

postgres(1:99850)=# show replay_lag_sample_interval ;
replay_lag_sample_interval
----------------------------
0
(1 row)

> Compared to the usual techniques people use to estimate replay lag,
> this approach has the following advantages:
>
> 1. The lag is measured in time, not LSN difference.
> 2. The lag time is computed using two observations of a single
> server's clock, so there is no clock skew.
> 3. The lag is updated even between commits (during large data loads etc).

I agree with this approach.

> In the previous version I was effectively showing the ping time
> between the servers during idle times when the standby was fully
> caught up because there was nothing happening. I decided that was not
> useful information and that it's more newsworthy and interesting to
> see the estimated replay lag for the most recent real replayed
> activity, so I changed that.
>
> In the last thread[1], Robert Haas wrote:
>> Well, one problem with this is that you can't put a loop inside of a
>> spinlock-protected critical section.
>
> Fixed.
>
>> In general, I think this is a pretty reasonable way of attacking this
>> problem, but I'd say it's significantly under-commented. Where should
>> someone go to get a general overview of this mechanism? The answer is
>> not "at place XXX within the patch". (I think it might merit some
>> more extensive documentation, too, although I'm not exactly sure what
>> that should look like.)
>
> I have added lots of comments.
>
>> When you overflow the buffer, you could thin in out in a smarter way,
>> like by throwing away every other entry instead of the oldest one. I
>> guess you'd need to be careful how you coded that, though, because
>> replaying an entry with a timestamp invalidates some of the saved
>> entries without formally throwing them out.
>
> Done, by overwriting the newest sample rather than the oldest if the
> buffer is full. That seems to give pretty reasonable degradation,
> effectively lowering the sampling rate, without any complicated buffer
> or rate management code.
>
>> Conceivably, 0002 could be split into two patches, one of which
>> computes "stupid replay lag" considering only records that naturally
>> carry timestamps, and a second adding the circular buffer to handle
>> the case where much time passes without finding such a record.
>
> I contemplated this but decided that it'd be best to use ONLY samples
> from walsender headers, and never use the time stamps from commit
> records for this. If we use times from commit records, then a
> cascading sending server will not be able to compute the difference in
> time without introducing clock skew (not to mention the difficulty of
> combining timestamps from two sources if we try to do both). I
> figured that it's better to have value that shows a cascading
> sender->standby->cascading sender round trip time that is free of
> clock skew, than a master->cascading sender->standby->cascading sender
> incomplete round trip that includes clock skew.
>
> By the same reasoning I decided against introducing a new periodic WAL
> record or field from the master to hold extra time stamps in between
> commits to do this, in favour of the buffered transient timestamp
> approach I took in this patch.

I think that you need to change sendFeedback() in pg_recvlogical.c and
receivexlog.c as well.

> That said, I can see there are
> arguments for doing it with extra periodic WAL timestamps, if people
> don't think it'd be too invasive to mess with the WAL for this, and
> don't care about cascading standbys giving skewed readings. One
> advantage would be that persistent WAL timestamps would still be able
> to provide lag estimates if a standby has been down for a while and
> was catching up, and this approach can't until it's caught up due to
> lack of buffered transient timestamps. Thoughts?
>
> I plan to post a new "causal reads" patch at some point which will
> depend on this, but in any case I think this is a useful feature on
> its own. I'd be grateful for any feedback, flames, better ideas etc.
> Thanks for reading.
>
> [1] https://www.postgresql.org/message-id/CAEepm%3D31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w%40mail.gmail.com
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-11-08 01:37:14 Re: Re: BUG #13755: pgwin32_is_service not checking if SECURITY_SERVICE_SID is disabled
Previous Message Michael Paquier 2016-11-08 01:33:28 Re: Re: BUG #13755: pgwin32_is_service not checking if SECURITY_SERVICE_SID is disabled