Re: Measuring replay lag

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-01-04 11:03:10
Message-ID: CAEepm=2J0VVX6wSbGtCPkdM-MenN90aa8WduNabYG6hYRP-CaQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2017 at 8:58 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 3 January 2017 at 23:22, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
>>> I don't see why that would be unacceptable. If we do it for
>>> remote_apply, why not also do it for other modes? Whatever the
>>> reasoning was for remote_apply should work for other modes. I should
>>> add it was originally designed to be that way by me, so must have been
>>> changed later.
>>
>> You can achieve that with this patch by setting
>> replication_lag_sample_interval to 0.
>
> I wonder why you ignore my mention of the bug in the correct mechanism?

I didn't have an opinion on that yet, but looking now I think there is
no bug: I was wrong about the current reply frequency. This comment
above XLogWalRcvSendReply confused me:

* If 'force' is not set, the message is only sent if enough time has
* passed since last status update to reach wal_receiver_status_interval.

Actually it's sent if 'force' is set, or enough time has passed, or
either the write or the flush position has moved. So we're already
sending replies after every write and flush, as you said we should.
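
To make that condition concrete, here is a minimal sketch of the decision
as I read it. It is not the code from walreceiver.c: the struct, the
function name should_send_reply and the millisecond clock are all
hypothetical stand-ins, and it ignores any other early-exit checks the
real function may have.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the receiver's bookkeeping; not PostgreSQL's types. */
typedef struct ReplyState
{
    uint64_t last_write;    /* write position reported in the previous reply */
    uint64_t last_flush;    /* flush position reported in the previous reply */
    int64_t  last_send_ms;  /* time of the previous reply, in milliseconds */
} ReplyState;

/*
 * Decide whether a reply should be sent now.  A reply goes out if any one
 * of these holds: the caller forces it, the write or flush position has
 * moved since the last reply, or the status interval has elapsed.
 */
static bool
should_send_reply(const ReplyState *st,
                  uint64_t write_pos, uint64_t flush_pos,
                  int64_t now_ms, int64_t status_interval_ms,
                  bool force)
{
    if (force)
        return true;

    /* Any movement of write or flush triggers a reply, regardless of time. */
    if (write_pos != st->last_write || flush_pos != st->last_flush)
        return true;

    /* Otherwise fall back to the periodic status update. */
    return (now_ms - st->last_send_ms) >= status_interval_ms;
}
```

The point is only that the time check is the last resort, not the gate
the comment suggests.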

So perhaps I should get rid of that replication_lag_sample_interval
GUC and send back apply timestamps frequently, as you were saying. It
would add up to a third more replies.

The effective sample rate would still be lowered when the fixed-size
buffers fill up and samples have to be dropped, and that'd be more
likely without that GUC. With the GUC, it doesn't start happening
until lag reaches XLOG_TIMESTAMP_BUFFER_SIZE *
replication_lag_sample_interval = ~2 hours with defaults, whereas
without rate limiting you might only need to get
XLOG_TIMESTAMP_BUFFER_SIZE 'w' messages behind before we start
dropping samples. Maybe that's perfectly OK, I'm not sure.
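
A rough sketch of the trade-off, for illustration only. The names
(LagSampleBuffer, record_sample, lag_before_drops_seconds) are
hypothetical, the buffer is deliberately tiny, and rejecting a new sample
when the buffer is full is just one possible drop policy, not necessarily
what the patch does. The 8192-entry, 1-second figures used in the usage
note are assumptions picked because they are consistent with the "~2
hours" above; this message does not state them.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_BUFFER_SIZE 4    /* tiny on purpose; hypothetical value */

typedef struct LagSample
{
    uint64_t lsn;       /* WAL position the sample refers to */
    int64_t  time_ms;   /* when that position was sent */
} LagSample;

typedef struct LagSampleBuffer
{
    LagSample slots[SAMPLE_BUFFER_SIZE];
    int       read_pos;      /* index of the oldest stored sample */
    int       count;         /* number of samples currently stored */
    bool      have_last;     /* has any sample been recorded yet? */
    int64_t   last_time_ms;  /* time of the most recently stored sample */
    int       dropped;       /* samples rejected because the buffer was full */
} LagSampleBuffer;

/*
 * Try to record a sample.  With min_interval_ms > 0 samples are rate
 * limited; with 0 every message produces a sample.  Returns true if the
 * sample was stored.
 */
static bool
record_sample(LagSampleBuffer *buf, uint64_t lsn,
              int64_t now_ms, int64_t min_interval_ms)
{
    /* Rate limiting: ignore samples that arrive too soon after the last one. */
    if (buf->have_last && now_ms - buf->last_time_ms < min_interval_ms)
        return false;

    /* Buffer full: the standby is too far behind, so this sample is lost. */
    if (buf->count == SAMPLE_BUFFER_SIZE)
    {
        buf->dropped++;
        return false;
    }

    int write_pos = (buf->read_pos + buf->count) % SAMPLE_BUFFER_SIZE;

    buf->slots[write_pos].lsn = lsn;
    buf->slots[write_pos].time_ms = now_ms;
    buf->count++;
    buf->have_last = true;
    buf->last_time_ms = now_ms;
    return true;
}

/* With rate limiting, drops cannot start until lag reaches size * interval. */
static int64_t
lag_before_drops_seconds(int64_t buffer_size, int64_t sample_interval_s)
{
    return buffer_size * sample_interval_s;
}
```

With the assumed values, lag_before_drops_seconds(8192, 1) gives 8192
seconds, a bit over two hours. With an interval of 0 the same buffer
fills after SAMPLE_BUFFER_SIZE unconsumed messages, however close
together they arrive.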

--
Thomas Munro
http://www.enterprisedb.com
