| From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> | 
|---|---|
| To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> | 
| Cc: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Measuring replay lag | 
| Date: | 2017-02-14 11:48:41 | 
| Message-ID: | CAEepm=150d57UB82XgksbWwqbJjiydEZK=SPKibOfpHjMy1ovw@mail.gmail.com | 
| Lists: | pgsql-hackers | 
On Wed, Feb 1, 2017 at 5:21 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Sat, Jan 21, 2017 at 10:49 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Ok.  I see that there is a new compelling reason to move the ring
>> buffer to the sender side: then I think lag tracking will work
>> automatically for the new logical replication that just landed on
>> master.  I will try it that way.  Thanks for the feedback!
>
> Seeing no new patches, marked as returned with feedback. Feel free of
> course to refresh the CF entry once you have a new patch!
Here is a new version with the buffer on the sender side as requested.
Since it now shows write, flush and replay lag, not just replay, I
decided to rename it and start counting versions at 1 again.
replication-lag-v1.patch is less than half the size of
replay-lag-v16.patch and considerably simpler.  There is no more GUC
and no more protocol change.
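
To make the sender-side idea concrete, here is a minimal sketch of how such a buffer could work. It is not the code from replication-lag-v1.patch: the class name `LagTracker`, its methods, and the use of an unbounded list instead of a fixed-size ring are all assumptions made for illustration. The sender records a (location, time) sample when it sends WAL, and when the standby later reports a write, flush or replay location, the lag is the time elapsed since the newest sample at or below that location was sent.

```python
class LagTracker:
    """Toy sender-side sample buffer (hypothetical, not the patch's code)."""

    KINDS = ("write", "flush", "replay")

    def __init__(self):
        # A plain list keeps the sketch short; a real ring buffer would be
        # fixed-size and would overwrite or drop samples when full.
        self.samples = []
        # One independent read position per kind of feedback.
        self.read_head = {kind: 0 for kind in self.KINDS}

    def record_sent(self, lsn, now):
        """Remember the time at which WAL up to `lsn` was sent."""
        self.samples.append((lsn, now))

    def lag_for(self, kind, reported_lsn, now):
        """Return the lag for `kind`, or None if no new sample is covered."""
        head = self.read_head[kind]
        sent_at = None
        # Consume every sample the standby has now passed; keep the newest.
        while head < len(self.samples) and self.samples[head][0] <= reported_lsn:
            sent_at = self.samples[head][1]
            head += 1
        self.read_head[kind] = head
        return None if sent_at is None else now - sent_at


tracker = LagTracker()
tracker.record_sent(100, 0.0)
tracker.record_sent(200, 1.0)
write_lag = tracker.lag_for("write", 200, 1.5)    # standby wrote up to 200
replay_lag = tracker.lag_for("replay", 100, 3.0)  # but replayed only up to 100
```

Because all three measurements read from the same samples, nothing extra has to travel over the wire, which fits the statement above that no protocol change is needed.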
While the write and flush locations are sent back at the right times
already, I had to figure out how to get replies to be sent at the
right time when WAL was replayed too.  Without doing anything special
for that, you get the following cases:
1.  A busy system: replies flow regularly due to write and flush
feedback, and those replies include replay position, so there is no
problem.
2.  A system that has just streamed a lot of WAL causing the standby
to fall behind in replaying, but the primary is now idle:  there will
only be replies every 10 seconds (wal_receiver_status_interval), so
pg_stat_replication.replay_lag only updates with that frequency.
(That was already the case for replay_location.)
3.  An idle system that has just replayed some WAL and is now fully
caught up.  There is no reply until the next
wal_receiver_status_interval, so replay_lag now shows a bogus number
over 10 seconds.  Oops.
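
One plausible way to read case 3, sketched as arithmetic: if the sender can only measure lag when a reply arrives, then any delay in sending that reply is counted as lag. The function `measured_lag` and its parameters are hypothetical, and network latency is ignored.

```python
def measured_lag(sent_at, replayed_at, reply_delay):
    """Lag as seen by the sender (assumed model, all times in seconds).

    The sender cannot observe `replayed_at` directly; it only sees the
    moment the confirming reply arrives.
    """
    reply_received_at = replayed_at + reply_delay
    return reply_received_at - sent_at


# Prompt reply: the measurement matches the real replay delay.
prompt = measured_lag(sent_at=0.0, replayed_at=0.5, reply_delay=0.0)

# Idle system: the reply waits for the status interval, inflating the value.
delayed = measured_lag(sent_at=0.0, replayed_at=0.5, reply_delay=9.5)
```

Under this model the standby finished replaying after half a second in both calls, yet the second measurement comes out at ten seconds purely because the reply was late.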
Case 1 is good, and I suppose that 2 is OK, but I needed to do
something about 3.  The solution I came up with was to force one reply
to be sent whenever recovery runs out of WAL to replay and enters
WaitForWALToBecomeAvailable().  This seems to work pretty well in
initial testing.
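
The forced-reply rule can be sketched as a small decision function on the standby side. This is only an illustration under stated assumptions: `StandbyReplies`, `on_out_of_wal` and `maybe_reply` are made-up names, the 10 second constant stands in for wal_receiver_status_interval, and replies triggered by write and flush activity are left out.

```python
STATUS_INTERVAL = 10.0  # seconds; stands in for wal_receiver_status_interval


class StandbyReplies:
    """Toy model of when the standby reports its replay position."""

    def __init__(self):
        self.last_reply_time = 0.0
        self.force_reply = False

    def on_out_of_wal(self):
        """Recovery ran out of WAL to replay: request one immediate reply."""
        self.force_reply = True

    def maybe_reply(self, now, replay_lsn):
        """Return the replay position to report, or None if no reply is due."""
        interval_elapsed = now - self.last_reply_time >= STATUS_INTERVAL
        if not (interval_elapsed or self.force_reply):
            return None
        # The forced reply is one-shot: clear the flag once it is used.
        self.force_reply = False
        self.last_reply_time = now
        return replay_lsn


standby = StandbyReplies()
before = standby.maybe_reply(1.0, 500)  # nothing due yet
standby.on_out_of_wal()                 # replay caught up at about t=1.5
forced = standby.maybe_reply(1.5, 500)  # reply goes out immediately
```

Making the flag one-shot means an idle, caught-up standby sends a single extra reply and then falls back to the regular interval.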
Thoughts?
-- 
Thomas Munro
http://www.enterprisedb.com
| Attachment | Content-Type | Size | 
|---|---|---|
| replication-lag-v1.patch | application/octet-stream | 15.7 KB | 