Re: Measuring replay lag

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-02-23 06:22:38
Message-ID: CAEepm=1uHiNdE62Qgdzq5k6dKwsW1f1-cBo7vF43ZHGFddcaBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 21, 2017 at 6:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I think what we need to show some test results with the graph of lag
> over time for these cases:
> 1. steady state - pgbench on master, so we can see how that responds
> 2. blocked apply on standby - so we can see how the lag increases but
> also how the accuracy goes down as the lag increases and whether the
> reported value changes (depending upon algo)
> 3. burst mode - where we go from not moving to moving at high speed
> and then stop again quickly
> +other use cases you or others add

Here are graphs of the 'burst' example from my previous email, with
LAG_TRACKER_BUFFER_SIZE set to 4 (really small so that it fills up)
and 8192 (the size I'm proposing in this patch). It looks like the
resampling and interpolation work pretty well to me when the buffer is
full.

The overall graph looks pretty similar, but it is more likely to short
hiccups caused by occasional slow WAL fsyncs in walreceiver. See the
attached graphs with 'spike' in the name: in the large buffer version
we see a short spike in write/flush lag and that results in apply
falling behind, but in the small buffer version we can only guess that
that might have happened because apply fell behind during the 3rd and
4th write bursts. We don't know exactly why because we didn't have
sufficient samples to detect a short lived write/flush delay.

The workload just does this in a loop:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo AS SELECT generate_series(1, 10000000);
SELECT pg_sleep(10);

While testing with a small buffer I found a thinko when write_head is
moved back, fixed in the attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
replication-lag-v4.patch application/octet-stream 19.5 KB
image/png 7.6 KB
image/png 7.7 KB
image/png 7.0 KB
image/png 6.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-02-23 06:24:34 Re: Measuring replay lag
Previous Message Amit Langote 2017-02-23 05:49:31 Re: Partitioned tables and relfilenode