Re: Measuring replay lag

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Measuring replay lag
Date: 2017-03-05 14:22:56
Message-ID: CAMsr+YGX1tAKPvjmWoq1X8nOodenP5sCwuEfSuR21DjpYNuD0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5 March 2017 at 15:31, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 1 March 2017 at 10:47, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

>> This seems to be problematic. Logical peers report LSN changes for
>> all three operations (write, flush, commit) only on commit. I suppose
>> that might work OK for synchronous replication, but it makes it a bit
>> difficult to get lag measurements that don't look really strange and
>> sawtoothy when you have long transactions, and overlapping
>> transactions might interfere with the measurements in odd ways.

They do, but those sawtoothy measurements are a true reflection of the
aspect of lag that's most important - what the downstream replica has
flushed to disk and made visible.

If we have xacts X1, X2 and X3, which commit in that order, then after
X1 commits the lag is the difference between X1's commit time and
"now". A peer only sends feedback updating flush position for commits.
Once X1 is confirmed replayed by the peer, the lag flush_location is
the difference between X2's later commit time and "now". And so on.

So I'd very much expect a sawtooth lag graph for flush_location,
because that's how logical replication really replays changes. We only
start replaying any xact once it commits on the upstream, and we
replay changes strictly in upstream commit order. It'll rise linearly
then fall vertically in abrupt drops.

sent_location is updated based on the last-decoded WAL position, per
src/backend/replication/walsender.c:2396 or so:

record = XLogReadRecord(logical_decoding_ctx->reader,
logical_startptr, &errm);
logical_startptr = InvalidXLogRecPtr;

/* xlog record was invalid */
if (errm != NULL)
elog(ERROR, "%s", errm);

if (record != NULL)
{
LogicalDecodingProcessRecord(logical_decoding_ctx,
logical_decoding_ctx->reader);

sentPtr = logical_decoding_ctx->reader->EndRecPtr;
}

so I would expect to see a smoother graph for sent_location based on
the last record processed by the XLogReader.

Though it's also a misleading graph, we haven't really sent that at
all, just decoded it and buffered it in a reorder buffer to send once
we decode a commit. Really, pg_stat_replication isn't quite expressive
enough to cover logical replication due to its reordering behaviour.
We can't really report the actual last LSN sent to the client very
easily, since we call into ReorderBufferCommit() and don't return
until we finish streaming the whole buffered xact, we'd need some kind
of callback to update the walsender with the lsn of the last row we
sent. And if we did this, sent_location would actually go backwards
sometimes, since usually with concurrent xacts the newest row in xact
committed at time T is newer, with higher LSN, than the oldest row in
xact with comit time T+n.

Later I'd like to add support for xact interleaving, where we can
speculatively stream rows from big in-progress xacts to the downstream
before the xact commits, and the downstream has to roll the xact back
if it aborts on the upstream. There are some snapshot management
issues to deal with there (not to mention downstream deadlock
hazards), but maybe we can limit the optimisation to xacts that made
no catalog changes to start with. I'm not at all sure how to report
sent_location then, though, or what a reasonable measure of lag will
look like.

> What we want from this patch is something that works for both, as much
> as that is possible.

If it shows a sawtooth pattern for flush lag, that's good, because
it's truthful. We can only flush after we replay commit, therefore lag
is always going to be sawtooth, with tooth size approximating xact
size and the baseline lag trend representing any sustained increase or
decrease in lag over time.

This would be extremely valuable to have.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-03-05 18:56:55 Re: REINDEX CONCURRENTLY 2.0
Previous Message Jan Michálek 2017-03-05 13:02:10 Re: Other formats in pset like markdown, rst, mediawiki