Re: Timeline following for logical slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Timeline following for logical slots
Date: 2016-04-04 09:50:02
Message-ID: CAMsr+YEjJNF0ewBQzsn0_MkTF3P2ZntVc4GKSG1K8NR-PkpjdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1 April 2016 at 14:46, Andres Freund <andres(at)anarazel(dot)de> wrote:

> > However: It's safe for the slot state on the replica to be somewhat
> behind
> > the local replay from the master, so the slot state on the replica is
> older
> > than what it would've been at an equivalent time on the master... so long
> > as it's not so far behind that the replica has replayed vacuum activity
> > that removes rows still required by the slot's declared catalog_xmin.
> Even
> > then it should actually be OK in practice because the failing-over client
> > will always have replayed past that point on the master (otherwise
> > catalog_xmin couldn't have advanced on the master), so it'll always ask
> to
> > start replay at a point at or after the lsn where the catalog_xmin was
> > advanced to its most recent value on the old master before failover. It's
> > only safe for walsender based decoding though, since there's no way to
> > specify the startpoint for sql-level decoding.
>
> This whole logic fails entirely flats on its face by the fact that even
> if you specify a startpoint, we read older WAL and process the
> records. The startpoint filters every transaction that commits earlier
> than the "startpoint". But with a long-running transaction you still
> will have old changes being processed, which need the corresponding
> catalog state.
>

Right. Having read over those draft docs I see what you mean.

TL;DR: the only way I can see to do this now is full-on failover slots, or
something very similar that uses pluggable WAL + a slot save hook + moving
CheckPointReplicationSlots to the start of a checkpoint while WAL is still
writeable.

To rephrase per my understanding: The client only specifies the point it
wants to start seeing decoded commits. Decoding starts from the slot's
restart_lsn, and that's the point from which the accumulation of reorder
buffer contents begins, the snapshot building process begins, and where
accumulation of relcache invalidation information begins. At restart_lsn no
xact that is to be emitted to the client may yet be in progress. Decoding,
whether or not the xacts will be fed to the output plugin callbacks,
requires access to the system catalogs. Therefore catalog_xmin reported by
the slot must be >= the real effective catalog_xmin of the heap and valid
at the restart_lsn, not just the confirmed flush point or the point the
client specifies to resume fetching changes from.

On the original copy of the slot on the pre-failover master the restart_lsn
would've been further ahead, as would the catalog_xmin. So catalog rows
have been purged. When we decode from the old restart_lsn on the replica
after failover we're not just doing excess work, we'd be accumulating
probably-incorrect invalidation information for the xacts that we're
decoding (but not sending to the client). Or just failing to find entries
we expect to find in the caches and dying.

If the restart_lsn was advanced on the master there must be another safe
decoding restart point after the one the old copy of the slot points to.
But we don't know where, and we don't have any way to know that we *are* in
such a post-failover situation so we can't just scan forward looking for it
without looking up invalidation info for xacts we know we'll be throwing
away.

So it's necessary to ensure that the slot's restart_lsn and catalog_xmin
are advanced in a timely, consistent manner on the replica's copy of the
slot at a point where no vacuum changes to the catalog that could remove
needed tuples have been replayed.

The only way I can think of to do that really reliably right now, without
full failover slots, is to use the newly committed pluggable WAL mechanism
and add a hook to SaveSlotToPath() so slot info can be captured, injected
in WAL, and replayed on the replica. It'd also be necessary to move
CheckPointReplicationSlots() out of CheckPointGuts() to the start of a
checkpoint/restartpoint when WAL writing is still permitted, like the
failover slots patch does. Basically, failover slots as a plugin using a
hook, without the additions to base backup commands and the backup label.
I'm not convinced that's at all better than the full failover slots, and
I'm not sure I can make a solid argument for the general purpose utility of
the hook it'd need, but at least it'd avoid committing core to that
approach.

I'd really hate 9.6 to go out with - still - no way to use logical decoding
in a basic, bog-standard HA/failover environment. It overwhelmingly limits
their utility and it's becoming a major drag on practical use of the
feature. That's a difficulty given that the failover slots patch isn't
especially trivial and you've shown that lazy sync of slot state is not
sufficient.

> It's also OK for the slot state to be *ahead* of the replica's replay
> > position, i.e. from the future. restart_lsn and catalog_xmin will be
> higher
> > than they were on the master at the same point in time, but that doesn't
> > matter at all, since the client will always be requesting to start from
> > that point or later, having replayed up to there on the old master before
> > failover.
>
> I don't think that's true. See my point above about startpoint.
>

My explanation is certainly wrong. I suspect the result of doing it is
still actually OK, though (though pretty useless, per below).

The restart_lsn from the newer copy of the slot is, as you said, a point we
know we can reconstruct visibility info. Also a point where no xact that we
want to decode was already in-progress, since we have to accumulate those
in the reorder buffer. The client will be requesting (via its local
replication origin progress tracking or whatever) that logical decoding
start emitting xact contents at a point that would've been reasonable with
the newer restart_lsn from the copied slot, since otherwise it wouldn't
have advanced on the master. There's no problem with missing catalog
entries, we'll have extras we don't need until vacuum catches up but that's
all.

The client's requested start point should always be reasonable w.r.t. the
latest copy of the master's slot, so long as the replica has replayed up to
or past the point the client wants to start receiving decoded commits. Nor
should any resources needed have been removed from the replica.

So long as the confirmed lsn (and whatever point the client requests replay
resume from) is less than the timeline switchpoint to the current timeline
it should be OK.

If I'm right about that then syncing slot state over such that the slot
data is always the same as or newer than it was on the master at the same
point in history should be OK. We could grab the master's xlog insert ptr,
snapshot its slots, sync 'em over, and advance the recovery target lsn on
the replica to allow it to replay up to that point in the master's history,
repeating periodically. However, it'd be hideous and it'd be completely
impossible to use for sync rep, so it's pretty useless even if it would
work.

> > I think the comment would be less necessary, and could be moved up to the
> > CreateDecodingContext call, if we passed the slot's confirmed_flush
> > explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> > and having the CreateDecodingContext call infer the start point. That way
> > what's going on would be a lot clearer when reading the code.
>
> How would that solve anything? We still need to process the old records,
> hence the xlogreader needs to instructed to read them. Hence
> logicalfuncs.c needs to know about it.
>

Huh?

All that'd do would be making what already happens more visible. Instead of
passing InvalidXLogRecPtr as start_lsn having it transparently replaced
with the confirmed_lsn inside CreateDecodingContext, we'd pass the
slot->data.confirmed_lsn directly.

IMO it'd just make it a little easier to follow what's going on.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2016-04-04 10:01:16 Re: Timeline following for logical slots
Previous Message Fabien COELHO 2016-04-04 09:47:47 Re: pgbench more operators & functions