Re: Timeline following for logical slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Timeline following for logical slots
Date: 2016-04-04 14:59:41
Message-ID: CAMsr+YF1Wi=8hAryUM1Sn=5tW64QvWav4quP1k14-M4EKTHNRQ@mail.gmail.com
Lists: pgsql-hackers

On 4 April 2016 at 18:01, Andres Freund <andres(at)anarazel(dot)de> wrote:

> > The only way I can think of to do that really reliably right now, without
> > full failover slots, is to use the newly committed pluggable WAL
> mechanism
> > and add a hook to SaveSlotToPath() so slot info can be captured, injected
> > in WAL, and replayed on the replica.
>
> I personally think the primary answer is to use separate slots on
> different machines. Failover slots can be an extension to that at some
> point, but I think they're a secondary goal.
>

Assuming that here you mean separate slots on different machines
replicating via physical rep:

We don't currently allow the creation of a logical slot on a standby, nor
replay from one, even just to advance it without receiving the decoded
changes. Both would be required for that to work, as would extensions to the
hot standby feedback mechanism to allow a standby to ask the master to pin
its catalog_xmin when slots on the standby are further behind than the
master's own.

I was chatting about that with Petr earlier. What we came up with was to
require the standby to connect to the master using a replication slot that,
while remaining a physical replication slot, has a catalog_xmin set and
updated by the replica via extended standby progress messages. The
catalog_xmin the replica pushes up to the master would simply be the
min(catalog_xmin) of all slots on the replica,
i.e. procArray->replication_slot_catalog_xmin. The same goes for the slot
xmin, if defined for any slot on the replica.

That makes sure that the catalog_xmin required for the standby's slots is
preserved even if the standby isn't currently replaying from the master.
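
To make that concrete, here's a rough sketch of the walreceiver side of such
extended feedback. It's illustrative only: the extra catalog_xmin field and
its place in the 'h' feedback message are assumptions for discussion, not an
existing protocol, and the function name is invented.
ProcArrayGetReplicationSlotXmin() is the existing accessor for the values
mentioned above.

/*
 * Sketch only: hot standby feedback that also reports the oldest
 * catalog_xmin needed by any replication slot on this standby. The extra
 * message field is hypothetical, not part of the current protocol.
 */
static void
XLogWalRcvSendCatalogXminFeedback(void)
{
    TransactionId xmin;
    TransactionId slot_xmin;
    TransactionId catalog_xmin;

    /* Oldest xmin any local backend still needs, as for normal feedback */
    xmin = GetOldestXmin(NULL, false);

    /*
     * Oldest xmin/catalog_xmin required by slots on this standby, i.e. the
     * procArray->replication_slot_* values mentioned above.
     */
    ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedes(slot_xmin, xmin))
        xmin = slot_xmin;

    /* Build and send the feedback message to the upstream walsender */
    resetStringInfo(&reply_message);
    pq_sendbyte(&reply_message, 'h');
    pq_sendint64(&reply_message, GetCurrentTimestamp());
    pq_sendint(&reply_message, xmin, 4);
    pq_sendint(&reply_message, catalog_xmin, 4);   /* new field (assumed) */
    walrcv_send(reply_message.data, reply_message.len);
}

The walsender on the master would then apply that catalog_xmin to the
physical slot the standby is connected through, so vacuum can't remove
catalog rows that logical slots on the standby still need.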

Handily, this approach would give us cascading, support for intermediate
servers, and the option of only having failover slots on some replicas but
not others. All things that were raised as concerns with failover slots.

However, clients would then have to know about the replica(s) of the master
that are failover candidates, and would have to send feedback to advance
their slots on those nodes, not just on the master. They'd have to be able
to connect to the replicas too, unless we added some mechanism for the
master to lazily relay those feedback messages to replicas. Not a major
roadblock, just a bit fiddlier for clients.

Consistency shouldn't be a problem so long as the slot created on the
replica reaches SNAPBUILD_CONSISTENT (or there's enough pending WAL for it
to do so) before failover is required.

I think it'd be a somewhat reasonable alternative to failover slots and
it'd make it much more practical to decode from a replica. Which would be
great. It'd be fiddlier for clients, but probably worth it to get rid of
the limitations failover slots impose.

> > It'd also be necessary to move
> > CheckPointReplicationSlots() out of CheckPointGuts() to the start of a
> > checkpoint/restartpoint when WAL writing is still permitted, like the
> > failover slots patch does.
>
> Ugh. That makes me rather wary.
>

Your comments say it's called in CheckPointGuts for convenience... and
really there doesn't seem to be anything that makes a slot checkpoint
especially tied to a "real" checkpoint.
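
The reordering I have in mind is roughly what the failover slots patch does.
A heavily abridged sketch, not the committed code path: flush slot state
near the start of CreateCheckPoint() (and CreateRestartPoint()), while
XLogInsert() is still permitted, and drop the flush from CheckPointGuts().

/*
 * Sketch of moving the slot checkpoint out of CheckPointGuts(); abridged,
 * following the shape of the failover slots patch rather than core code.
 */
void
CreateCheckPoint(int flags)
{
    /* ... establish the redo pointer, etc. ... */

    /*
     * Flush dirty slot state to pg_replslot/ while WAL writing is still
     * allowed, so a plugin could also emit slot state into WAL here.
     */
    CheckPointReplicationSlots();

    /* ... */

    /* CheckPointGuts() would no longer flush slots in this arrangement. */
    CheckPointGuts(checkPoint.redo, flags);

    /* ... write the checkpoint record, update pg_control, etc. ... */
}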

> > Basically, failover slots as a plugin using a hook, without the
> > additions to base backup commands and the backup label.
>
> I'm going to be *VERY* hard to convince that adding a hook inside
> checkpointing code is acceptable.
>

Yeah... it's in ReplicationSlotSave, but it's still a slot checkpoint even
if (per above) it ceases to also be in the middle of a full system
checkpoint.
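
For reference, something like this is what I meant, purely as a sketch; no
such hook exists in core, and the hook name and its placement in
SaveSlotToPath() are assumptions:

/*
 * Hypothetical hook: a plugin could capture the slot state here and, for
 * example, write it into WAL via the new pluggable/generic WAL mechanism
 * so the slot can be recreated on a physical replica during replay.
 */
typedef void (*slot_save_hook_type) (ReplicationSlot *slot);
slot_save_hook_type slot_save_hook = NULL;

static void
SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
{
    /* ... existing code serialises slot->data to a state file in dir ... */

    if (slot_save_hook != NULL)
        (*slot_save_hook) (slot);
}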

> > I'd really hate 9.6 to go out with - still - no way to use logical
> decoding
> > in a basic, bog-standard HA/failover environment. It overwhelmingly
> limits
> > their utility and it's becoming a major drag on practical use of the
> > feature. That's a difficulty given that the failover slots patch isn't
> > especially trivial and you've shown that lazy sync of slot state is not
> > sufficient.
>
> I think the right way to do this is to focus on failover for logical
> rep, with separate slots. The whole idea of integrating this physical
> rep imo makes this a *lot* more complex than necessary. Not all that
> many people are going to want to physical rep and logical rep.
>

If you're saying we should focus on failover between nodes that're
themselves connected using logical replication rather than physical
replication, I really have to strongly disagree.

TL;DR for the book-length reply below: we're a long, long way from being
able to deliver even vaguely decent logical-rep-based failover. Without
that, or support for logical decoding surviving physical failover, we've got
a great tool in logical decoding that can't be used effectively with most
real-world systems.

I originally thought logical rep based failover was the way forward too and
that mixing physical and logical rep didn't make sense.

The problem is that we're a very, very long way from there, whereas we can
deliver failover of logical decoding clients to physical standbys with
_relative_ ease and simplicity. Not actually easy or simple, but a lot
closer.

To allow logical rep and failover to be a reasonable substitute for
physical rep and failover, IMO we *need*:

* Robust sequence decoding and replication. If you were following the later
parts of that discussion you will have seen how fun that's going to be, but
it's the simplest of all of the problems.

* Logical decoding and sending of in-progress xacts, so the logical client
can already be most of the way through receiving a big xact when it
commits. Without this we have a huge lag spike whenever a big xact happens,
since we must first finish decoding it into a reorder buffer and can only
then *begin* to send it to the client. During which time no later xacts may
be decoded or replayed to the client. If you're running that rare thing,
the perfect pure OLTP system, you won't care... but good luck finding one
in the real world.

* Either parallel apply on the client side or at least buffering of
in-progress xacts on the client side so they can be safely flushed to disk
and confirmed, allowing receive to continue while replay is done on the
client. Otherwise sync rep is completely impractical... and there's no
shortage of systems out there that can't afford to lose any transactions.
Or at least have some crucial transactions they can't lose.

* Robust, seamless DDL replication, so things don't just break randomly.
This makes the other points above look nice and simple by comparison.
Logical decoding of 2PC xacts with DDL would help here, as would the
ability to transparently convert an xact into a prepare-xact on client
commit and hold the client waiting while we replicate it, confirm the
successful prepare on the replica, then commit prepared on the upstream.

* oh, and some way to handle replication of shared catalog changes like
pg_authid, so the above DDL replication doesn't just randomly break if it
happens to refer to a global object that doesn't exist on the downstream.

Physical rep *works*. Robustly. Reliably. With decent performance. It's
proven. It supports sync rep. I'm confident telling people to use it.

I don't think there's any realistic way we're going to get there for
logical rep in 9.6+n for n<2 unless a whole lot more people get on board
and work on it. Even then.

Right now we can deliver logical failover for DBs that:

(a) only use OWNED BY sequences like SERIAL, and even then only with some
hacks;
(b) don't do DDL, ever, or only do some limited DDL via direct admin
commands where they can call some kind of helper function to queue and
apply the DDL;
(c) don't do big transactions or don't care about unbounded lag;
(d) don't need synchronous replication or don't care about possibly large
delays before commit is confirmed;
(e) only manage role creation (among other things) via very strict
processes that can be absolutely guaranteed to run on all nodes

... which in my view isn't a great many databases.

Physical rep has *none* of those problems. (Sure, it has others, but we're
used to them). It does lack any way for a logical client to follow failover
though, meaning that right now it's really hard to use logical rep in
conjunction with physical rep. Anything you use logical decoding for has to
be able to cope with completely breaking and being resynced from a new
snapshot after failover, which removes a lot of the advantage that logical
decoding's reliable, ordered replay gives us in the first place.

Just one example: right now BDR doesn't like losing nodes without warning,
as you know. We want to add support for doing recovery replay from the
most-ahead peer of a lost node to help with that, though the conflict
handling implications of that could be interesting. But even then replacing
the lost node still hurts when you're working over a WAN. In a real-world
case I've dealt with, it took over 8 hours to bring up a replacement for a
lost node over the WAN after the original node's host suffered an abrupt
hardware failure. If we'd had a physical sync standby for each BDR node
running locally on the same LAN as the main node (and dealt with the fun
issues around the use of the physical timeline in BDR's node identity keys),
we could've just promoted it to replace the lost node with minimal
disruption.

You could run two nodes on each site, but then you either double your WAN
bandwidth use or have to support complex non-mesh topologies with logical
failover-candidate standbys hanging off each node in the mesh.

That's just BDR though. You can't really use logical replication for things
like collecting an audit change stream, feeding business message buses and
integration systems, replicating to non-PostgreSQL databases, etc., if you
can't point it at an HA upstream and expect it to still work after failover.
Since IMO we can't deliver logical-replication-based HA in a reasonable
timeframe, we should really have a way for logical slots to follow physical
failover.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
