Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-14 03:56:28
Message-ID: CAMsr+YEupkmQR2zogMqySeskbNv24AhymWMQHbguJzQxCvZrow@mail.gmail.com
Lists: pgsql-hackers

On 12 August 2017 at 08:03, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> > I actually think failover slots are quite desirable, especially now
> > that we've got logical replication in core. In a review of this
> > thread I don't see anyone saying otherwise. The debate has really
> > been about the right way of implementing that.
>
> Given that I presumably was one of the people pushing back more
> strongly: I agree with that. Besides disagreeing with the proposed
> implementation our disagreements solely seem to have been about
> prioritization.
>
> I still think we should have a halfway agreed upon *design* for logical
> failover, before we introduce a concept that's quite possibly going to
> be incompatible with that, however. But that doesn't mean it has to
> submitted/merged to core.
>

How could it be incompatible? The idea here is to make physical failover
transparent to logical decoding clients. That's not meant to sound
confrontational; I mean that I personally can't see any way it would be
incompatible, and I could use your ideas.

I understand that it might be *different*, and that you'd like the two
approaches to be more closely aligned and to work more similarly. For that we
first need to know more clearly what logical failover will look like. But it's
hard not to also see this as delaying and blocking until your preferred
approach, via pure logical rep and logical failover, gets in and physical
failover can be dismissed with "we don't need that anymore". I'm sure that's
not your intent; I just struggle not to read it that way when there's always
another reason not to solve this problem because of a loosely related
development effort on a different one.

> I think there's a couple design goals we need to agree upon, before
> going into the weeds of how exactly we want this to work. Some of the
> axis I can think of are:
>
> - How do we want to deal with cascaded setups, do slots have to be
> available everywhere, or not?
>

Personally, I don't care either way.

> - What kind of PITR integration do we want? Note that simple WAL based
> slots do *NOT* provide proper PITR support, there's not enough
> interlock easily available (you'd have to save slots at the end, then
> increment minRecoveryLSN to a point later than the slot saving)
>

Interesting. I haven't fully understood this, but I think I see what you're
getting at.

As outlined in the prior mail, I'd like to have working PITR with logical
slots, but I think it's pretty niche: it can't work usefully without plenty of
co-operation from the rest of the logical replication software in use, and you
can't just restore and resume normal operations. So I don't think it's worth
making it a priority.

It's possible to make PITR safe with slots by blocking further advance of
catalog_xmin on the running master for the life of the PITR base backup, using
a slot for retention. There's plenty of room for operator error until/unless
we add something like WAL-logging of catalog_xmin advances, but it can be done
now with external tools if you're careful. Details are in the prior mail.
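
To illustrate the workaround, here's a rough sketch of how an external tool
might pin catalog_xmin with a throwaway slot for the life of the base backup.
It's only a sketch, in Python with psycopg2; the connstring, slot name and
backup path are made up, the slot functions are the standard
pg_create_logical_replication_slot / pg_drop_replication_slot ones:

    import subprocess
    import psycopg2

    MASTER_DSN = "host=master dbname=postgres"   # hypothetical connstring
    RETENTION_SLOT = "pitr_catalog_xmin_hold"    # hypothetical slot name

    conn = psycopg2.connect(MASTER_DSN)
    conn.autocommit = True
    cur = conn.cursor()

    # Create a logical slot *before* the base backup starts and never advance
    # it. While it exists, its catalog_xmin blocks catalog vacuum on the
    # master, so the catalog rows needed to decode from any slot copied in
    # the base backup still exist after a PITR restore.
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, 'test_decoding')",
        (RETENTION_SLOT,))

    # Take the base backup while the retention slot pins catalog_xmin.
    subprocess.check_call(
        ["pg_basebackup", "-h", "master", "-D", "/backups/base", "-X", "stream"])

    # Only once this backup (and any PITR restore you might still do from it)
    # is no longer needed is it safe to drop the slot and let catalog_xmin
    # advance again:
    # cur.execute("SELECT pg_drop_replication_slot(%s)", (RETENTION_SLOT,))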

So I don't think PITR for logical slots is important: there's a workaround,
and even if you had it, it wouldn't be simple to actually do anything useful
with it.

> - How much divergence are we going to accept between logical decoding on
> standbys, and failover slots. I'm probably a lot closer to zero than
> Craig is.
>

They're different things to me, but I think you're asking "to what extent
should failover slots functionality be implemented strictly on top of
decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client
can expect a slot it creates on a master (or on a physical streaming replica
doing decoding on standby) to continue to exist across failover. The client
can ignore physical HA and promotions of the master, which can continue to be
managed using normal postgres tools. It's the same as, say, an XA transaction
manager expecting that if your master dies and you fail over to a standby, the
TM shouldn't have to have been doing special housekeeping on the promotion
candidate before promotion in order for 2PC to continue to work. It Just
Works.

Logical decoding on standby is useful with or without failover slots, as you
can use it to extract data from a replica, and now that decoding timeline
following is in, a decoding connection on a replica will survive promotion to
master.

But in addition to its main purpose of allowing logical decoding from a
standby server to offload work, it can be used to implement client-managed
support for failover to physical replicas. For this, the client must have
an inventory of promotion-candidates of the master and their connstrings so
it can maintain slots on them too. The client must be able to connect to
all promotion-candidates and advance their slots via decoding along with
the master slots it's actually replaying from. If a client isn't "told"
about a promotion candidate, decoding will break when we fail over. If a
client cannot connect to a promotion candidate, catalog_xmin will fall
behind on master until the replica is discarded (and its physical slot
dropped) or the client regains access. Every different logical decoding
client application must implement all this logic and management separately.
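
To give a feel for the bookkeeping each client would have to carry, here's a
rough sketch of the advance step, assuming decoding on standby were available
on the candidates (which it isn't yet) and using made-up slot name and
connstrings. The candidates' copies are advanced by decode-and-discard up to
the LSN already confirmed on the master:

    import psycopg2

    SLOT = "my_app_slot"                              # hypothetical
    MASTER_DSN = "host=master dbname=appdb"           # hypothetical
    CANDIDATE_DSNS = ["host=standby1 dbname=appdb",   # hypothetical inventory
                      "host=standby2 dbname=appdb"]

    def confirmed_flush_lsn(dsn, slot):
        # The LSN this client has already flushed and confirmed on the master.
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT confirmed_flush_lsn FROM pg_replication_slots"
                        " WHERE slot_name = %s", (slot,))
            row = cur.fetchone()
            return row[0] if row else None

    def advance_candidate(dsn, slot, upto_lsn):
        # Decode and discard on the promotion candidate up to upto_lsn, which
        # also advances that slot's confirmed_flush_lsn and catalog_xmin.
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute("SELECT count(*)"
                        " FROM pg_logical_slot_get_changes(%s, %s, NULL)",
                        (slot, upto_lsn))

    # If a candidate is unreachable this loop stalls, and catalog_xmin falls
    # behind on the master until the candidate is dropped or comes back.
    master_lsn = confirmed_flush_lsn(MASTER_DSN, SLOT)
    if master_lsn is not None:
        for dsn in CANDIDATE_DSNS:
            advance_candidate(dsn, SLOT, master_lsn)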

It may be possible to implement failover-slots-like functionality on top of
decoding on standby in an app-transparent way, by having the replica monitor
slot states on the master and self-advance its own slots via a loopback
decoding connection. Or the master could maintain an inventory of replicas and
make decoding connections to them, advancing their slots after the master's
slots are advanced by an app. But either way, why would we want to do this?
Why actually decode WAL and use the logical decoding machinery when we *know*
the state of the system because only the master is writeable?

The way I see it, to provide failover slots functionality we'd end up with
something quite similar to what Robert and I just discussed, but with the slot
advance implemented using decoding (on standby) instead of directly setting
slot state. What benefit does that offer?

I don't want to block failover slots on decoding on standby just because
decoding on standby would be nice to have.

> - How much divergence are we going to accept between infrastructure for
> logical failover, and logical failover via failover slots (or however
> we're naming this)? Again, I'm probably a lot closer to zero than
> craig is.

We don't have logical failover, let alone mature, tested logical failover
that covers most of Pg's available functionality. Nor much of a design for
it AFAIK. There is no logical failover to diverge from, and I don't want to
block physical failover support on that.

But, putting that aside to look at the details of how logical failover might
work, what sort of commonality do you expect to see? Physical failover is done
by WAL replication using archive recovery/streaming, managed via
recovery.conf, with unilateral promotion by trigger file/command. The admin is
expected to ensure that any clients and cascading replicas get redirected to
the promoted node and that the old one is fenced; we don't care whether that's
done by IP redirection, connstring updates or whatever. Per the proposal
Robert and I discussed, logical slots will be managed by having the
walsender/walreceiver exchange slot state information that cascades up/down
the replication tree via mirror slot creations.
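
To make "mirror slot creations" a bit more concrete, here's a deliberately
simplified sketch of the rule a standby might apply to the slot state it
receives from its upstream. The data structures and the guard against
advancing past local replay are my illustration of the idea, not anything
from an actual patch:

    from dataclasses import dataclass

    @dataclass
    class SlotState:
        name: str
        confirmed_flush_lsn: int   # plain ints instead of pg_lsn, to keep it simple
        catalog_xmin: int

    def apply_upstream_slots(upstream, local, local_replay_lsn):
        # 'upstream' is the slot state received via the walsender; 'local' is
        # this standby's mirror slots, keyed by name. Create a mirror if it's
        # missing and advance it towards the upstream position, but never
        # past what this standby has actually replayed, so the mirror is
        # immediately usable if we're promoted. A cascading standby would
        # relay the same state on to its own downstreams.
        for up in upstream:
            mirror = local.setdefault(up.name, SlotState(up.name, 0, 0))
            mirror.confirmed_flush_lsn = max(
                mirror.confirmed_flush_lsn,
                min(up.confirmed_flush_lsn, local_replay_lsn))
            mirror.catalog_xmin = max(mirror.catalog_xmin, up.catalog_xmin)
        # Drop mirrors whose upstream slot has gone away.
        for name in set(local) - {u.name for u in upstream}:
            del local[name]
        return local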

How's logical replica promotion going to work? Here's one possible way, of
many: the promotion-candidate logical replica consumes an unfiltered xact
stream that contains changes from all nodes, not just its immediate upstream.
Downstreams of the master maintain direct connections to the promotion
candidate and manage their own slots on it directly, sending a flush
confirmation for a slot on the promotion candidate once their decoding session
on the replica has decoded a commit for an LSN they have already sent a flush
confirmation to the master for. On promotion, the master's downstreams would
be reconfigured to connect to the node-id of the newly promoted master and
would begin decoding from it in catchup mode, where they receive the commits
from the old master via the new master, until they reach the new master's
end-of-WAL as of promotion. With some tweaks, like a logical WAL message
recording the moment of promotion, it's not that different from the
client-managed physical failover model.
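
Restating the confirmation rule from that sketch as code, purely
hypothetically: the downstream only confirms an LSN on the promotion
candidate's slot once it has both seen the candidate decode that commit and
already confirmed it to the master, so the candidate's slot never runs ahead
of the master's:

    def candidate_confirm_lsn(master_flushed_lsn, candidate_decoded_commits):
        # master_flushed_lsn: latest LSN this downstream has flushed and
        # confirmed to the master.
        # candidate_decoded_commits: commit LSNs this downstream has seen its
        # decoding session on the candidate emit, in commit order.
        confirm = None
        for lsn in candidate_decoded_commits:
            if lsn > master_flushed_lsn:
                break
            confirm = lsn
        return confirm

    # Flushed up to 1500 on the master; the candidate has decoded commits at
    # 1000, 1400 and 1600, so we confirm 1400 on the candidate for now.
    print(candidate_confirm_lsn(1500, [1000, 1400, 1600]))  # -> 1400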

It can also be converted to a more transparent, failover-slots-like model by
having the promotion-candidate physical replica clone slots from its upstream
but advance them by loopback decoding (not necessarily actual network
loopback). It'd use a filter that discards the data and only sees the commit
XIDs + LSNs. It'd send a confirmation on a slot once the local copy had
processed a commit for which the upstream's copy of the slot already had a
confirmation for that LSN. On promotion, replicas would connect with new
replorigins (0) and let decoding start at the slot positions on the replica.
The master->replica slot state reporting can be done via the walsender too,
just as proposed for the physical case, though no replica->master reporting
would be needed for logical failover.

So despite my initial expectations, they can be moderately similar in broad
structure. But I don't think there's going to be much actual code overlap
beyond minor things like both wanting a way to query slot state on the
upstream. Both *could* use decoding on standby to advance slot positions, but
for the physical case that's just a slower (and unfinished) way to do what we
already have, whereas it's necessary for logical failover.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
