Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-10 06:38:46
Message-ID: CAMsr+YGfaT1N_0cofrDh8ePu604GRnziweJLYQjTk=O3zH=uog@mail.gmail.com
Lists: pgsql-hackers

On 9 August 2017 at 23:42, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> >> - When a standby connects to a master, it can optionally supply a list
> >> of slot names that it cares about.
> >
> > Wouldn't that immediately exclude use for PITR and snapshot recovery? I
> > have people right now who want the ability to promote a PITR-recovered
> > snapshot into place of a logical replication master and have downstream
> > peers replay from it. It's more complex than that, as there's a resync
> > process required to recover changes the failed node had sent to other
> > peers but isn't available in the WAL archive, but that's the gist.
> >
> > If you have a 5TB database do you want to run an extra replica or two
> > because PostgreSQL can't preserve slots without a running, live replica?
> > Your SAN snapshots + WAL archiving have been fine for everything else so
> > far.
>
> OK, so what you're basically saying here is that you want to encode
> the failover information in the write-ahead log rather than passing it
> at the protocol level, so that if you replay the write-ahead log on a
> time delay you get the same final state that you would have gotten if
> you had replayed it immediately. I hadn't thought about that
> potential advantage, and I can see that it might be an advantage for
> some reason, but I don't yet understand what the reason is. How would
> you imagine using any version of this feature in a PITR scenario? If
> you PITR the master back to an earlier point in time, I don't see how
> you're going to manage without resyncing the replicas, at which point
> you may as well just drop the old slot and create a new one anyway.
>

I've realised that it's possible to work around it in app-space anyway. You
create a new slot on a node before you snapshot it, and you don't drop this
slot until you discard the snapshot. The existence of this slot ensures
that any WAL generated by the node (and replayed by PITR after restore)
cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or
have some other safeguard in place, which we need for logical decoding on
standby to be safe anyway, then we can fail gracefully if the user does
something dumb.

So no need to care about this.
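
For illustration, a minimal libpq sketch of that guard-slot approach. The
slot name, plugin choice and conninfo are placeholders, and error handling
is trimmed; it's a sketch of the workaround, not a recommended tool.

/* guard_slot.c: hold a logical slot for the lifetime of a snapshot so that
 * WAL replayed by a later PITR restore can't have advanced the global
 * catalog_xmin past what the restored node will still need.
 * Slot name, plugin and conninfo are placeholders. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *conn = PQconnectdb("dbname=postgres");
    PGresult *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Before taking the snapshot: pin catalog_xmin and restart_lsn. */
    res = PQexec(conn,
                 "SELECT pg_create_logical_replication_slot("
                 "'snapshot_guard', 'test_decoding')");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "create slot failed: %s", PQerrorMessage(conn));
    PQclear(res);

    /* ... take the SAN snapshot / base backup here ... */

    /* Only once the snapshot is discarded (or promoted and resynced): */
    res = PQexec(conn, "SELECT pg_drop_replication_slot('snapshot_guard')");
    PQclear(res);

    PQfinish(conn);
    return 0;
}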

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example,
in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from
our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin
filter for "originating from me" (sketched below), for xacts from the
divergence lsn up to the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume
replication

To make that possible, since we can't rewind slots once they're confirmed
advanced, maintain a backup slot on the peer corresponding to the point in
time at which the snapshot was taken.
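
The "originating from me" filtering is already covered by an existing
output plugin hook. A rough sketch of that callback, compiled against the
server headers as part of an output plugin; resolving my_origin_id (e.g.
with replorigin_by_name) and wiring the callback up are assumed to happen
in the plugin's init/startup callbacks:

/* Sketch only: an output plugin origin filter that skips changes the
 * recovering node itself originated, so the resync doesn't re-apply its
 * own xacts. Registered via cb->filter_by_origin_cb in
 * _PG_output_plugin_init(); my_origin_id is assumed to be resolved in
 * the startup callback. */
#include "postgres.h"
#include "replication/logical.h"
#include "replication/origin.h"

static RepOriginId my_origin_id = InvalidRepOriginId;

static bool
resync_filter_by_origin_cb(LogicalDecodingContext *ctx, RepOriginId origin_id)
{
    /* Returning true tells the decoder to skip this change. */
    return origin_id == my_origin_id;
}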

For most other situations there is little benefit vs just re-creating the
slot before you permit user-initiated write xacts to begin on the restored
node.

I can accept an argument that "we" as pgsql-hackers do not consider this
something worth caring about, should that be the case. It's niche enough
that you could argue it doesn't have to be supportable in stock postgres.

> Maybe you're thinking of a scenario where we PITR the master and also
> use PITR to rewind the replica to a slightly earlier point?

That can work, but it must be done in lock-step. You have to pause apply on
both ends for long enough to snapshot both, otherwise the replication
origins on one end get out of sync with the slots on the other.

Interesting, but I really hope nobody's going to need to do it.

> But I
> can't quite follow what you're thinking about. Can you explain
> further?
>

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes
quite hard to see the outside perspective, so thanks for your patience.

>
> > Requiring live replication connections could also be an issue for service
> > interruptions, surely? Unless you persist needed knowledge in the
> > physical replication slot used by the standby to master connection, so
> > the master can tell the difference between "downstream went away for a
> > while but will come back" and "downstream is gone forever, toss out its
> > resources."
>
> I don't think the master needs to retain any resources on behalf of
> the failover slot. If the slot has been updated by feedback from the
> associated standby, then the master can toss those resources
> immediately. When the standby comes back on line, it will find out
> via a protocol message that it can fast-forward the slot to whatever
> the new LSN is, and any WAL files before that point are irrelevant on
> both the master and the standby.
>

OK, so you're envisioning that every slot on a downstream has a mirror slot
on the upstream, and that is how the master retains the needed resources.

> > Also, what about cascading? Lots of "pull" model designs I've looked at
> > tend to fall down in cascaded environments. For that matter so do
> > failover slots, but only for the narrower restriction of not being able
> > to actually decode from a failover-enabled slot on a standby; they still
> > work fine in terms of cascading down to leaf nodes.
>
> I don't see the problem. The cascaded standby tells the standby "I'm
> interested in the slot called 'craig'" and the standby says "sure,
> I'll tell you whenever 'craig' gets updated" but it turns out that
> 'craig' is actually a failover slot on that standby, so that standby
> has said to the master "I'm interested in the slot called 'craig'" and
> the master is therefore sending updates to that standby. Every time
> the slot is updated, the master tells the standby and the standby
> tells the cascaded standby and, well, that all seems fine.
>

Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
correspondence, and using globally unique slot names to manage state.

If slot names collide, you presumably fail with "er, don't do that then".
Or scramble data horribly. We certainly have precedent for both in Pg (see,
e.g., what happens if two snapshots of the same node are in archive
recovery and promote to the same timeline, then start archiving to the same
destination...). So not a showstopper.

I'm pretty OK with that.

> Also, as Andres pointed out upthread, if the state is passed through
> the protocol, you can have a slot on a standby that cascades to a
> cascaded standby; if the state is passed through the WAL, all slots
> have to cascade from the master.

Yes, that's my main hesitation with the current failover slots, as
mentioned in the prior message.

> Generally, with protocol-mediated
> failover slots, you can have a different set of slots on every replica
> in the cluster and create, drop, and reconfigure them any time you
> like. With WAL-mediated slots, all failover slots must come from the
> master and cascade to every standby you've got, which is less
> flexible.
>

Definitely agreed.

Different standbys don't know about each other, so it's the user's job to
ensure uniqueness, using slot name as a key.

> I don't want to come on too strong here. I'm very willing to admit
> that you may know a lot more about this than me and I am really
> extremely happy to benefit from that accumulated knowledge.

The flip side is that I've also been staring at the problem, on and off,
for WAY too long. So other perspectives can be really valuable.

> If you're
> saying that WAL-mediated slots are a lot better than protocol-mediated
> slots, you may well be right, but I don't yet understand the reasons,
> and I want to understand the reasons. I think this stuff is too
> important to just have one person saying "here's a patch that does it
> this way" and everybody else just says "uh, ok". Once we adopt some
> proposal here we're going to have to continue supporting it forever,
> so it seems like we'd better do our best to get it right.

I mostly agree there. We could have converted WAL-based failover slots to
something else relatively easily in a major version bump, which is why I
wanted to get them in place for 9.6 and then later for pg10: people were
(and are) constantly asking me and others who work on logical replication
tools why it doesn't work, and a 90% solution that doesn't paint us into a
corner seemed just fine.

I'm quite happy to find a better one. But I cannot spend a lot of time
writing something only to have it knocked back because the scope just got
increased again and now it has to do more, so it needs another rewrite.

So, how should this look if we're using the streaming rep protocol?

How about:

A "failover slot" is identified by a field in the slot struct and exposed
in pg_replication_slots. It can be null (not a failover slot). It can
indicate that the slot was created locally and is "owned" by this node; all
downstreams should mirror it. It can also indicate that it is a mirror of
an upstream, in which case clients may not replay from it until it's
promoted to an owned slot and ceases to be mirrored. Attempts to replay
from a mirrored slot just ERROR and will do so even once decoding on
standby is supported.

This promotion happens automatically if a standby is promoted to a master,
and can also be done manually via SQL function call or walsender command to
allow for an internal promotion within a cascading replica chain.

When a replica connects to an upstream it asks via a new walsender msg
"send me the state of all your failover slots". Any local mirror slots are
updated. Any mirror slot not listed by the upstream is known to have been
dropped, so the corresponding mirror slot is dropped on the downstream.
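
A toy sketch of that connect-time reconciliation, just to pin down the
intended behaviour; every type and helper below is an invented stand-in,
not a server API:

/* Illustration only: reconcile local down-mirror slots against the list
 * the upstream reports at connect time. Mirrors the upstream no longer
 * reports were dropped upstream, so drop them locally too; the rest will
 * be advanced by later slot state updates. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct MirrorSlot
{
    char name[64];              /* NAMEDATALEN-sized, as on the server */
} MirrorSlot;

/* Placeholder for "drop the local down-mirror slot". */
static void
drop_local_mirror(const MirrorSlot *slot)
{
    printf("dropping stale mirror slot \"%s\"\n", slot->name);
}

static void
sync_mirror_slots(MirrorSlot *local, int nlocal,
                  const char (*reported)[64], int nreported)
{
    for (int i = 0; i < nlocal; i++)
    {
        bool found = false;

        for (int r = 0; r < nreported; r++)
            if (strcmp(local[i].name, reported[r]) == 0)
            {
                found = true;
                break;
            }

        if (!found)
            drop_local_mirror(&local[i]);
    }
}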

The upstream walsender then sends periodic slot state updates while
connected, so replicas can advance their mirror slots, and in turn send
hot_standby_feedback that gets applied to the physical replication slot
used by the standby, freeing resources held for the slots on the master.
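
And a guess at what each per-slot state update might carry; this is not an
existing protocol message, just the fields a logical slot needs in order to
resume safely, with simplified stand-in types:

/* Hypothetical contents of a periodic failover-slot state update from the
 * upstream walsender. The field set mirrors pg_replication_slots; the
 * typedefs are stand-ins for the server's XLogRecPtr/TransactionId. */
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

typedef struct FailoverSlotUpdate
{
    char          slot_name[64];     /* NAMEDATALEN-sized identifier */
    XLogRecPtr    restart_lsn;       /* oldest WAL the slot still needs */
    XLogRecPtr    confirmed_flush;   /* position confirmed by the client */
    TransactionId xmin;              /* data xmin held, if any */
    TransactionId catalog_xmin;      /* catalog xmin held for decoding */
} FailoverSlotUpdate;

The downstream applies these to its mirror slots and then reports the
resulting xmin/catalog_xmin back via hot_standby_feedback as usual.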

There's one big hole left here. When we create a slot on a cascading leaf
or inner node, it takes time for hot_standby_feedback to propagate the
needed catalog_xmin "up" the chain. Until the master has set the needed
catalog_xmin on the physical slot for the closest branch, the inner node's
slot's catalog_xmin can only be tentative pending confirmation. That's what
a whole bunch of gruesomeness in the decoding on standby patch was about.

One possible solution to this is to also mirror slots "up", as you alluded
to: when you create an "owned" slot on a replica, it tells the master at
connect time / slot creation time "I have this slot X, please copy it up
the tree". The slot gets copied "up" to the master via cascading layers
with a different failover slot type indicating it's an up-mirror. Decoding
clients aren't allowed to replay from an up-mirror slot, and it cannot be
promoted like a down-mirror slot can; it's only there for resource
retention. A node knows its owned slot is safe to actually use, and is
fully created, when it sees the walsender report it in the list of failover
slots from the master during a slot state update.

This imposes some restrictions:

* failover slot names must be globally unique or things go "kaboom"
* if a replica goes away, its up-mirror slots stay dangling until the admin
manually cleans them up

Tolerable, IMO. But we could fix the latter by requiring that failover
slots only be enabled when the replica uses a physical slot to talk to the
upstream. The up-mirror failover slots then get coupled to the physical
slot by an extra field in the slot struct holding the name of the owning
physical slot. Dropping that physical slot cascade-drops all up-mirror
slots automatically. Admins are prevented from dropping up-mirror slots
manually, which protects against screwups.

We could even fix the naming, maybe, with some kind of qualified naming
based on the physical slot, but it's not worth the complexity.

It sounds a bit more complex than your sketch, but I think the four
failover kinds are necessary to support this. We'll have (rough struct
sketch after the list):

* not a failover slot, purely local

* a failover slot owned by this node (will be usable for decoding on
standby once supported)

* an up-mirror slot, not promotable, resource retention only, linked to a
physical slot for a given replica

* a down-mirror slot, promotable, not linked to a physical slot; this is
the true "failover slot"'s representation on a replica.

Thoughts? Feels pretty viable to me.

Thanks for the new perspective.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
