Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-08 08:00:43
Message-ID: CAMsr+YH0y2V9615s7Aeedzs3DvgrWBt28LP2K3FQi6uKKfJjMw@mail.gmail.com
Lists: pgsql-hackers

On 3 August 2017 at 04:35, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> > No. The whole approach seems to have been bounced from core. I don't
> > agree and continue to think this functionality is desirable but I don't
> > get to make that call.
>
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core. In a review of this
> thread I don't see anyone saying otherwise. The debate has really
> been about the right way of implementing that. Suppose we did
> something like this:
>
> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
>

Wouldn't that immediately exclude use for PITR and snapshot recovery? I
have people right now who want the ability to promote a PITR-recovered
snapshot into the place of a logical replication master and have downstream
peers replay from it. It's more complex than that, since there's a resync
process required to recover changes the failed node had sent to other peers
but that aren't available in the WAL archive, but that's the gist.

If you have a 5TB database, do you really want to run an extra replica or
two just because PostgreSQL can't preserve slots without a running, live
replica? Your SAN snapshots + WAL archiving have been fine for everything
else so far.
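To put that concretely: slot state isn't WAL-logged, so a node restored
from a snapshot plus the WAL archive and then promoted doesn't have current
logical slots for its downstream peers. Roughly, with a made-up slot name:

    -- On the original master: the downstream peer's slot is present and
    -- holding the resources that peer needs.
    SELECT slot_name, slot_type, restart_lsn, catalog_xmin, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'peer_a';

    -- On the restored-and-promoted node: WAL replay can't recreate or
    -- advance slot state, so the same query comes back empty or with a
    -- uselessly stale copy, and the peer has nothing to resume from.
    SELECT slot_name, slot_type, restart_lsn, catalog_xmin, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'peer_a';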

Requiring live replication connections could also be an issue during service
interruptions, surely? Unless you persist the needed knowledge in the physical
replication slot used by the standby-to-master connection, so the master
can tell the difference between "downstream went away for a while but will
come back" and "downstream is gone forever, toss out its resources."

That's exactly what the catalog_xmin hot_standby_feedback patches in Pg10
do, but they can only tell the master about the oldest resources needed by
any existing slot on the replica. Not which slots. And they have the same
issues with needing a live, running replica.
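In other words, with a Pg10 standby streaming over a physical slot and
hot_standby_feedback = on, everything the master ever learns is visible on
that one slot (slot name made up here):

    -- On the master: the standby's feedback is folded into the single
    -- physical slot it streams over.
    SELECT slot_name, slot_type, xmin, catalog_xmin
    FROM pg_replication_slots
    WHERE slot_name = 'standby1_phys';

    -- catalog_xmin there is only the oldest value needed by *any* logical
    -- slot on that standby; there's no record of which slots need it, and
    -- it only advances while the standby is connected and sending feedback.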

Also, what about cascading? Lots of "pull" model designs I've looked at
tend to fall down in cascaded environments. For that matter, so do failover
slots, but only in the narrower sense that you can't actually decode from a
failover-enabled slot on a standby; they still work fine in terms of
cascading down to leaf nodes.

> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.
>

That's pretty much what I expect to have to do for clients to work on
unpatched Pg10, probably using a separate bgworker and normal libpq
connections to the upstream since we don't have hooks to extend the
walsender/walreceiver.

It can work now that the catalog_xmin hot_standby_feedback patches are in,
but it'd require some low-level slot state setting that I know Andres is
not a fan of. So I expect to carry on relying on an out-of-tree failover
slots patch for Pg 10.
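The shape of that workaround, roughly: a worker on the standby polls the
upstream over an ordinary libpq connection for the slots it cares about
(slot names made up here), e.g.

    -- Run periodically against the upstream:
    SELECT slot_name, restart_lsn, confirmed_flush_lsn, catalog_xmin
    FROM pg_replication_slots
    WHERE slot_type = 'logical'
      AND slot_name IN ('peer_a', 'peer_b');

then applies those positions to its local slot copies. That second half is
exactly the low-level slot state fiddling Pg10 has no supported interface
for.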

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above. It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.
>
> Thoughts?

Right. So the standby must be running and in active communication with the
master. It needs some way to know the master has confirmed slot creation, so
that it can rely on the slot's resources really being reserved by the master.
That turns out to be quite hard, per the decoding-on-standby patches. There
also needs to be some way to tell the master a standby has gone away forever
so it can drop the dependent slots, so you're not stuck wondering "is slot
xxyz from standby abc that we lost in that crash?". And standbys need to cope
with having created a slot, only to find out there's a name collision with
the master.
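The name-collision case, at least, is easy to picture at the SQL level (slot
name and plugin here are only examples):

    -- The standby asks the master to create its matching slot, but the
    -- name is already taken there:
    SELECT pg_create_logical_replication_slot('shared_name', 'test_decoding');
    -- ERROR:  replication slot "shared_name" already exists

and then somebody has to work out whose slot that really is and whether it's
safe to reuse or drop it.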

For all those reasons, I just extended hot_standby_feedback to report
catalog_xmin separately to upstreams instead, so the existing physical slot
serves all these needs. That's part of the picture, but there's still no way
to get slot position change info from the master back down onto the replicas
so that the replicas can advance their own slots and, via feedback, free
up master resources. That's where the bgworker hack to query
pg_replication_slots comes in. It seems complex, full of restrictions, and
fragile to me compared to just expecting the master to do it.

The only objection I personally understood and accepted re failover slots
was that it'd be impossible to create a failover slot on a standby and have
that standby "sub-tree" support failover to leaf nodes. Which is true, but
instead we have nothing, and no viable-looking roadmap toward anything users
can benefit from. So I don't think that's the worst restriction in the
world.

I do not understand why logical replication slots are exempt from our usual
policy that anything that works on the master should keep working after
failover to a standby. Is there anything else persistent across crashes for
which that's not the case, apart from grandfathered-in hash indexes? We're
hardly going to say "hey, it's OK to forget about prepared xacts when you
fail over to a standby", yet this problem with failover and slots in logical
decoding and replication is the same sort of showstopper for users who rely
on the functionality.

In the medium term I've given up on making progress toward getting something
simple and usable into users' hands here. A tweaked version of failover
slots is being carried as an out-of-tree, on-disk-format-compatible patch
instead, and it's meeting customer needs very well. I've done my dash here
and moved on to other things where I can make more progress.

I'd like to continue working on logical decoding on standby support for
Pg11 too, but even if we can get that in place it'll only work for
reachable, online standbys. Every application that uses logical decoding
will have to maintain a directory of standbys (which it has no way to ask
the master for) and advance their slots via extra walsender connections.
They'll do a bunch of unnecessary work decoding WAL they don't need, just
to throw the data away. And it won't help for the PITR and snapshot use
cases at all. So for now I'm not able to give that much priority.

I'd love to get failover slots in; I still think it's the simplest and best
way to do what users need. It doesn't stop us progressing with decoding on
standby, and it doesn't paint us into any corners.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
