Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 11:49:08
Message-ID: CAMsr+YHYV78q_8gDKOgTNZQD9Lrfwa=5E0kOfFbrjjTDHhX+4A@mail.gmail.com
Lists: pgsql-hackers

On 6 April 2016 at 17:43, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 25 January 2016 at 14:25, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
>
>> I'd like to get failover slots in place for 9.6 since they're fairly
>> self-contained and meet an immediate need: allowing replication using slots
>> (physical or logical) to follow a failover event.
>>
>
> I'm a bit confused about this now.
>
> We seem to have timeline following, yet no failover slot. How do we now
> follow a failover event?
>
> There are many and varied users of logical decoding now and a fix is
> critically important for 9.6.
>

I agree with you, but I haven't been able to convince enough people of that.

> Do all decoding plugins need to write their own support code??
>

We'll be able to write a bgworker based extension that handles it by
running in the standby. So no, I don't think so.

> Please explain how we cope without this, so if a problem remains we can
> fix by the freeze.
>

The TL;DR: Create a slot on the master to hold catalog_xmin where the
replica needs it. Advance it using client or bgworker on replica based on
the catalog_xmin of the oldest slot on the replica. Copy slot state from
the master using an extension that keeps the slots on the replica
reasonably up to date.

All of this is an ugly workaround for not having true slot failover support.
I'm not going to pretend it's nice, or anything that should go anywhere
near core. Petr outlined the approach we want to take for core in 9.7 on
the logical timeline following thread.

Details:

Logical decoding on a slot can follow timeline switches now - or rather,
the xlogreader knows how to follow timeline switches, and the read page
callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas, so
they're lost on failover promotion.

Nor can a client just create a backup slot for itself on the replica to
be ready for failover:

- it has no way to create a new slot at a consistent point on the replica
since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since
decoding isn't permitted on a replica, so it can't just decode from the
replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still
needed by the slot's catalog_xmin since catalog_xmin isn't propagated from
standby to master.

So we have to help the client out. To do so, we have a
function/worker/whatever on the replica that grabs the slot state from the
master and copies it to the replica, and we have to hold the master's
catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical
slot on the master, maintained by a function/bgworker/whatever on the
replica. It gets advanced so that its restart_lsn and catalog_xmin are
those of the oldest slot on the replica. We can do that by requesting
replay on it up to the lowest confirmed_lsn of any slot on the replica.
Ugly, but workable. Or we can abuse the infrastructure more deeply
by simply setting the catalog_xmin and restart_lsn on the slot directly,
but I'd rather not.
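
To make the "oldest slot on the replica" computation concrete, here is a
minimal sketch of how a helper might pick the target for the dummy slot. It
is illustrative only: SlotState, dummy_slot_target and the sample values are
hypothetical names, not anything in PostgreSQL. It assumes slot state has
already been read into plain Python values, and it compares xids modulo 2^32
while ignoring the special (non-normal) xids.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' (two hex halves) to a 64-bit int."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Convert a 64-bit LSN back to the 'hi/lo' hex form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def xid_precedes(a: int, b: int) -> bool:
    """True if xid a is logically older than xid b, allowing for 32-bit
    wraparound. Assumption: both are normal xids."""
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


@dataclass
class SlotState:  # hypothetical container for one replica slot
    name: str
    restart_lsn: str
    catalog_xmin: int


def dummy_slot_target(replica_slots: Iterable[SlotState]) -> Tuple[str, int]:
    """Return the (restart_lsn, catalog_xmin) the master's dummy slot must
    not advance past: the oldest of each across all replica slots."""
    slots = list(replica_slots)
    if not slots:
        raise ValueError("no slots on the replica to protect")
    oldest_lsn = min(parse_lsn(s.restart_lsn) for s in slots)
    oldest_xmin = slots[0].catalog_xmin
    for s in slots[1:]:
        if xid_precedes(s.catalog_xmin, oldest_xmin):
            oldest_xmin = s.catalog_xmin
    return format_lsn(oldest_lsn), oldest_xmin
```

Note that the oldest restart_lsn and the oldest catalog_xmin need not come
from the same slot, so the sketch takes each minimum independently.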

Just copying slot state is pretty simple too, as at the C level you can
create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which
will appear to work fine but will be subtly broken. Since the replica is an
identical copy of the master we know that a slot state that was valid on
the master at a given xlog insert lsn is also valid on the replica at the
same replay lsn, but we've got no reliable way to ensure that when the
master updates a slot at LSN A/B the replica also updates the slot at
replay of LSN A/B. That's what failover slots did. Without that we need to
use some external channel - but there's no way to capture knowledge of "at
exactly LSN A/B, master saved a new copy of slot X" since we can't hook
ReplicationSlotSave(). At least we *can* now inject slot state updates as
generic WAL messages though, so we can ensure they happen at exactly the
desired point in replay.

As Andres explained on the timeline following thread, it's not safe for the
slot on the replica to be behind the state the slot on the master was in at
the same LSN. At least unless we can protect catalog_xmin via some other
mechanism so we can make sure no catalogs still needed by the slots on the
replica are vacuumed away. It's vital that the catalog_xmin of any slot on
the replica be >= the lowest catalog_xmin the master had across all of its
slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice
that a slot has changed we'll dump it into xlog via the generic xlog
mechanism to be applied on the replica, much like failover slots. The slot
update might arrive a bit late on the replica, but that's OK because we're
holding catalog_xmin pinned on the master using the dummy slot.
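
The polling step above could be sketched roughly like this. It is only a
model of the change-detection logic, not the extension itself: SlotState,
poll_once and the emit callback are hypothetical names, and emit stands in
for whatever actually writes the slot state into xlog. How the master's slot
shmem gets read into the "current" dict is left out entirely.

```python
from typing import Callable, Dict, List, NamedTuple


class SlotState(NamedTuple):  # hypothetical snapshot of one slot
    restart_lsn: str
    confirmed_lsn: str
    catalog_xmin: int


def poll_once(
    current: Dict[str, SlotState],
    last_seen: Dict[str, SlotState],
    emit: Callable[[str, SlotState], None],
) -> List[str]:
    """Compare the slots seen on this poll with the previous poll and call
    emit() for each new or changed slot. Returns the names emitted."""
    changed = []
    for name, state in current.items():
        if last_seen.get(name) != state:
            emit(name, state)          # assumption: this writes to xlog
            last_seen[name] = state
            changed.append(name)
    # Forget slots that no longer exist. A real implementation would also
    # have to propagate the drop to the replica; this sketch does not.
    for name in list(last_seen):
        if name not in current:
            del last_seen[name]
    return changed
```

Called in a loop with a sleep between polls, this emits at most one update
per slot per poll. A slot that changes several times between polls shows up
only in its latest state, which is the "might arrive a bit late" case that
the dummy slot on the master is there to cover.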

I don't like it, but I don't have anything better for 9.6.

I'd really like to be able to build a more solid proof of concept that
tests this with a lagging replica, but -ENOTIME before FF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
