Re: Timeline following for logical slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Timeline following for logical slots
Date: 2016-04-01 05:16:18
Message-ID: CAMsr+YGOVpFdh1Xr4j8pAUncLbHR1bHkDX4Q+NnODj-h4hh8kQ@mail.gmail.com
Lists: pgsql-hackers

On 31 March 2016 at 16:09, Andres Freund <andres(at)anarazel(dot)de> wrote:

> > This gives us an option for failover of logical replication in 9.6, even
> > if it's a bit cumbersome and complex for the client, in case failover
> > slots don't make the cut. And, of course, it's a pre-req for failover
> > slots, which I'll rebase on top of it shortly.
>
> FWIW, I think it's dangerous to use this that way. If people manipulate
> slots that way we'll have hellishly hard to debug issues.

I think it's pretty unsafe from SQL, to be sure.

Unless failover slots get into 9.6 we'll need to do exactly that from
internal C code in pglogical to support following physical failover,
though. I'm not thrilled by that, especially since in 9.6 we have no way to
insert generic WAL records in xlog to allow the slot updates to accurately
pace replay. I don't mean logical WAL messages here; they wouldn't help.
It's pluggable generic WAL that we'd need in order to sync slots fully
consistently in the absence of failover slots.

However: it's safe for the slot state on the replica to be somewhat behind
the local replay from the master, i.e. for the slot state on the replica to
be older than it would have been at the equivalent point in time on the
master, so long as it isn't so far behind that the replica has replayed
vacuum activity that removes rows still required by the slot's declared
catalog_xmin. Even then it should be OK in practice, because the
failing-over client will always have replayed past that point on the master
(otherwise catalog_xmin couldn't have advanced on the master), so it'll
always ask to start replay at or after the lsn where catalog_xmin was
advanced to its most recent value on the old master before failover. It's
only safe for walsender-based decoding though, since there's no way to
specify the startpoint for SQL-level decoding.
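
For what it's worth, here's a minimal sketch of what I mean by the
walsender case, with the client supplying its own startpoint when it
reconnects to the promoted replica. The connection string, slot name and
LSN are placeholders, and a real client (pg_recvlogical, or pglogical's
apply worker) obviously does a great deal more around this:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/*
	 * Placeholder conninfo: "replication=database" asks for a walsender
	 * connection that can run logical decoding commands.
	 */
	PGconn	   *conn = PQconnectdb("host=promoted-replica dbname=postgres replication=database");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/*
	 * The client supplies the LSN it has already flushed locally as the
	 * explicit startpoint.  After failover that's at or past the lsn
	 * where catalog_xmin last advanced on the old master, which is what
	 * makes the (possibly lagging) copied slot state safe to resume
	 * from.  Slot name and LSN are placeholders.
	 */
	res = PQexec(conn, "START_REPLICATION SLOT \"my_slot\" LOGICAL 0/1234568");
	if (PQresultStatus(res) != PGRES_COPY_BOTH)
		fprintf(stderr, "START_REPLICATION failed: %s", PQerrorMessage(conn));

	/* ... then stream with PQgetCopyData() and send standby feedback ... */

	PQclear(res);
	PQfinish(conn);
	return 0;
}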

My only real worry there is whether invalidations are a concern: if
internal replay starts at the restart_lsn and replays over a part of
history where the catalog entries have been vacuumed, before it starts
collecting rows to return to the decoding client at the requested decoding
startpoint, can that cause problems with invalidation processing? I haven't
got my head around what, exactly, logical decoding is doing with its
invalidations stuff yet. I didn't find any problems in testing, but I
wasn't sure how to create the conditions under which failures might occur.

It's also OK for the slot state to be *ahead* of the replica's replay
position, i.e. from the future. restart_lsn and catalog_xmin will be higher
than they were on the master at the same point in time, but that doesn't
matter at all, since the client will always be requesting to start from
that point or later, having replayed up to there on the old master before
failover.

It's even possible for the restart_lsn and catalog_xmin to be in the
replica's future, i.e. the lsn > the current replay lsn and the
catalog_xmin greater than the next xid. In that case we know the logical
client replayed state from the old master that hasn't been applied to the
replica by the time it's promoted. If this state of affairs exists when we
promote a replica to a master we should really kill the slot, since the
client has obviously replayed from the old master past the point of
divergence with the promoted replica and there's a potentially an
unresolvable timeline fork. I'd like to add this if I can manage it,
probably by WARNing a complaint then dropping the slot.
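
Roughly the kind of check I have in mind, as a sketch only. Where it would
hook into promotion, and how the slot actually gets dropped, are
hand-waved, and none of the names here are from a real patch:

#include "postgres.h"

#include "access/transam.h"
#include "access/xlog.h"
#include "replication/slot.h"

/*
 * Sketch: decide whether a logical slot copied from the old master is in
 * the promoted replica's "future".  If its restart_lsn is past our replay
 * position, or its catalog_xmin is beyond our next xid, the decoding
 * client must have consumed changes we never replayed, so the slot can't
 * be used safely after promotion.
 */
static bool
slot_is_from_diverged_master(ReplicationSlot *slot)
{
	XLogRecPtr		replay_lsn = GetXLogReplayRecPtr(NULL);
	TransactionId	next_xid = ReadNewTransactionId();

	if (slot->data.restart_lsn > replay_lsn ||
		(TransactionIdIsValid(slot->data.catalog_xmin) &&
		 TransactionIdFollows(slot->data.catalog_xmin, next_xid)))
	{
		ereport(WARNING,
				(errmsg("logical slot \"%s\" is ahead of the promoted standby and cannot be used",
						NameStr(slot->data.name))));
		return true;			/* caller would drop the slot here */
	}

	return false;
}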

Needless to say I'd really rather just have failover slots, where we know
the slot will be consistent with the DB state and updated in sync with it.
This is a fallback plan and I don't like it too much.

> The test code needs
> a big disclaimer to never ever be used in production, and we should
> "disclaim any warranty" if somebody does that. To the point of not
> fixing issues around it in back branches.
>

I agree. The README has just that, and there are comments above the
individual functions like:

* Note that this is test harness code. You shouldn't expose slot internals
* to SQL like this for any real world usage. See the README.

> > Andres, I tried to address your comments as best I could. The main one
> > that I think stayed open was about the loop that finds the last timeline
> > on a segment. If you think that's better done by directly scanning the
> > List* of timeline history entries I'm happy to prep a follow-up.
>
> Have to look again.
>
> + * We start reading xlog from the restart lsn, even though in
> + * CreateDecodingContext we set the snapshot builder up using the
> + * slot's confirmed_flush. This means we might read xlog we don't
> + * actually decode rows from, but the snapshot builder might need it
> + * to get to a consistent point. The point we start returning data to
> + * *users* at is the confirmed_flush lsn set up in the decoding
> + * context.
> + */
> still seems pretty misleading - and pretty much unrelated to the
> callsite.
>

I think it's pretty confusing that the function uses the slot's restart_lsn
and never refers to confirmed_flush, which is where decoding actually
starts from as far as the user is concerned. It took me a while to figure
out what was going on there.

I think the comment would be less necessary, and could be moved up to the
CreateDecodingContext call, if we passed the slot's confirmed_flush to
CreateDecodingContext explicitly instead of passing InvalidXLogRecPtr and
having it infer the start point. That way what's going on would be a lot
clearer when reading the code.
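
i.e. something like this at the pg_logical_slot_get_changes() call site.
I'm paraphrasing the surrounding arguments from memory, so treat this as a
sketch of the idea rather than a diff:

	/*
	 * Sketch of the suggestion: pass the slot's confirmed_flush explicitly
	 * instead of InvalidXLogRecPtr, so a reader of this call site can see
	 * that decoded changes are returned to the user starting at
	 * confirmed_flush even though the xlogreader below starts fetching WAL
	 * at restart_lsn.  The slot is already acquired at this point, so
	 * MyReplicationSlot is valid.
	 */
	ctx = CreateDecodingContext(MyReplicationSlot->data.confirmed_flush,
								options,
								logical_read_local_xlog_page,
								LogicalOutputPrepareWrite,
								LogicalOutputWrite);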

You're thinking from the perspective of someone who knows this code
intimately. To you it's obvious that restart_lsn controls where the
xlogreader starts processing and feeding into the snapshot builder.
Similarly, it'll be obvious that the decoding context defaults to the
confirmed_flush point, which is where we start returning decoded rows to
users. This really will NOT be obvious to most readers though, and I think
it would be worth making the logical decoding code easier to understand as
it gets more and more use. I'm trying to help with that, and while I suffer
from not knowing the code as well as you do, that also means I can still
see some of the things that might be confusing to newer readers. Then I get
them wrong when explaining them ;)

Speaking of which, did you see the proposed README I sent for
src/backend/replication/logical ?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
