Re: Minimal logical decoding on standbys

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2018-12-14 00:55:21
Message-ID: 20181214005521.jaty2d24lz4nroil@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2018-12-13 19:32:19 -0500, Robert Haas wrote:
> On Wed, Dec 12, 2018 at 3:41 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I don't like the approach of managing the catalog horizon via those
> > periodically logged catalog xmin announcements. I think we instead
> > should build ontop of the records we already have and use to compute
> > snapshot conflicts. As of HEAD we don't know whether such tables are
> > catalog tables, but that's just a bool that we need to include in the
> > records, a basically immeasurable overhead given the size of those
> > records.
>
> To me, this paragraph appears to say that you don't like Craig's
> approach without quite explaining why you don't like it. Could you be
> a bit more explicit about that?

I think the conflict system introduced in Craig's patch is quite
complicated, relies on logging new wal records on a regular basis, adds
needs to be more conservative about the xmin horizon, which is obviously
not great for performance.

If you look at Craig's patch, it currently relies on blocking out
concurrent checkpoints:
/*
* We must prevent a concurrent checkpoint, otherwise the catalog xmin
* advance xlog record with the new value might be written before the
* checkpoint but the checkpoint may still see the old
* oldestCatalogXmin value.
*/
if (!LWLockConditionalAcquire(CheckpointLock, LW_SHARED))
/* Couldn't get checkpointer lock; will retry later */
return;
which on its own seems unacceptable, given that CheckpointLock can be
held by checkpointer for a very long time. While that's ongoing the
catalog xmin horizon doesn't advance.

Looking at the code it seems hard, to me, to make that approach work
nicely. But I might just be tired.

> > I'm wondering if it's time to move the latestRemovedXid computation for
> > this type of record to the primary - it's likely to be cheaper there and
> > avoids this kind of complication. Secondarily, it'd have the advantage
> > of making pluggable storage integration easier - there we have the
> > problem that we don't know which type of relation we're dealing with
> > during recovery, so such lookups make pluggability harder (zheap just
> > adds extra flags to signal that, but that's not extensible).
>
> That doesn't look trivial. It seems like _bt_delitems_delete() would
> need to get an array of XIDs, but that gets called from
> _bt_vacuum_one_page(), which doesn't have that information available.
> It doesn't look like there is a particularly cheap way of getting it,
> either. What do you have in mind?

I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't call
can't determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).

Writing that message right now.

> > - enforce hot-standby to be enabled on the standby when logical slots
> > are created, and at startup if a logical slot exists
>
> Why do we need this?

Currently the conflict routines are only called when hot standby is
on. There's also no way to use logical decoding (including just advancing the
slot), without hot-standby being enabled, so I think that'd be a pretty
harmless restriction.

> > - Have a nicer conflict handling than what I implemented here. Craig's
> > approach deleted the slots, but I'm not sure I like that. Blocking
> > seems more appropriately here, after all it's likely that the
> > replication topology would be broken afterwards.
>
> I guess the viable options are approximately --

> (1) drop the slot

Doable.

> (2) advance the slot

That's not realistically possible, I think. We'd need to be able to use
most of the logical decoding infrastructure in that context, and we
don't have that available. It's also possible to deadlock, where
advancing the slot's xmin horizon would need further WAL, but WAL replay
is blocked on advancing the slot.

> (3) mark the slot as "failed" but leave it in existence as a tombstone

We currently don't have that, but it'd be doable, I think.

> (4) wait until something changes.
> (4) seems pretty unfortunate unless there's some other system for
> having the slot advance automatically. Seems like a way for
> replication to hang indefinitely without anybody understanding why
> it's happened (or, maybe, noticing).

On the other hand, it would often allow whatever user of the slot to
continue using it, till the conflict is "resolved". To me it seems about
as easy to debug physical replication being blocked, as somehow the slot
being magically deleted or marked as invalid.

Thanks for looking,

Andres Freund

Attachment Content-Type Size
index-page-vacuum-xid-horizon-primary.diff text/x-diff 21.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-12-14 01:21:12 Re: automatically assigning catalog toast oids
Previous Message Robert Haas 2018-12-14 00:32:19 Re: Minimal logical decoding on standbys