Re: Minimal logical decoding on standbys

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2021-10-28 21:07:55
Message-ID: 20211028210755.afmwcvpo6ajwdx6n@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot(at)amazon(dot)com> wrote:
> > So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
> > RelationIsAccessibleInLogicalDecoding()?
>
> I think 0001 is utterly unacceptable. We cannot add calls to
> table_open() in low-level functions like this. Suppose for example
> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
> a buffer lock on the page. The idea that it's safe to call
> table_open() while holding a buffer lock cannot be taken seriously.

Yes - that's pretty clearly a deadlock hazard. It shouldn't too hard to fix, I
think. Possibly a bit more verbose than nice, but...

Alternatively we could propagate the information whether a relcache entry is
for a catalog from the table to the index. Then we'd not need to change the
btree code to pass the table down.

> That could do arbitrary amounts of work taking any number of other
> buffer locks, which could easily deadlock (and the deadlock detector
> wouldn't help, since these are lwlocks). Even if that were no issue,
> we really, really do not want to write code that could result in large
> numbers of additional calls to table_open() -- and _bt_getbuf() is
> certainly a frequently-used function.

The BTPageIsRecyclable() path hopefully less so. Not that that makes it OK.

> I think that, in order to have
> any chance of being acceptable, this would need to be restructured so
> that it pulls data from an existing relcache entry that is known to be
> valid, without attempting to create a new one. That is,
> get_rel_logical_decoding() would need to take a Relation argument, not
> an OID.

Hm? Once we have a relation we don't really need the helper function anymore.

> I also think it's super-weird that the value being logged is computed
> using RelationIsAccessibleInLogicalDecoding(). That means that if
> wal_level < logical, we'll set onCatalogTable = false in the xlog
> record, regardless of whether that's true or not. Now I suppose it
> won't matter, because presumably this field is only going to be
> consulted for whatever purpose when logical replication is active, but
> I object on principle to the idea of a field whose name suggests that
> it means one thing and whose value is inconsistent with that
> interpretation.

Hm. Not sure what a good solution for this is. I don't think we should make
the field independent of wal_level - it doesn't really mean anything with a
lower wal_level. And it increases the illusion that the table is guaranteed to
be a system table or something a bit. Perhaps the field name should hint at
this being logically decoding related?

> I also notice that 0003 deletes a comment that says "We need to force
> hot_standby_feedback to be enabled at all times so the primary cannot
> remove rows we need," but also that this is the only mention of
> hot_standby_feedback in the entire patch set. If the existing comment
> that we need to do something about that is incorrect, we should update
> it independently of this patch set to be correct. But if the existing
> comment is correct then there ought to be something in the patch that
> deals with it.

The patch deals with this - we'll detect the removal of row versions that
aren't needed anymore and stop decoding. Of course you'll most of the time
want to use hs_feedback, but sometimes it'll also just be a companion slot on
the primary or such (think slots for failover or such).

> Another part of that same deleted comment says "We need to be able to
> correctly and quickly identify the timeline LSN belongs to," but I
> don't see what the patch does about that, either. I'm actually not
> sure exactly what that's talking about

Hm - could you expand on what you're unclear about re LSN->timeline? It's just
that we need to read a WAL page for a certain LSN, and for that we need the
timeline?

> , but today for unrelated
> reasons I happened to be looking at logical_read_xlog_page(), which is
> actually what caused me to look at this thread. In that function we
> have, as the first two lines of executable code:
>
> XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
> sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
>
> The second line of code depends on the value of ThisTimeLineID. The
> first line of code does too, because XLogReadDetermineTimeline() uses
> that variable internally. If logical decoding is only allowed on a
> primary, then there can't really be an issue here, because we will
> have checked RecoveryInProgress() in
> CheckLogicalDecodingRequirements() and ThisTimeLineID will have its
> final value. But on a standby, I'm not sure that ThisTimeLineID even
> has to be initialized here, and I really can't see any reason at all
> why the value it contains is necessarily still current.

I think the code tries to deal with this via XLogReadDetermineTimeline(),
which limits up to where WAL is valid on the current timeline, based on the
timeline history file. But as you say, it does rely on ThisTimeLineID for
that, and it's not obvious why it's likely current, let alone guaranteed to be
current.

> This function's sister, read_local_xlog_page(), contains a bunch of logic
> that tries to make sure that we're always reading every record from the
> right timeline, but there's nothing similar here. I think that would likely
> have to be fixed in order for decoding to work on standbys, but maybe I'm
> missing something.

I think that part actually works, afaict they both rely on the same
XLogReadDetermineTimeline() for that job afaict. What might be missing is
logic to update the target timeline.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-10-28 21:19:10 Re: dummy relation in partitionwise join
Previous Message Robert Haas 2021-10-28 21:05:50 Re: Correct error message for end-of-recovery record TLI