Re: logical decoding and replication of sequences, take 2

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2022-11-17 17:07:16
Message-ID: 20221117170716.5da7set7fy3t6mfx@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> On 11/17/22 03:43, Andres Freund wrote:
> > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> >> Well, yeah - we can either try to perform the stuff independently of the
> >> transactions that triggered it, or we can try making it part of some of
> >> the transactions. Each of those options has problems, though :-(
> >>
> >> The first version of the patch tried the first approach, i.e. decode the
> >> increments and apply that independently. But:
> >>
> >> (a) What would you do with increments of sequences created/reset in a
> >> transaction? Can't apply those outside the transaction, because it
> >> might be rolled back (and that state is not visible on primary).
> >
> > I think a reasonable approach could be to actually perform different WAL
> > logging for that case. It'll require a bit of machinery, but could actually
> > result in *less* WAL logging overall, because we don't need to emit a WAL
> > record for each SEQ_LOG_VALS sequence values.
> >
>
> Could you elaborate? Hard to comment without knowing more ...
>
> My point was that stuff like this (creating a new sequence or at least a
> new relfilenode) means we can't apply that independently of the
> transaction (unlike regular increments). I'm not sure how a change to
> WAL logging would make that go away.

Different WAL logging would make it easy to handle that on the logical
decoding level. We don't need to emit a WAL record each time a
created-in-this-toplevel-xact sequence gets incremented, as those increments
won't persist anyway if the surrounding xact aborts. We already need to
remember the new relfilenode so it can be dropped at the end of the
transaction, so we could emit a single record for each such sequence at that
point.
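
Roughly something like the below, emitted during pre-commit - just a sketch,
and the names (pendingNewSequences, ReadSequenceStateForLogging,
XLogLogSequenceState) are made up for illustration, they don't exist:

/*
 * Sketch only, not existing code: at pre-commit, emit one WAL record per
 * sequence whose relfilenode was created in this top-level transaction,
 * instead of WAL-logging every SEQ_LOG_VALS increments along the way.
 */
static void
PreCommit_LogNewSequences(void)
{
    ListCell   *lc;

    /* pendingNewSequences: sequences whose relfilenode this xact created */
    foreach(lc, pendingNewSequences)
    {
        Oid         seqrelid = lfirst_oid(lc);
        int64       last_value;
        bool        is_called;

        /* read the sequence's current state (made-up helper) */
        ReadSequenceStateForLogging(seqrelid, &last_value, &is_called);

        /* one record describing the final state, decoded transactionally */
        XLogLogSequenceState(seqrelid, last_value, is_called);
    }
}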

> >> (b) What about increments created before we have a proper snapshot?
> >> There may be transactions dependent on the increment. This is what
> >> ultimately led to revert of the patch.
> >
> > I don't understand this - why would we ever need to process those increments
> > from before we have a snapshot? Wouldn't they, by definition, be before the
> > slot was active?
> >
> > To me this is the rough equivalent of logical decoding not giving the initial
> > state of all tables. You need some process outside of logical decoding to get
> > that (obviously we have some support for that via the exported data snapshot
> > during slot creation).
> >
>
> Which is what already happens during tablesync, no? We more or less copy
> sequences as if they were tables.

I think you might have to copy sequences after tables, but I'm not sure. But
otherwise, yea.

> > I assume that part of the initial sync would have to be a new sequence
> > synchronization step that reads all the sequence states on the publisher and
> > ensures that the subscriber sequences are at the same point. There's a bit of
> > trickiness there, but it seems entirely doable. The logical replication replay
> > support for sequences will have to be a bit careful about not decreasing the
> > subscriber's sequence values - the standby initially will be ahead of the
> > increments we'll see in the WAL. But that seems inevitable given the
> > non-transactional nature of sequences.
> >
>
> See fetch_sequence_data / copy_sequence in the patch. The bit about
> ensuring the sequence does not go backwards (say, using page LSN and/or LSN
> of the increment) is not there, however isn't that pretty much what I
> proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

Creating a new relfilenode whenever a sequence is incremented seems like a
complete no-go to me. That increases sequence overhead by several orders of
magnitude and will lead to *awful* catalog bloat on the subscriber.
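
(To make the point quoted above about not decreasing the subscriber's
sequence values a bit more concrete: on the apply side I'd expect roughly
this kind of guard - just a sketch, with made-up helper names.)

/*
 * Sketch only: apply a decoded sequence state only if it is ahead of what
 * the subscriber already has, so replay never moves a sequence backwards
 * (the initial sync may be ahead of increments still arriving via WAL).
 */
static void
apply_sequence_state(Oid seqrelid, int64 remote_last_value,
                     bool remote_is_called)
{
    int64       local_last_value;
    bool        local_is_called;

    /* read_local_sequence_state / set_local_sequence_state are made up */
    read_local_sequence_state(seqrelid, &local_last_value, &local_is_called);

    if (remote_last_value > local_last_value ||
        (remote_last_value == local_last_value &&
         remote_is_called && !local_is_called))
        set_local_sequence_state(seqrelid, remote_last_value,
                                 remote_is_called);

    /* otherwise the local sequence is already ahead - nothing to do */
}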

> >
> >> This version of the patch tries to do the opposite thing - make sure
> >> that the state after each commit matches what the transaction might have
> >> seen (for sequences it accessed). It's imperfect, because it might log a
> >> state generated "after" the sequence got accessed - it focuses on the
> >> guarantee not to generate duplicate values.
> >
> > That approach seems quite wrong to me.
> >
>
> Why? Because it might log a state for sequence as of COMMIT, when the
> transaction accessed the sequence much earlier?

Mainly because sequences aren't transactional, and trying to make them so will
require awful contortions.

While there are cases where we don't flush the WAL / wait for syncrep for
sequences, we do replicate their state correctly on physical replication. If
an LSN has been acknowledged as having been replicated, we won't just lose a
prior sequence increment after promotion, even if the transaction didn't [yet]
commit.

It's completely valid for an application to call nextval() in one transaction,
potentially even abort it, and then only use that sequence value in another
transaction.

> > I did some skimming of the referenced thread about the reversal of the last
> > approach, but I couldn't really understand what the fundamental issues were
> > with the reverted implementation - it's a very long thread and references
> > other threads.
> >
>
> Yes, it's long/complex, but I intentionally linked to a specific message
> which describes the issue ...
>
> It's entirely possible there is a simple fix for the issue, and I just
> got confused / unable to see the solution. The whole issue was due to
> having a mix of transactional and non-transactional cases, similarly to
> logical messages - and logicalmsg_decode() has the same issue, so maybe
> let's talk about that for a moment.
>
> See [3] and imagine you're dealing with a transactional message, but
> you're still building a consistent snapshot. So the first branch applies:
>
> if (transactional &&
>     !SnapBuildProcessChange(builder, xid, buf->origptr))
>     return;
>
> but because we don't have a snapshot, SnapBuildProcessChange does this:
>
> if (builder->state < SNAPBUILD_FULL_SNAPSHOT)
>     return false;

In this case we'd just return without further work in logicalmsg_decode(). The
problematic case presumably is when we have a full snapshot but aren't yet
consistent, and the xid is >= next_phase_at. Then SnapBuildProcessChange()
returns true. And we reach:

> which however means logicalmsg_decode() does
>
> snapshot = SnapBuildGetOrBuildSnapshot(builder);
>
> which crashes, because it hits this assert:
>
> Assert(builder->state == SNAPBUILD_CONSISTENT);

I think the problem here is just that we shouldn't even try to get a snapshot
in the transactional case - note that it's not even used in
ReorderBufferQueueMessage() for transactional messages. The transactional case
needs to behave like a "normal" change - we might never decode the message if
the transaction ends up committing before we've reached a consistent point.

> The sequence decoding did almost the same thing, with the same issue.
> Maybe the correct thing to do is to just ignore the change in this case?

No, I don't think that'd be correct - the message / sequence change needs to
be queued as part of the transaction. If the transaction ends up committing
after we've reached consistency, we'll get the correct snapshot from the base
snapshot set in SnapBuildProcessChange().
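
Concretely, I'd expect logicalmsg_decode() to end up looking roughly like
this (sketch, untested):

    Snapshot    snapshot = NULL;    /* rest of the declarations as today */

    /* transactional messages go through SnapBuildProcessChange() */
    if (transactional &&
        !SnapBuildProcessChange(builder, xid, buf->origptr))
        return;
    else if (!transactional &&
             (SnapBuildCurrentState(builder) != SNAPBUILD_CONSISTENT ||
              SnapBuildXactNeedsSkip(builder, buf->origptr)))
        return;

    /*
     * Only the non-transactional path needs a snapshot here - a
     * transactional message will use the transaction's base snapshot when
     * it's replayed at commit.
     */
    if (!transactional)
        snapshot = SnapBuildGetOrBuildSnapshot(builder);

    ReorderBufferQueueMessage(ctx->reorder, xid, snapshot, buf->endptr,
                              message->transactional,
                              message->message, /* first part is the prefix */
                              message->message_size,
                              message->message + message->prefix_size);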

Greetings,

Andres Freund
