Re: logical decoding and replication of sequences, take 2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2022-11-17 16:04:52
Message-ID: CA+TgmoZp=-J8S2szkjD-eCLX23SjnUPnk-YKJJZZTsJ7k1bUpg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> There's a couple of caveats, though:
>
> 1) Maybe we should focus more on "actually observed" state instead of
> "observable". Who cares if the sequence moved forward in a transaction
> that was ultimately rolled back? No committed transaction should have
> observed those values - in a way, the last "valid" state of the sequence
> is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one
that is making changes to things.

I said "observable" rather than "actually observed" because we neither
know nor care whether someone actually ran a SELECT statement at any
given moment in time, just what they would have seen if they did.

> 2) I think what matters more is that we never generate duplicate values.
> That is, if you generate a value from a sequence, commit a transaction
> and replicate it, then the logical standby should not generate the same
> value from the sequence. This guarantee seems necessary for "failover"
> to logical standby.

I think that matters, but I don't think it's sufficient. We need to
preserve the order in which things appear to happen, and which changes
are and are not atomic, not just the final result.

> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
> (a) What would you do with increments of sequences created/reset in a
> transaction? Can't apply those outside the transaction, because it
> might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits,
it has to be replicated as part of the transaction. If I create a
sequence and then nextval it a bunch of times, I can't replicate that
by first creating the sequence, and then later, as a separate
operation, replicating the nextvals. If I do that, then there's an
intermediate state visible on the replica that was never visible on
the origin server. That's broken.
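The point can be made concrete with a small simulation (pure Python, nothing here is PostgreSQL code): if a sequence is created and advanced inside one transaction, replaying the creation and the advancements as separate steps lets a reader on the replica observe a state that was never visible on the origin.

```python
# Illustrative sketch: compare the states a reader could observe when a
# transaction's changes are replayed atomically (origin behavior) versus
# piecemeal (the broken replication strategy described above).

def visible_states(steps, atomic):
    """Return every state a concurrent reader could observe during replay."""
    state, seen = {}, [dict()]            # start from the empty state
    if atomic:
        for name, val in steps:
            state[name] = val
        seen.append(dict(state))          # everything appears at once
    else:
        for name, val in steps:
            state[name] = val
            seen.append(dict(state))      # each step is separately visible
    return seen

# CREATE SEQUENCE s; nextval('s'); nextval('s')  -- all in one transaction
tx = [("s", 1), ("s", 2), ("s", 3)]

origin = visible_states(tx, atomic=True)    # [{}, {'s': 3}]
replica = visible_states(tx, atomic=False)  # [{}, {'s': 1}, {'s': 2}, {'s': 3}]

# {'s': 1} was observable on the replica but never on the origin: broken.
assert {"s": 1} in replica and {"s": 1} not in origin
```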

> (b) What about increments created before we have a proper snapshot?
> There may be transactions dependent on the increment. This is what
> ultimately led to revert of the patch.

Whatever problem exists here is with the implementation, not the
concept. If you copy the initial state as it exists at some moment in
time to a replica, and then replicate all the changes that happen
afterward to that replica without messing up the order, the replica
WILL be in sync with the origin server. The things that happen before
you copy the initial state do not and cannot matter.

But what you're describing sounds like the changes aren't really
replicated in visibility order, and then it is easy to see how a
problem like this can happen. Because now, an operation that actually
became visible just before or just after the initial copy was taken
might be thought to belong on the other side of that boundary, and
then everything will break. And it sounds like that is what you are
describing.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like
it's trying to paper over the failure to do the replication properly
during the transaction by overwriting state at the end.

> Yes, this would mean we accept we may end up with something like this:
>
> 1: T1 logs sequence state S1
> 2: someone increments sequence
> 3: T2 logs sequence state S2
> 4: T2 commits
> 5: T1 commits
>
> which "inverts" the apply order of S1 vs. S2, because we first apply S2
> and then the "old" S1. But as long as we're smart enough to "discard"
> applying S1, I think that's acceptable - because it guarantees we'll not
> generate duplicate values (with values in the committed transaction).
>
> I'd also argue it does not actually generate invalid state, because once
> we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the
commit that immediately follows. However, I disagree with the idea of
discarding the second update on the grounds that it would make the
sequence go backward and we know that can't be right. That algorithm
works in the really specific case where the only operations are
increments. As soon as anyone does anything else to the sequence, such
an algorithm can no longer work. Nor can it work for objects that are
not sequences. The alternative strategy of replicating each change
exactly once and in the correct order works for all current and future
object types in all cases.
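A toy model of the "discard anything that would move the sequence backward" apply rule (illustrative only, not code from the patch) shows both halves of this argument: the rule handles out-of-order arrival of pure increments correctly, but silently drops a legitimate setval() that rewinds the sequence.

```python
# Sketch of the discard-if-older apply rule under discussion. Values here are
# made up; apply_discard_backward is a hypothetical helper, not a real API.

def apply_discard_backward(applied, new_value):
    """Apply a replicated sequence state, ignoring apparently 'old' values."""
    if applied is None or new_value > applied:
        return new_value
    return applied                # discarded as "stale"

# Pure increments: S2 (value 64) was logged after S1 (value 32) but arrives
# first; discarding S1 on apply gives the correct final state.
state = None
for s in [64, 32]:
    state = apply_discard_backward(state, s)
assert state == 64                # matches the origin

# Mixed operations: an explicit setval(seq, 10) is indistinguishable from a
# stale increment and gets wrongly discarded.
state = None
for s in [64, 10]:                # the second entry is a real rewind
    state = apply_discard_backward(state, s)
assert state == 64                # origin is at 10; the replica has diverged
```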

> > Your alternative proposal says "The other option might be to make
> > these messages non-transactional, in which case we'd separate the
> > ordering from COMMIT ordering, evading the reordering problem." But I
> > don't think that avoids the reordering problem at all.
>
> I don't understand why. Why would it not address the reordering issue?
>
> > Nor do I think it's correct.
>
> Nor do I understand this. I mean, isn't it essentially the option you
> mentioned earlier - treating the non-transactional actions as
> independent transactions? Yes, we'd be batching them so that we'd not
> see "intermediate" states, but those are not observed by abyone.

I don't think that batching them is a bad idea; in fact, I think it's
necessary. But those batches still have to be applied at the right
time relative to the sequence of commits.

> I'm confused. Isn't that pretty much exactly what I'm proposing? Imagine
> you have something like this:
>
> 1: T1 does something and also increments a sequence
> 2: T1 logs state of the sequence (right before commit)
> 3: T1 writes COMMIT
>
> Now when we decode/apply this, we end up doing this:
>
> 1: decode all T1 changes, stash them
> 2: decode the sequence state and apply it separately
> 3: decode COMMIT, apply all T1 changes
>
> There might be other transactions interleaving with this, but I think
> it'd behave correctly. What example would not work?

What if one of the other transactions renames the sequence, or changes
the current value, or does basically anything to it other than
nextval?

> The problem lies in how we log sequences. If we wrote each individual
> increment to WAL, it might work the way you propose (except for cases
> with sequences created in a transaction, etc.). But that's not what we
> do - we log sequence increments in batches of 32 values, and then only
> modify the sequence relfilenode.
>
> This works for physical replication, because the WAL describes the
> "next" state of the sequence (so if you do "SELECT * FROM sequence"
> you'll not see the same state, and the sequence value may "jump ahead"
> after a failover).
>
> But for logical replication this does not work, because the transaction
> might depend on a state created (WAL-logged) by some other transaction.
> And perhaps that transaction actually happened *before* we even built
> the first snapshot for decoding :-/

I agree that there's a problem here but I don't think that it's a huge
problem. I think that it's not QUITE right to think about what state
is visible on the primary. It's better to think about what state would
be visible on the primary if it crashed and restarted after writing
any given amount of WAL, or what would be visible on a physical
standby after replaying any given amount of WAL. If logical
replication mimics that, I think it's as correct as it needs to be. If
not, those other systems are broken, too.
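The batched logging described above, and the crash semantics it implies, can be sketched in a few lines (a toy model, not PostgreSQL code; the constant name mirrors the real SEQ_LOG_VALS but the rest is invented): nextval() WAL-logs the sequence only once per 32 values, recording a *future* value, so after a crash the sequence jumps ahead but never repeats a value it already handed out.

```python
# Toy model of batched sequence WAL logging and crash recovery.

SEQ_LOG_VALS = 32                 # log one WAL record per 32 fetched values

class ToySequence:
    def __init__(self):
        self.current = 0          # in-memory position
        self.logged = 0           # the value the last WAL record promised

    def nextval(self):
        self.current += 1
        if self.current > self.logged:
            # WAL-log the *future* state, 32 values ahead.
            self.logged = self.current + SEQ_LOG_VALS
        return self.current

    def crash_recover(self):
        # After a crash, only the WAL-logged value survives.
        self.current = self.logged

seq = ToySequence()
handed_out = [seq.nextval() for _ in range(5)]   # 1..5; WAL says 33
seq.crash_recover()
after_crash = seq.nextval()                      # 34: jumped ahead, no duplicate
assert after_crash > max(handed_out)
```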

So I think what should happen is that when we write a WAL record
saying that the sequence has been incremented by 32, that should be
logically replicated after all commits whose commit record precedes
that WAL record and before commits whose commit record follows that
WAL record. It is OK to merge the replication of that record into one
of either the immediately preceding or the immediately following
commit, but you can't do it as part of any other commit because then
you're changing the order of operations.

For instance, consider:

T1: BEGIN; INSERT; COMMIT;
T2: BEGIN; nextval('a_seq') causing a logged advancement to the sequence;
T3: BEGIN; nextval('b_seq') causing a logged advancement to the sequence;
T4: BEGIN; INSERT; COMMIT;
T2: COMMIT;
T3: COMMIT;

The sequence increments can be replicated as part of T1 or part of T4
or in between applying T1 and T4. They cannot be applied as part of T2
or T3. Otherwise, suppose T4 read the current value of one of those
sequences and included that value in the inserted row, and the target
table happened to be the sequence_value_at_end_of_period table. Then
imagine that after receiving the data for T4 and replicating it, the
primary server is hit by a meteor and the replica is promoted. Well,
it's now possible for some new transaction to get a smaller value from
that sequence than one that has already been written to the
sequence_value_at_end_of_period table, which will presumably break the
application.
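The ordering rule can be sketched as a small simulation (illustrative only; the record format and helper names are invented, not the patch's code): a non-transactional sequence-advance record must be applied after every commit record that precedes it in WAL and before every commit record that follows it, and merging it into the T2/T3 commits above reorders it past T4.

```python
# Replay the T1..T4 scenario two ways and compare the resulting apply order.

wal = [                        # (lsn, kind, payload) in WAL order
    (1, "commit", "T1"),
    (2, "seq_advance", "a_seq"),
    (3, "seq_advance", "b_seq"),
    (4, "commit", "T4"),
    (5, "commit", "T2"),
    (6, "commit", "T3"),
]

def replay_in_wal_order(records):
    # Correct: apply everything strictly by LSN.
    return [payload for _, _, payload in sorted(records)]

def replay_merged_into_owning_commit(records):
    # Wrong: ship each advance as part of the transaction that caused it
    # (a_seq with T2, b_seq with T3), so it applies at that commit instead.
    out = []
    for _, kind, payload in records:
        if kind == "commit":
            out.append(payload)
            if payload == "T2":
                out.append("a_seq")
            if payload == "T3":
                out.append("b_seq")
    return out

ok = replay_in_wal_order(wal)
bad = replay_merged_into_owning_commit(wal)
assert ok.index("a_seq") < ok.index("T4")    # advance lands before T4: correct
assert bad.index("a_seq") > bad.index("T4")  # reordered past T4: broken
```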

> > In particular, I think it's likely that the "non-transactional
> > messages" that you mention earlier don't get applied at the point in
> > the commit sequence where they were found in the WAL. Not sure why
> > exactly, but perhaps the point at which we're reading WAL runs ahead
> > of the decoding per se, or something like that, and thus those
> > non-transactional messages arrive too early relative to the commit
> > ordering. Possibly that could be changed, and they could be buffered
>
> I'm not sure which case of "non-transactional messages" this refers to,
> so I can't quite respond to these comments. Perhaps you mean the
> problems that killed the previous patch [1]?

In http://postgr.es/m/8bf1c518-b886-fe1b-5c42-09f9c663146d@enterprisedb.com
you said "The other option might be to make these messages
non-transactional". I was referring to that.

> > the transaction whose commit record occurs most nearly prior to, or
> > most nearly after, the WAL record for the operation itself. Or else,
> > we could create "virtual" transactions for such operations and make
> > sure those get replayed at the right point in the commit sequence. Or
> > else, I don't know, maybe something else. But I think the overall
> > picture is that we need to approach the problem by replicating changes
> > in WAL order, as a physical standby would do. Saying that a change is
> > "nontransactional" doesn't mean that it's exempt from ordering
> > requirements; rather, it means that that change has its own place in
> > that ordering, distinct from the transaction in which it occurred.
>
> But doesn't the approach with WAL-logging sequence state before COMMIT,
> and then applying it independently in WAL-order, do pretty much this?

I'm sort of repeating myself here, but: only if the only operations
that ever get performed on sequences are increments. Which is just not
true.

--
Robert Haas
EDB: http://www.enterprisedb.com
