Re: logical decoding and replication of sequences, take 2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences, take 2
Date: 2022-11-16 21:05:04
Message-ID: CA+TgmoaYG7672OgdwpGm5cOwy8_ftbs=3u-YMvR9fiJwQUzgrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> The other option might be to make these messages non-transactional, in
> which case we'd separate the ordering from COMMIT ordering, evading the
> reordering problem.
>
> That'd mean we'd ignore rollbacks (which seems fine), we could probably
> optimize this by checking if the state actually changed, etc. But we'd
> also need to deal with transactions created in the (still uncommitted)
> transaction. But I'm also worried it might lead to the same issue with
> non-transactional behaviors that forced revert in v15.

I think it might be a good idea to step back slightly from
implementation details and try to agree on a theoretical model of
what's happening here. Let's start by banishing the words
transactional and non-transactional from the conversation and talk
about what logical replication is trying to do.

We can imagine that the replicated objects on the primary pass through
a series of states S1, S2, ..., Sn, where n keeps going up as new
state changes occur. The state, for our purposes here, is the contents
of the database as they could be observed by a user running SELECT
queries at some moment in time chosen by the user. For instance, if
the initial state of the database is S1, and then the user executes
BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the
state that differs from S1 in that both of those rows are now part of
the database contents. There is no state where one of those rows is
visible and the other is not. That was never observable by the user,
except from within the transaction as it was executing, which we can
and should discount. I believe that the goal of logical replication is
to bring about a state of affairs where the set of states observable
on the standby is a subset of the states observable on the primary.
That is, if the primary goes from S1 to S2 to S3, the standby can do
the same thing, or it can go straight from S1 to S3 without ever
making it possible for the user to observe S2. Either is correct
behavior. But the standby cannot invent any new states that didn't
occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to
S3, or something like that. It can only consolidate changes that
occurred separately on the primary, never split them up. Neither can
it reorder them.

Now, if you accept this as a reasonable definition of correctness,
then the next question is what consequences it has for transactional
and non-transactional behavior. If all behavior is transactional, then
we've basically got to replay each primary transaction in a single
standby transaction, and commit those transactions in the same order
that the corresponding primary transactions committed. We could
legally choose to merge a group of transactions that committed one
after the other on the primary into a single transaction on the
standby, and it might even be a good idea if they're all very tiny,
but it's not required. But if there are non-transactional things
happening, then there are changes that become visible at some time
other than at a transaction commit. For example, consider this
sequence of events, in which each "thing" that happens is
transactional except where the contrary is noted:

T1: BEGIN;
T2: BEGIN;
T1: Do thing 1;
T2: Do thing 2;
T1: Do a non-transactional thing;
T1: Do thing 3;
T2: Do thing 4;
T2: COMMIT;
T1: COMMIT;

From the point of the user here, there are 4 observable states here:

S1: Initiate state.
S2: State after the non-transactional thing happens.
S3: State after T2 commits (reflects the non-transactional thing plus
things 2 and 4).
S4: State after T1 commits.

Basically, the non-transactional thing behaves a whole lot like a
separate transaction. That non-transactional operation ought to be
replicated before T2, which ought to be replicated before T1. Maybe
logical replication ought to treat it in exactly that way: as a
separate operation that needs to be replicated after any earlier
transactions that completed prior to the history shown here, but
before T2 or T1. Alternatively, you can merge the non-transactional
change into T2, i.e. the first transaction that committed after it
happened. But you can't merge it into T1, even though it happened in
T1. If you do that, then you're creating states on the standby that
never existed on the primary, which is wrong. You could argue that
this is just nitpicking: who cares if the change in the sequence value
doesn't get replicated at exactly the right moment? But I don't think
it's a technicality at all: I think if we don't make the operation
appear to happen at the same point in the sequence as it became
visible on the master, then there will be endless artifacts and corner
cases to the bottom of which we will never get. Just like if we
replicated the actual transactions out of order, chaos would ensue,
because there can be logical dependencies between them, so too can
there be logical dependencies between non-transactional operations, or
between a non-transactional operation and a transactional operation.

To make it more concrete, consider two sessions concurrently running this SQL:

insert into t1 select nextval('s1') from generate_series(1,1000000) g;

There are, in effect, 2000002 transaction-like things here. The
sequence gets incremented 2 million times, and then there are 2
commits that each insert a million rows. Perhaps the actual order of
events looks something like this:

1. nextval the sequence N times, where N >= 1 million
2. commit the first transaction, adding a million rows to t1
3. nextval the sequence 2 million - N times
4. commit the second transaction, adding another million rows to t1

Unless we replicate all of the nextval operations that occur in step 1
at the same time or prior to replicating the first transaction in step
2, we might end up making visible a state where the next value of the
sequence is less than the highest value present in the table, which
would be bad.

With that perhaps overly-long set of preliminaries, I'm going to move
on to talking about the implementation ideas which you mention. You
write that "the current solution is based on WAL-logging state of all
sequences incremented by the transaction at COMMIT" and then, it seems
to me, go on to demonstrate that it's simply incorrect. In my opinion,
the fundamental problem is that it doesn't look at the order that
things happened on the primary and do them in the same order on the
standby. Instead, it accepts that the non-transactional operations are
going to be replicated at the wrong time, and then tries to patch
around the issue by attempting to scrounge up the correct values at
some convenient point and use that data to compensate for our failure
to do the right thing at an earlier point. That doesn't seem like a
satisfying solution, and I think it will be hard to make it fully
correct.

Your alternative proposal says "The other option might be to make
these messages non-transactional, in which case we'd separate the
ordering from COMMIT ordering, evading the reordering problem." But I
don't think that avoids the reordering problem at all. Nor do I think
it's correct. I don't think you *can* separate the ordering of these
operations from the COMMIT ordering. They are, as I argue here,
essentially mini-commits that only bump the sequence value, and they
need to be replicated after the transactions that commit prior to the
sequence value bump and before those that commit afterward. If they
aren't handled that way, I don't think you're going to get fully
correct behavior.

I'm going to confess that I have no really specific idea how to
implement that. I'm just not sufficiently familiar with this code.
However, I suspect that the solution lies in changing things on the
decoding side rather than in the WAL format. I feel like the
information that we need in order to do the right thing must already
be present in the WAL. If it weren't, then how could crash recovery
work correctly, or physical replication? At any given moment, you can
choose to promote a physical standby, and at that point the state you
observe on the new primary had better be some state that existed on
the primary at some point in its history. At any moment, you can
unplug the primary, restart it, and run crash recovery, and if you do,
you had better end up with some state that existed on the primary at
some point shortly before the crash. I think that there are actually a
few subtle inaccuracies in the last two sentences, because actually
the order in which transactions become visible on a physical standby
can differ from the order in which it happens on the primary, but I
don't think that actually changes the picture much. The point is that
the WAL is the definitive source of information about what happened
and in what order it happened, and we use it in that way already in
the context of physical replication, and of standbys. If logical
decoding has a problem with some case that those systems handle
correctly, the problem is with logical decoding, not the WAL format.

In particular, I think it's likely that the "non-transactional
messages" that you mention earlier don't get applied at the point in
the commit sequence where they were found in the WAL. Not sure why
exactly, but perhaps the point at which we're reading WAL runs ahead
of the decoding per se, or something like that, and thus those
non-transactional messages arrive too early relative to the commit
ordering. Possibly that could be changed, and they could be buffered
until earlier commits are replicated. Or else, when we see a WAL
record for a non-transactional sequence operation, we could arrange to
bundle that operation into an "adjacent" replicated transaction i.e.
the transaction whose commit record occurs most nearly prior to, or
most nearly after, the WAL record for the operation itself. Or else,
we could create "virtual" transactions for such operations and make
sure those get replayed at the right point in the commit sequence. Or
else, I don't know, maybe something else. But I think the overall
picture is that we need to approach the problem by replicating changes
in WAL order, as a physical standby would do. Saying that a change is
"nontransactional" doesn't mean that it's exempt from ordering
requirements; rather, it means that that change has its own place in
that ordering, distinct from the transaction in which it occurred.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2022-11-16 21:13:27 Re: About displaying NestLoopParam
Previous Message Tom Lane 2022-11-16 20:46:44 Re: Making Vars outer-join aware