Re: [HACKERS] logical decoding of two-phase transactions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-03-29 21:52:18
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers


I've been reviewing the last patch version, focusing mostly on the
decoding group part. Let me respond to several points first, then new
review bits.

On 03/28/2018 05:28 PM, Nikhil Sontakke wrote:
> Hi Tomas,
>> Now, about the interlock implementation - I see you've reused the "lock
>> group" concept from parallel query. That may make sense, unfortunately
>> there's about no documentation explaining how it works, what is the
>> "protocol" etc. There is fairly extensive documentation for "lock
>> groups" in src/backend/storage/lmgr/README, but while the "decoding
>> group" code is inspired by it, the code is actually very different.
>> Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
>> and you'll see what I mean.
>> So I think the first thing we need to do is add proper documentation
>> (possibly into the same README), explaining how the decode groups work,
>> how the decodeAbortPending works, etc.
> I have added details about this in src/backend/storage/lmgr/README as
> suggested by you.

Thanks. I think the README is a good start, but I think we also need to
improve the comments, which is usually more detailed than the README.
For example, it's not quite acceptable that LogicalLockTransaction and
LogicalUnlockTransaction have about no comments, especially when it's
meant to be public API for decoding plugins.

>> BTW, do we need to do any of this with (wal_level < logical)? I don't
>> see any quick bail-out in any of the functions in this case, but it
>> seems like a fairly obvious optimization.
> The calls to the LogicalLockTransaction/LogicalUnLockTransaction APIs
> will be from inside plugins or the reorderbuffer code paths. Those
> will get invoked only in the wal_level logical case, hence I did not
> add further checks.

Oh, right.

>> Similarly, can't the logical workers indicate that they need to decode
>> 2PC transactions (or in-progress transactions in general) in some way?
>> If we knew there are no such workers, that would also allow ignoring the
>> interlock, no?
> These APIs check if the transaction is already committed and cache
> that information for further calls, so for regular transactions this
> becomes a no-op

I see. So when the output plugin never calls LogicalLockTransaction on
an in-progress transaction (e.g. 2PC after PREPARE), it never actually
initializes the decoding group. Works for me.

>> 2) regression tests
>> -------------------
>> I really dislike the use of \set to run the same query repeatedly. It
>> makes analysis of regression failures even more tedious than it already
>> is. I'd just copy the query to all the places.
> They are long-winded queries and IMO made the test file look too
> cluttered and verbose..

Well, I don't think that's a major problem, and it certainly makes it
more difficult to investigate regression failures.

>> 3) worker.c
>> -----------
>> The comment in apply_handle_rollback_prepared_txn says this:
>> /*
>> * During logical decoding, on the apply side, it's possible that a
>> * prepared transaction got aborted while decoding. In that case, we
>> * stop the decoding and abort the transaction immediately. However
>> * the ROLLBACK prepared processing still reaches the subscriber. In
>> * that case it's ok to have a missing gid
>> */
>> if (LookupGXact(commit_data->gid)) { ... }
>> But is it safe to assume it never happens due to an error? In other
>> words, is there a way to decide that the GID really aborted? Or, why
>> should the provider sent the rollback at all - surely it could know if
>> the transaction/GID was sent to subscriber or not, right?
> Since we decode in commit WAL order, when we reach the ROLLBACK
> PREPARED wal record, we cannot be sure that we did infact abort the
> decoding mid ways because of this concurrent rollback. It's possible
> that this rollback comes much much later as well when all decoding
> backends have successfully prepared it on the subscribers already.

Ah, OK. So when the transaction gets aborted (by ROLLBACK PREPARED)
concurrently with the decoding, we abort the apply transaction and
discard the ReorderBufferTXN.

Which means that later, when we decode the abort, we don't know whether
the decoding reached abort or prepare, and so we have to send the
ROLLBACK PREPARED to the subscriber too.

For a moment I was thinking we might simply remember TXN outcome in
reorder buffer, but obviously that does not work - the decoding might
restart in between, and as you say the distance (in terms of WAL) may be
quite significant.

>> 7) proto.c / worker.c
>> ---------------------
>> Until now, the 'action' (essentially the first byte of each message)
>> clearly identified what the message does. So 'C' -> commit, 'I' ->
>> insert, 'D' -> delete etc. This also means the "handle" methods were
>> inherently simple, because each handled exactly one particular action
>> and nothing else.
>> You've expanded the protocol in a way that suddenly 'C' means either
>> PREPARED. I don't think that's how the protocol should be extended - if
>> anything, it's damn confusing and unlike the existing code. You should
>> define new action, and keep the handlers in worker.c simple.
> I thought this grouped regular commit and 2PC transactions properly.
> Can look at this again if this style is not favored.

Hmmm, it's not how I'd do it, but perhaps someone who originally
designed the protocol should review this bit.

Now, the new bits ... attached is a .diff with a couple of changes and
comments on various places.

1) LogicalLockTransaction

- This function is part of a public API, yet it has no comment. That
needs fixing - it has to be clear how to use it. The .diff suggests a
comment, but it may need improvements.

- As I mentioned in the previous review, BecomeDecodeGroupLeader is a
misleading name. It suggest the called becomes a leader, while in fact
it looks up the PROC running the XID and makes it a leader. This is
obviously due to copying the code from lock groups, where the caller
actually becomes the leader. It's incorrect here. I suggest something
like LookupDecodeGroupLeader() or something.

- In the "if (MyProc->decodeGroupLeader == NULL)" block there are two
blocks rechecking the transaction status:

if (proc == NULL)
{ ... recheck ... }

if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
{ ... recheck ...}

I suggest to join them into a single block.

- This Assert() is either bogus and there can indeed be cases with
(MyProc->decodeGroupLeader==NULL), or the "if" is unnecessary:


if (MyProc->decodeGroupLeader) { ... }

- I'm wondering why we're maintaining decodeAbortPending flags both for
the leader and all the members. ISTM it'd be perfectly fine to only
check the leader, particularly because RemoveDecodeGroupMemberLocked
removes the members from the decoding group. So that seems unnecessary,
and we can remove the

if (MyProc->decodeAbortPending)
{ ... }

- LogicalUnlockTransaction needs a comment(s) too.

2) BecomeDecodeGroupLeader

- Wrong name (already mentioned above).

- It can bail out when (!proc), which will simplify the code a bit.

- Why does it check PID of the process at all? Seems unnecessary,
considering we're already checking the XID.

- Can a proc executing a XID have a different leader? I don't think so,
so I'd make that an Assert().

Assert(!proc || (proc->decodeGroupLeader == proc));

And it'll allow simplification of some of the conditions.

- We're only dealing with prepared transactions now, so I'd just drop
the is_prepared flag - it'll make the code a bit simpler, we can add it
later in patch adding decoding of regular in-progress transactions. We
can't test the (!is_prepared) anyway.

- Why are we making the leader also a member of the group? Seems rather
unnecessary, and it complicates the abort handling, because we need to
skip the leader when deciding to wait.

3) LogicalDecodeRemoveTransaction

- It's not clear to me what happens when a decoding backend gets killed
between LogicalLockTransaction/LogicalUnlockTransaction. Doesn't that
mean LogicalDecodeRemoveTransaction will get stuck, because the proc is
still in the decoding group?

- The loop now tweaks decodeAbortPending of the members, but I don't
think that's necessary either - the LogicalUnlockTransaction can check
the leader flag just as easily.

4) a bunch of comment / docs improvements, ...

I'm suggesting rewording a couple of comments. I've also added a couple
of missing comments - e.g. to LogicalLockTransaction and the lock group
methods in general.

Also, a couple more questions and suggestions in XXX comments.


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
logical-2pc-decoding-review.diff text/x-patch 19.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-03-29 21:58:15 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Peter Eisentraut 2018-03-29 21:52:07 Re: Proposal: http2 wire format