Re: [HACKERS] logical decoding of two-phase transactions

From: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-03-28 15:28:33
Message-ID: CAMGcDxcGtRXtF1r3VqddzHba+BTm7q8wbgEsyhmOFuDckMTZFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi Tomas,

> Now, about the interlock implementation - I see you've reused the "lock
> group" concept from parallel query. That may make sense, unfortunately
> there's about no documentation explaining how it works, what is the
> "protocol" etc. There is fairly extensive documentation for "lock
> groups" in src/backend/storage/lmgr/README, but while the "decoding
> group" code is inspired by it, the code is actually very different.
> Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
> and you'll see what I mean.
>
> So I think the first thing we need to do is add proper documentation
> (possibly into the same README), explaining how the decode groups work,
> how the decodeAbortPending works, etc.
>

I have added details about this in src/backend/storage/lmgr/README as
suggested by you.

>
> BTW, do we need to do any of this with (wal_level < logical)? I don't
> see any quick bail-out in any of the functions in this case, but it
> seems like a fairly obvious optimization.
>

The calls to the LogicalLockTransaction/LogicalUnLockTransaction APIs
will be from inside plugins or the reorderbuffer code paths. Those
will get invoked only in the wal_level logical case, hence I did not
add further checks.

> Similarly, can't the logical workers indicate that they need to decode
> 2PC transactions (or in-progress transactions in general) in some way?
> If we knew there are no such workers, that would also allow ignoring the
> interlock, no?
>

These APIs check if the transaction is already committed and cache
that information for further calls, so for regular transactions this
becomes a no-op

>
> decoding 2PC transactions
> =========================
>
> Now, the main topic of the patch. Overall the changes make sense, I
> think - it modifies about the same places I touched in the streaming
> patch, in similar ways.
>
> The following comments are mostly in random order:
>
> 1) test_decoding.c
> ------------------
>
> The "filter" functions do not follow the naming convention, so I suggest
> to rename them like this:
>
> - pg_filter_decode_txn -> pg_decode_filter_txn
> - pg_filter_prepare -> pg_decode_filter_prepare_txn
>
> or something like that. Also, looking at those functions (and those same
> callbacks in the pgoutput plugin) I wonder if we really need to make
> them part of the output plugin API.
>
> I mean, AFAICS their only purpose is to filter 2PC transactions, but I
> don't quite see why implementing those checks should be responsibility
> of the plugin? I suppose it was done to make test_decoding customizable
> (i.e. allow enabling/disabling of decoding 2PC as needed), right?
>
> In that case I suggest make it configurable by plugin-level flags (I see
> LogicalDecodingContext already has a enable_twophase), and moving the
> checks to a function that is not part of the plugin API. Of course, in
> that case the flag needs to be customizable from plugin options, not
> just "Does the plugin have all the callbacks?".
>

The idea behind exposing the API is to allow the plugins to have
selective control over specific 2PC actions. They might want to decode
certain 2PC but not some others. By providing this callback, they can
do that selectively.

> The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem
> a bit inconsistently named too (why decode vs. decoding?).
>

This has been removed in the latest patches altogether. Maybe you were
referring to an older patch.

>
> 2) regression tests
> -------------------
>
> I really dislike the use of \set to run the same query repeatedly. It
> makes analysis of regression failures even more tedious than it already
> is. I'd just copy the query to all the places.
>

They are long-winded queries and IMO made the test file look too
cluttered and verbose..

>
> 3) worker.c
> -----------
>
> The comment in apply_handle_rollback_prepared_txn says this:
>
> /*
> * During logical decoding, on the apply side, it's possible that a
> * prepared transaction got aborted while decoding. In that case, we
> * stop the decoding and abort the transaction immediately. However
> * the ROLLBACK prepared processing still reaches the subscriber. In
> * that case it's ok to have a missing gid
> */
> if (LookupGXact(commit_data->gid)) { ... }
>
> But is it safe to assume it never happens due to an error? In other
> words, is there a way to decide that the GID really aborted? Or, why
> should the provider sent the rollback at all - surely it could know if
> the transaction/GID was sent to subscriber or not, right?
>

Since we decode in commit WAL order, when we reach the ROLLBACK
PREPARED wal record, we cannot be sure that we did infact abort the
decoding mid ways because of this concurrent rollback. It's possible
that this rollback comes much much later as well when all decoding
backends have successfully prepared it on the subscribers already.

>
> 4) twophase.c
> -------------
>
> I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's
> meant to identify 2PC files, then why not to keep the value. And if we
> really need to modify it, why not to use another random number? By only
> adding 1 to the current one, it makes it look like a random bit flip.
>

We could retain the existing magic here.

>
> 5) decode.c
> -----------
>
> The changes in DecodeCommit need proper comments.
>
> In DecodeAbort, the "if" includes this condition:
>
> ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)
>
> which essentially means ROLLBACK PREPARED is translated into "is the
> transaction prepared?. Shouldn't the code look at xl_xact_parsed_abort
> instead, and make the ReorderBufferTxnIsPrepared an Assert?

This again goes back to the earlier callback in which want the
pg_decode_filter_prepare_txn to selectively decide to filter out or
decode some of the 2PC transactions. If we allow that callback, then
we need to consult ReorderBufferTxnIsPrepared to get the same response
for these 2PC transactions.

>
>
> 6) logical.c
> ------------
>
> I see StartupDecodingContext does this:
>
> twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
> (ctx->callbacks.commit_prepared_cb != NULL) +
> (ctx->callbacks.abort_prepared_cb != NULL);
>
> It seems a bit strange to make arithmetics on bools, I guess. In any
> case, I think this should be an ERROR and not a WARNING:
>
> if (twophase_callbacks != 3 && twophase_callbacks != 0)
> ereport(WARNING,
> (errmsg("Output plugin registered only %d twophase callbacks. "
> "Twophase transactions will be decoded at commit time.",
> twophase_callbacks)));
>
> A plugin that implements only a subset of the callbacks seems outright
> broken, so let's just fail.
>

Ok, done.

>
> 7) proto.c / worker.c
> ---------------------
>
> Until now, the 'action' (essentially the first byte of each message)
> clearly identified what the message does. So 'C' -> commit, 'I' ->
> insert, 'D' -> delete etc. This also means the "handle" methods were
> inherently simple, because each handled exactly one particular action
> and nothing else.
>
> You've expanded the protocol in a way that suddenly 'C' means either
> COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT
> PREPARED. I don't think that's how the protocol should be extended - if
> anything, it's damn confusing and unlike the existing code. You should
> define new action, and keep the handlers in worker.c simple.
>

I thought this grouped regular commit and 2PC transactions properly.
Can look at this again if this style is not favored.

> Also, this probably implies LOGICALREP_PROTO_VERSION_NUM increase.
>

Ok, increased it to 2.

PFA, latest patch set. The ReorderBufferCommit() handling has been
further simplified now without worrying too much about optimizing for
abort handling at various steps.

This also contains an additional/optional 7th patch which has a test
case to solely demonstrate the concurrent abort/logical decoding
interlocking. It uses the delay using sleep logic while holding
LogicalTransactionLock. This additional patch might not be considered
for commit as the delay based approach is prone to failures on slower
machines.

Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
is the exact patch that you had posted for an earlier commit.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch application/octet-stream 7.3 KB
0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch application/octet-stream 23.8 KB
0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch application/octet-stream 19.1 KB
0004-Support-decoding-of-two-phase-transactions-at-PREPAR.patch application/octet-stream 31.5 KB
0005-pgoutput-output-plugin-support-for-logical-decoding-.patch application/octet-stream 33.6 KB
0006-Teach-test_decoding-plugin-to-work-with-2PC.patch application/octet-stream 20.9 KB
0007-Additional-optional-test-case-to-demonstrate-decoding-rollbac.patch application/octet-stream 9.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ildus Kurbangaliev 2018-03-28 15:34:54 Re: committing inside cursor loop
Previous Message Erik Rijkers 2018-03-28 15:25:32 Re: WIP: Covering + unique indexes.