Re: [HACKERS] logical decoding of two-phase transactions

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-03-22 01:22:47
Message-ID: c167340c-0751-d54b-0f32-fa83fd414863@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Nikhil,

I've been looking at this patch over the past few days, so here are my
thoughts so far ...

decoding aborted transactions
=============================

First, let's talk about handling of aborted transaction, which was
originally discussed in thread [1]. I'll try to summarize the status and
explain my understanding of the choices first.

[1]
https://www.postgresql.org/message-id/CAMGcDxeHBaXCz12LdfEmyJdghbms_dtC26pRZXKWRV2dazO-UQ%40mail.gmail.com

There were multiple ideas about how to deal with aborted transactions,
but I we eventually found various issues in all of them except for two -
interlocking decoding and aborts, and modifying the rules so that
aborted transactions are considered to be running while being decoded.

This patch uses the first approach, i.e. interlock. It has a couple of
disadvantages:

a) The abort may need to wait for decoding workers for a while.

This is annoying, but aborts are generally rare. And for systems with
many concurrent short transactions (where even tiny delays would matter)
it's unlikely the decoding workers will already start decoding the
aborted transaction.

b) output plugins need to call lock/unlock explicitly from the callbacks

Technically, we could wrap the whole callback in a lock/unlock, but that
would needlessly increase the amount of time spent holding the lock,
making the previous point much worse. As the callbacks are expected to
do network I/O etc. the amount of time could be quite significant.

The main disadvantage is of course that it's likely much less invasive
than tweaking which transactions are seen as running. So I think taking
this approach is a sensible choice at this point.

Now, about the interlock implementation - I see you've reused the "lock
group" concept from parallel query. That may make sense, unfortunately
there's about no documentation explaining how it works, what is the
"protocol" etc. There is fairly extensive documentation for "lock
groups" in src/backend/storage/lmgr/README, but while the "decoding
group" code is inspired by it, the code is actually very different.
Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader,
and you'll see what I mean.

So I think the first thing we need to do is add proper documentation
(possibly into the same README), explaining how the decode groups work,
how the decodeAbortPending works, etc.

Also, some function names seem a bit misleading. For example in the lock
group "BecomeLockGroupLeader" means (make the current process a group
leader), but apparently "BecomeDecodeGroupLeader" means "find the
process handling XID and make it a leader". But perhaps I got that
entirely wrong.

Of course LogicalLockTransaction and LogicalUnlockTransaction, should
have proper comments, which is particularly important as it's part of
the public API.

BTW, do we need to do any of this with (wal_level < logical)? I don't
see any quick bail-out in any of the functions in this case, but it
seems like a fairly obvious optimization.

Similarly, can't the logical workers indicate that they need to decode
2PC transactions (or in-progress transactions in general) in some way?
If we knew there are no such workers, that would also allow ignoring the
interlock, no?

Another thing is that I'm yet to see any performance tests. While we do
believe it will work fine, it's based on a number of assumptions:

a) aborts are rare
b) it has no measurable impact on commit

I think we need to verify this by actually measuring the impact on a
bunch of workloads. In particular, I think we need to test

i) impact on commit-only workloads
ii) impact on worst-case scenario

I'm not sure how (ii) would look like, considering the patch only deals
with decoding 2PC transactions, which have significant overhead on their
own - so I'm afraid the impact on "regular transactions" might be much
worse, once we add support for that.

decoding 2PC transactions
=========================

Now, the main topic of the patch. Overall the changes make sense, I
think - it modifies about the same places I touched in the streaming
patch, in similar ways.

The following comments are mostly in random order:

1) test_decoding.c
------------------

The "filter" functions do not follow the naming convention, so I suggest
to rename them like this:

- pg_filter_decode_txn -> pg_decode_filter_txn
- pg_filter_prepare -> pg_decode_filter_prepare_txn

or something like that. Also, looking at those functions (and those same
callbacks in the pgoutput plugin) I wonder if we really need to make
them part of the output plugin API.

I mean, AFAICS their only purpose is to filter 2PC transactions, but I
don't quite see why implementing those checks should be responsibility
of the plugin? I suppose it was done to make test_decoding customizable
(i.e. allow enabling/disabling of decoding 2PC as needed), right?

In that case I suggest make it configurable by plugin-level flags (I see
LogicalDecodingContext already has a enable_twophase), and moving the
checks to a function that is not part of the plugin API. Of course, in
that case the flag needs to be customizable from plugin options, not
just "Does the plugin have all the callbacks?".

The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem
a bit inconsistently named too (why decode vs. decoding?).

2) regression tests
-------------------

I really dislike the use of \set to run the same query repeatedly. It
makes analysis of regression failures even more tedious than it already
is. I'd just copy the query to all the places.

3) worker.c
-----------

The comment in apply_handle_rollback_prepared_txn says this:

/*
* During logical decoding, on the apply side, it's possible that a
* prepared transaction got aborted while decoding. In that case, we
* stop the decoding and abort the transaction immediately. However
* the ROLLBACK prepared processing still reaches the subscriber. In
* that case it's ok to have a missing gid
*/
if (LookupGXact(commit_data->gid)) { ... }

But is it safe to assume it never happens due to an error? In other
words, is there a way to decide that the GID really aborted? Or, why
should the provider sent the rollback at all - surely it could know if
the transaction/GID was sent to subscriber or not, right?

4) twophase.c
-------------

I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's
meant to identify 2PC files, then why not to keep the value. And if we
really need to modify it, why not to use another random number? By only
adding 1 to the current one, it makes it look like a random bit flip.

5) decode.c
-----------

The changes in DecodeCommit need proper comments.

In DecodeAbort, the "if" includes this condition:

ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)

which essentially means ROLLBACK PREPARED is translated into "is the
transaction prepared?. Shouldn't the code look at xl_xact_parsed_abort
instead, and make the ReorderBufferTxnIsPrepared an Assert?

6) logical.c
------------

I see StartupDecodingContext does this:

twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
(ctx->callbacks.commit_prepared_cb != NULL) +
(ctx->callbacks.abort_prepared_cb != NULL);

It seems a bit strange to make arithmetics on bools, I guess. In any
case, I think this should be an ERROR and not a WARNING:

if (twophase_callbacks != 3 && twophase_callbacks != 0)
ereport(WARNING,
(errmsg("Output plugin registered only %d twophase callbacks. "
"Twophase transactions will be decoded at commit time.",
twophase_callbacks)));

A plugin that implements only a subset of the callbacks seems outright
broken, so let's just fail.

7) proto.c / worker.c
---------------------

Until now, the 'action' (essentially the first byte of each message)
clearly identified what the message does. So 'C' -> commit, 'I' ->
insert, 'D' -> delete etc. This also means the "handle" methods were
inherently simple, because each handled exactly one particular action
and nothing else.

You've expanded the protocol in a way that suddenly 'C' means either
COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT
PREPARED. I don't think that's how the protocol should be extended - if
anything, it's damn confusing and unlike the existing code. You should
define new action, and keep the handlers in worker.c simple.

Also, this probably implies LOGICALREP_PROTO_VERSION_NUM increase.

8) reorderbuffer.h/c
--------------------

Similarly, I wonder why you replaced the ReorderBuffer boolean flags
(is_known_subxact, has_catalog_changes) with a bitmask? I find it way
more difficult to read (which is subjective, of course) but it also
makes IDEs dumber (suddenly they can't offer you field names).

Surely it wasn't done to save space, because by using an "int" you've
saved just 4B (there are 8 flags right now, so it'd need 8 bytes with
plain bool flags) on a structure that is already ~200B.

And you the added gid[GIDSIZE] to it, making it 400B for *all*
transactions and subtransactions (not just 2PC). Not to mention that the
GID is usually much shorter than the 200B.

So I suggest to use just a simple (char *) pointer for the GID, keeping
it NULL for most transactions, and switching back to plain bool flags.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-03-22 01:27:11 Re: pg_get_functiondef forgets about most GUC_LIST_INPUT GUCs
Previous Message Michael Paquier 2018-03-22 01:20:39 Re: pg_get_functiondef forgets about most GUC_LIST_INPUT GUCs