Re: [HACKERS] logical decoding of two-phase transactions

From: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
Date: 2018-08-12 18:59:30
Message-ID: 87zhxrwgvh.fsf@ars-thinkpad
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> writes:

>> - Decoding transactions at PREPARE record changes rules of the "we ship
>> all commits after lsn 'x'" game. Namely, it will break initial
>> tablesync: what if consistent snapshot was formed *after* PREPARE, but
>> before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
>> of getting inital contents + continious stream of changes the receiver
>> will miss the prepared xact contents and raise 'prepared xact doesn't
>> exist' error. I think the starting point to address this is to forbid
>> two-phase decoding of xacts with lsn of PREPARE less than
>> snapbuilder's start_decoding_at.
> It will be the job of the plugin to return a consistent answer for
> every GID that is encountered. In this case, the plugin will decode
> the transaction at COMMIT PREPARED time and not at PREPARE time.

I can't imagine a scenario in which plugin would want to send COMMIT
PREPARED instead of replaying xact fully on CP record given it had never
seen PREPARE record. On the other hand, tracking such situations on
plugins side would make plugins life unneccessary complicated: either it
has to dig into snapbuilder/slot internals to learn when the snapshot
became consistent (which currently is impossible as this lsn is not
saved anywhere btw), or it must fsync each its decision to do or not to
do 2PC.

Technically, my concern covers not only tablesync, but just plain
decoding start: we don't want to ship COMMIT PREPARED if the downstream
had never had chance to see PREPARE.

As for tablesync, looking at current implementation I contemplate that
we would need to do something along the following lines:
- Tablesync worker performs COPY.
- It then speaks with main apply worker to arrange (origin)
lsn of sync point, as it does now.
- Tablesync worker applies changes up to arranged lsn; it never uses
two-phase decoding, all xacts are replayed on COMMIT PREPARED.
Moreover, instead of going into SYNCDONE state immediately after
reaching needed lsn, it stops replaying usual commits but continues
to receive changes to finish all transactions which were prepared
before sync point (we would need some additional support from
reorderbuffer to learn when this happens). Only then it goes into
- Behaviour of the main apply worker doesn't change: it
ignores changes of the table in question before sync point and
applies them after sync point. It also can use 2PC decoding of any
transaction or not, as it desires.
I believe this approach would implement tablesync correctly (all changes
are applied, but only once) with minimal fuss.

>> - Currently we will call abort_prepared cb even if we failed to actually
>> prepare xact due to concurrent abort. I think it is confusing for
>> users. We should either handle this by remembering not to invoke
>> abort_prepared in these cases or at least document this behaviour,
>> leaving this problem to the receiver side.
> The point is, when we reach the "ROLLBACK PREPARED", we have no idea
> if the "PREPARE" was aborted by this rollback happening concurrently.
> So it's possible that the 2PC has been successfully decoded and we
> would have to send the rollback to the other side which would need to
> check if it needs to rollback locally.

I understand this. But I find this confusing for the users, so I propose
- Either document that "you might get abort_prepared cb called even
after abort cb was invoked for the same transaction";
- Or consider adding some infrastructure to reorderbuffer to
remember not to call abort_prepared in these cases. Due to possible
reboots, I think this means that we need not to
ReorderBufferCleanupTXN immediately after failed attempt to replay
xact on PREPARE, but mark it as 'aborted' and keep it until we see
ABORT PREPARED record. If we see that xact is marked as aborted, we
don't call abort_prepared_cb. That way even if the decoder restarts
in between, we will see PREPARE in WAL, inquire xact status (even
if we skip it as already replayed) and mark it as aborted again.

>> - I find it suspicious that DecodePrepare completely ignores actions of
>> SnapBuildCommitTxn. For example, to execute invalidations, the latter
>> sets base snapshot if our xact (or subxacts) did DDL and the snapshot
>> not set yet. My fantasy doesn't hint me the concrete example
>> where this would burn at the moment, but it should be considered.
> I had discussed this area with Petr and we didn't see any issues as well, then.

Ok, simplifying, what SnapBuildCommitTxn practically does is
* Decide whether we are interested in tracking this xact effects, and
if we are, mark it as committed.
* Build and distribute snapshot to all RBTXNs, if it is important.
* Set base snap of our xact if it did DDL, to execute invalidations
during replay.

I see that we don't need to do first two bullets during DecodePrepare:
xact effects are still invisible for everyone but itself after
PREPARE. As for seeing xacts own own changes, it is implemented via
logging cmin/cmax and we don't need to mark xact as committed for that
(c.f ReorderBufferCopySnap).

Regarding the third point... I think in 2PC decoding we might need to
execute invalidations twice:
1) After replaying xact on PREPARE to forget about catalog changes
xact did -- it is not yet committed and must be invisible to
other xacts until CP. In the latest patchset invalidations are
executed only if there is at least one change in xact (it has base
snap). It looks fine: we can't spoil catalogs if there were nothing
to decode. Better to explain that somewhere.
2) After decoding COMMIT PREPARED to make changes visible. In current
patchset it is always done. Actually, *this* is the reason
RBTXN might already exist when we enter ReorderBufferFinishPrepared,
not "(during restarts for example)" as comment says there:
if there were inval messages, RBTXN will be created
in DecodeCommit during their addition.

BTW, "that we might need to execute invalidations, add snapshot" in
SnapBuildCommitTxn looks like a cludge to me: I suppose it is better to
do that at ReorderBufferXidSetCatalogChanges.

Now, another issue is registering xact as committed in
SnapBuildCommitTxn during COMMIT PREPARED processing. Since RBTXN is
always purged after xact replay on PREPARE, the only medium we have for
noticing catalog changes during COMMIT PREPARED is invalidation messages
attached to the CP record. This raises the following question.
* If there is a guarantee that whenever xact makes catalog changes it
generates invalidation messages, then this code is fine. However,
currently ReorderBufferXidSetCatalogChanges is also called on
XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which
is useless if such guarantee exists.
* If, on the other hand, there is no such guarantee, this code is

>> - I am one of those people upthread who don't think that converting
>> flags to bitmask is beneficial -- especially given that many of them
>> are mutually exclusive, e.g. xact can't be committed and aborted at
>> the same time. Apparently you have left this to the committer though.
> Hmm, there seems to be divided opinion on this. I am willing to go
> back to using the booleans if there's opposition and if the committer
> so wishes. Note that this patch will end up adding 4/5 more booleans
> in that case (we add new ones for prepare, commit prepare, abort,
> rollback prepare etc).

Well, you can unite mutually exclusive fields into one enum or char with
macros defining possible values. Transaction can't be committed and
aborted at the same time, etc.

>> + row. The <function>change_cb</function> callback may access system or
>> + user catalog tables to aid in the process of outputting the row
>> + modification details. In case of decoding a prepared (but yet
>> + uncommitted) transaction or decoding of an uncommitted transaction, this
>> + change callback is ensured sane access to catalog tables regardless of
>> + simultaneous rollback by another backend of this very same transaction.
>> I don't think we should explain this, at least in such words. As
>> mentioned upthread, we should warn about allowed systable_* accesses
>> instead. Same for message_cb.
> Looks like you are looking at an earlier patchset. The latest patchset
> has removed the above.

I see, sorry.

Arseny Sher
Postgres Professional:
The Russian Postgres Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-08-12 19:08:50 Re: Allowing printf("%m") only where it actually works
Previous Message Tom Lane 2018-08-12 14:49:03 Re: Allowing printf("%m") only where it actually works