From: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] logical decoding of two-phase transactions
> I think Nikhils has done some significant work on this patch.
> Hopefully he'll be able to share it.
PFA, latest patch. This builds on top of the last patch submitted by
Sokolov Yura and adds the actual logical replication interfaces to
allow PREPARE or COMMIT/ROLLBACK PREPARED on a logical subscriber.
I tested with the latest PG head by setting up PUBLICATION/SUBSCRIPTION
for some tables. I tried DML on these tables via 2PC, and it seems to
work, with subscribers honoring COMMIT/ROLLBACK PREPARED commands.
Now getting back to the two main issues that we have been discussing:
Logical decoding deadlocking/hanging due to locks on catalog tables
When we are decoding, we do not hold long-term locks on the table. We
call RelationIdGetRelation() and RelationClose(), which
increment/decrement the relcache reference count, and that reference is
held/released per ReorderBuffer change record. The call to
RelationIdGetRelation() takes an AccessShareLock on pg_class,
pg_attribute, etc. only while building the relation descriptor. The
plugin itself can access the rel/syscache, but none of that takes a
lock stronger than AccessShareLock on the catalogs, so these activities
do not hold locks that could cause decoding to stall.
The only issue could be with locks taken on catalog objects in the
prepared transaction itself. If the 2PC transaction takes an
AccessExclusiveLock on catalog objects, via "LOCK pg_class" for
example, then pretty much nothing else will make progress in other
sessions in that database until the active session runs COMMIT PREPARED
or aborts the 2PC transaction.
Also, in some cases, such as CLUSTER on catalog objects, the code
explicitly refuses to prepare the transaction:
postgres=# CLUSTER pg_class using pg_class_oid_index ;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR: cannot PREPARE a transaction that modified relation mapping
This makes sense because we do not want to get into a state where the
DB is unable to progress meaningfully at all.
Is there any other locking scenario that we need to consider?
Otherwise, are we all OK with this being a non-issue for 2PC decoding?
Now on to the second issue: 2PC logical decoding with a concurrent
"ABORT PREPARED" of the same transaction.
Before 2PC, we only ever decoded committed transaction records. Now,
with prepared transactions, we run the risk of decoding while some
other backend concurrently runs COMMIT PREPARED or ROLLBACK PREPARED
on the same transaction. If the other backend commits, that's not an
issue at all.
The issue is with a concurrent rollback of the prepared transaction.
We need a way to ensure that the 2PC transaction does not get aborted
while we are in the midst of applying a change record. One way to
handle this is to interlock ABORT PREPARED with an ongoing logical
decoding operation for a bounded period of at most one change-record
apply cycle.
I am outlining one solution but am all ears for better, elegant solutions.
* We introduce two new booleans in the TwoPhaseState entry:
"beingdecoded" and "abortpending".
1) Before we start iterating through the change records of a prepared
transaction, we check "abortpending" in the corresponding TwoPhaseState
entry. If it is not set, we set "beingdecoded". If "abortpending" is
set, we know that this transaction is going to go away, so we treat it
like a regular abort and do not do any decoding at all.
2) With "beingdecoded" set, we start with the first change record from
the iteration, decode it, and apply it.
3) Before decoding the next change record, we re-check whether
"abortpending" has been set. If it has, we do not decode that record;
the abort is thus delay-bounded to a maximum of one change-record
decode/apply cycle after the aborting backend signals its intent. We
then need to send an ABORT to the subscriber (a regular abort, not
ROLLBACK PREPARED, since we have not sent the PREPARE yet; we cannot
send PREPARE midway because the transaction block as a whole might not
be consistent). We will have to add an ABORT callback in pgoutput for
this; there is only a COMMIT callback as of now. The subscriber will
abort the transaction midway because of this. We can then follow up
with a DUMMY prepared transaction, e.g. "BEGIN; PREPARE TRANSACTION
'gid';". The reasoning for the DUMMY 2PC is given in (6) below.
4) Keep decoding change records as long as "abortpending" is not set.
5) At the end of the change set, we send "PREPARE" to the subscribers
and then clear the "beingdecoded" flag in the TwoPhaseState entry. We
are then free to commit or roll back the prepared transaction at any
time.
6) We will still decode the "ROLLBACK PREPARED" WAL entry when it
arrives on the provider. This invokes the abort_prepared callback on
the subscriber (I have already added this in my patch), which aborts
the dummy PREPARED transaction from step (3) above. Instead of doing
this, we could check whether the 'GID' entry exists and only then call
ROLLBACK PREPARED on the subscriber. But in that case we cannot be
sure whether the GID is missing because of a rollback-during-decode on
the provider or for some other reason. If we are OK with not finding
GIDs on the subscriber side, then I am fine with removing the DUMMY
prepare from step (3).
7) While the above is happening, if another backend wants to abort the
prepared transaction, it sets "abortpending". If "beingdecoded" is
true, the abort-prepared function waits for it to clear, releasing the
lock and re-checking after a short interval. Once "beingdecoded"
clears (which will happen before the next change-record apply, when
the walsender sees "abortpending" set), the abort can go ahead as
usual.
Note that we will have to be careful to clear the "beingdecoded" flag
even if decoding fails, the subscription is dropped, or any other
issue occurs. With that, this can work fine, IMO.
Thoughts? Holes in the theory? Other issues?
I am attaching my latest and greatest WIP patch, which does not
contain any of the above abort handling yet.
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services