Prepared transaction releasing locks before deregistering its GID

From: Oleksii Kliukin <alexk(at)hintbits(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Prepared transaction releasing locks before deregistering its GID
Date: 2019-02-18 16:05:13
Message-ID: BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6@hintbits.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

We have an app that copies some metrics between predefined tables on the
source and destination clusters, truncating the table at the source
afterward.

The app locks both the source and the destination table at the beginning and
then, once copy concludes, prepares a transaction on each of the affected
hosts and commits it at last. The naming schema for the prepared transaction
involves the source and the destination table; while it doesn’t guarantee
the uniqueness if multiple copy processes run on the same host, the
possibility of that happening should be prevented by the locks held on
tables involved.

Or so we thought. From time to time, our cron jobs reported a transaction
identifier already in use. After digging into this, my colleague Ildar Musin
has produced a simple pgbench script to reproduce the issue:

----------------------------------------------------------------
$ cat custom.sql
begin;
lock abc in exclusive mode;
insert into abc values ((random() * 1000)::int);

prepare transaction 'test transaction';
commit prepared 'test transaction’;
----------------------------------------------------------------

pgbench launched with:

pgbench -T 20 -c 10 -f custom.sql postgres

(after creating the table abc) produces a number of complaints:

ERROR: transaction identifier "test transaction" is already in use.

I could easily reproduce that on Linux, but not Mac OS, with both Postgres
10.7 and 11.2.

That looks like a race condition to me. What happens is that another
transaction with the name identical to the running one can start and proceed
to the prepare phase while the original one commits, failing at last instead
of waiting for the original one to finish.

By looking at the source code of FinishPreparedTransaction() I can see the
RemoveGXact() call, which removes the prepared transaction from
TwoPhaseState->prepXacts. It is placed at the very end of the function,
after the post-commit callbacks that clear out the locks held by the
transaction. Those callbacks are not guarded by the TwoPhaseStateLock,
resulting in a possibility for a concurrent session to proceed will
MarkAsPreparing after acquiring the locks released by them.

I couldn’t find any documentation on the expected outcome in cases like
this, so I assume it might not be a bug, but an undocumented behavior.

Should I go about and produce a patch to put a note in the description of
commit prepared, or is there any interest in changing the behavior to avoid
such conflicts altogether (perhaps by holding the lock throughout the
cleanup phase)?

Regards,
Oleksii Kliukin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-02-18 16:18:19 Re: Reporting script runtimes in pg_regress
Previous Message Bernd Helmle 2019-02-18 15:59:25 Re: Progress reporting for pg_verify_checksums