From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> |
Cc: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Race condition in TransactionIdIsInProgress |
Date: | 2022-06-23 19:03:27 |
Message-ID: | 3d97f657-8a8e-4694-d636-eeb29c2626f0@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 12/02/2022 05:42, Andres Freund wrote:
> On 2022-02-11 16:41:24 -0800, Andres Freund wrote:
>> FWIW, I've indeed reproduced this fairly easily with such a setup. A pgbench
>> r/w workload that's been modified to start 70 savepoints at the start shows
>>
>> pgbench: error: client 22 script 0 aborted in command 12 query 0: ERROR: t_xmin 3853739 is uncommitted in tuple (2,159) to be updated in table "pgbench_branches"
>> pgbench: error: client 13 script 0 aborted in command 12 query 0: ERROR: t_xmin 3954305 is uncommitted in tuple (2,58) to be updated in table "pgbench_branches"
>> pgbench: error: client 7 script 0 aborted in command 12 query 0: ERROR: t_xmin 4017908 is uncommitted in tuple (3,44) to be updated in table "pgbench_branches"
>>
>> after a few minutes of running with a local, not slowed down, syncrep. Without
>> any other artifical slowdowns or such.
>
> And this can easily be triggered even without subtransactions, in a completely
> reliable way.
>
> The only reason I'm so far not succeeding in turning it into an
> isolationtester spec is that a transaction waiting for SyncRep doesn't count
> as waiting for isolationtester.
>
> Basically
>
> S1: BEGIN; $xid = txid_current(); UPDATE; COMMIT; <commit wait for syncrep>
> S2: SELECT pg_xact_status($xid);
> S2: UPDATE;
>
> suffices, because the pg_xact_status() causes an xlog fetch, priming the xid
> cache, which then causes the TransactionIdIsInProgress() to take the early
> return path, despite the transaction still being in progress. Which then
> allows the update to proceed, despite the S1 not having "properly committed"
> yet.
I started to improve isolationtester, to allow the spec file to specify
a custom query to check for whether the backend is blocked. But then I
realized that it's not quite enough: there is currently no way write the
"$xid = txid_current()" step either. Isolationtester doesn't have
variables like that. Also, you need to set synchronous_standby_names to
make it block, which seems a bit weird in an isolation test.
I wish we had an explicit way to inject delays or failures. We discussed
it before
(https://www.postgresql.org/message-id/flat/CANXE4TdxdESX1jKw48xet-5GvBFVSq%3D4cgNeioTQff372KO45A%40mail.gmail.com),
but I don't want to pick up that task right now.
Anyway, I wrote a TAP test for this instead. See attached. I'm not sure
if this is worth committing, the pg_xact_status() trick is quite special
for this particular bug.
Simon's just_remove_TransactionIdIsKnownCompleted_call.v1.patch looks
good to me. Replying to the discussion on that:
On 12/02/2022 23:50, Andres Freund wrote:
> On 2022-02-12 13:25:58 +0000, Simon Riggs wrote:
>> I'm not sure it is possible to remove TransactionIdIsKnownCompleted()
>> in backbranches.
>
> I think it'd be fine if we needed to. Or we could just make it always return
> false or such.
>
>
>>>> just removes the known offending call, passes make check, but IMHO
>>>> leaves the same error just as likely by other callers.
>>>
>>> Which other callers are you referring to?
>>
>> I don't know of any, but we can't just remove a function in a
>> backbranch, can we?
>
> We have in the past, if it's a "sufficiently internal" function.
I think we should remove it in v15, and leave it as it is in
backbranches. Just add a comment to point out that the name is a bit
misleading, because it checks the clog rather than the proc array. It's
not inherently dangerous, and someone might have a legit use case for
it. True, someone might also be doing this incorrect thing with it, but
still seems like the right balance to me.
I think we also need to change pg_xact_status(), to also call
TransactionIdIsInProgress() before TransactionIdDidCommit/Abort().
Currently, if it returns "committed", and you start a new transaction,
the new transaction might not yet see the effects of that "committed"
transaction. With that, you cannot reproduce the original bug with
pg_xact_status() though, so there's no point in committing that test then.
In summary, I think we should:
- commit and backpatch Simon's
just_remove_TransactionIdIsKnownCompleted_call.v1.patch
- fix pg_xact_status() to check TransactionIdIsInProgress()
- in v15, remove TransationIdIsKnownCompleted function altogether
I'll try to get around to that in the next few days, unless someone
beats me to it.
- Heikki
Attachment | Content-Type | Size |
---|---|---|
add-test-for-xid-cache-bug-1.patch | text/x-patch | 3.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Justin Pryzby | 2022-06-23 19:31:26 | Re: CI and test improvements |
Previous Message | Robert Haas | 2022-06-23 18:42:17 | Re: NAMEDATALEN increase because of non-latin languages |