Re: deadlock-hard flakiness

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: deadlock-hard flakiness
Date: 2023-02-08 22:11:45
Message-ID: 20230208221145.bwzhancellclrgia@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-07 17:10:21 -0800, Andres Freund wrote:
> diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out
> --- /tmp/cirrus-ci-build/src/test/isolation/expected/deadlock-hard.out 2023-02-07 05:32:34.536429000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/isolation-running/isolation/results/deadlock-hard.out 2023-02-07 05:40:33.833908000 +0000
> @@ -25,10 +25,11 @@
> step s6a7: <... completed>
> step s6c: COMMIT;
> step s5a6: <... completed>
> -step s5c: COMMIT;
> +step s5c: COMMIT; <waiting ...>
> step s4a5: <... completed>
> step s4c: COMMIT;
> step s3a4: <... completed>
> +step s5c: <... completed>
> step s3c: COMMIT;
> step s2a3: <... completed>
> step s2c: COMMIT;
>
>
> Commit 741d7f1047f fixed a similar issue in deadlock-hard. But it looks like
> we need something more. But perhaps this isn't an output ordering issue:
>
> How can we end up with s5c getting reported as waiting? I don't see how s5c
> could end up blocking on anything?

After looking through isolationtester's blocking detection logic, I started to
suspect that what we're seeing is a session blocked not by a heavyweight lock,
but by a safe snapshot. So I added logging to
pg_isolation_test_session_is_blocked(). It took a while to reproduce the
issue, but indeed:
https://cirrus-ci.com/task/4901334571286528
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/isolation-running/isolation/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/4901334571286528/testrun/build/testrun/runningcheck.log

indicate that we were indeed blocked by a snapshot:
2023-02-08 21:30:12.123 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8971:0] LOG: pid 9280 blocked due to snapshot by pid: 0
...
2023-02-08 21:30:12.155 UTC [9276][client backend] [isolation/deadlock-hard/control connection][3/8973:0] LOG: pid 9278 blocked due to snapshot by pid: 0
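
For context: isolationtester decides a step is "waiting" by calling
pg_isolation_test_session_is_blocked() for the step's backend, and that
function falls back to a safe-snapshot check after the heavyweight-lock check.
The tail of it looks roughly like the sketch below; the elog() is the
debugging addition that produced the log lines above (hypothetical wording,
not committed code), and the variable names are from memory:

	/*
	 * Tail of pg_isolation_test_session_is_blocked() (sketch): if no
	 * interesting heavyweight-lock blocker was found, ask predicate.c
	 * whether the backend is waiting in GetSafeSnapshot() for a
	 * SERIALIZABLE READ ONLY DEFERRABLE snapshot.
	 */
	if (GetSafeSnapshotBlockingPids(blocked_pid, &dummy, 1) > 0)
	{
		/* hypothetical debugging elog that produced the lines above */
		elog(LOG, "pid %d blocked due to snapshot by pid: %d",
			 blocked_pid, dummy);
		PG_RETURN_BOOL(true);
	}

	PG_RETURN_BOOL(false);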

It's unclear why we end up without a pid. It looks like 2PC removes the pid
from the field? In the problematic case, the prepared_xacts test is indeed
scheduled concurrently:

2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] ERROR: transaction identifier "foo3" is already in use
2023-02-08 21:30:12.100 UTC [9397][client backend] [pg_regress/prepared_xacts][23/1296:39171] STATEMENT: PREPARE TRANSACTION 'foo3';

foo3, for example, does use SERIALIZABLE.

I don't really understand how GetSafeSnapshotBlockingPids() can end up finding
deadlock-hard's sessions blocked by a safe snapshot. Afaict nothing uses
serializable in that test. How can SXACT_FLAG_DEFERRABLE_WAITING be set for
the sxact of a backend that never ran a serializable transaction? Are we
possibly forgetting to clear it somewhere?
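
To illustrate why that seems impossible: afaict GetSafeSnapshotBlockingPids()
only reports blockers if the blocked backend's own SERIALIZABLEXACT has
SXACT_FLAG_DEFERRABLE_WAITING set, i.e. it is parked in GetSafeSnapshot() for
a SERIALIZABLE READ ONLY DEFERRABLE transaction. Roughly (a from-memory sketch
of the dlist-based version, not a verbatim quote of predicate.c):

	/* Find the blocked backend's SERIALIZABLEXACT, if it has one. */
	dlist_foreach(iter, &PredXact->activeList)
	{
		SERIALIZABLEXACT *sxact =
			dlist_container(SERIALIZABLEXACT, xactLink, iter.cur);

		if (sxact->pid == blocked_pid)
		{
			blocking_sxact = sxact;
			break;
		}
	}

	/* Only a backend waiting for a safe snapshot can have blockers here. */
	if (blocking_sxact != NULL && SxactIsDeferrableWaiting(blocking_sxact))
	{
		/* Collect the pids of all possible unsafe r/w conflicts. */
		dlist_foreach(iter, &blocking_sxact->possibleUnsafeConflicts)
		{
			RWConflict	conflict =
				dlist_container(RWConflictData, inLink, iter.cur);

			output[num_written++] = conflict->sxactOut->pid;
		}
	}

So the blocked backend would have to have an sxact of its own with that flag
set, which deadlock-hard's sessions shouldn't have.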

I don't think it should affect the reports here, but I did break something
when removing SHMQueue: GetSafeSnapshotBlockingPids() no longer checks
output_size. Will fix. Thomas, any chance you could do a pass through
96003717645 to see if I screwed up other things? I stared a lot at that
change, but I obviously missed at least one thing.
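
Concretely, the old SHMQueue-based loop stopped once num_written reached
output_size; the dlist conversion lost that bound, so the collection loop
above can write past the caller-provided buffer. The fix is presumably just
to restore the check, along these lines (a sketch, not the actual patch):

		dlist_foreach(iter, &blocking_sxact->possibleUnsafeConflicts)
		{
			RWConflict	conflict =
				dlist_container(RWConflictData, inLink, iter.cur);

			/* Stop once the caller-provided buffer is full. */
			if (num_written >= output_size)
				break;

			output[num_written++] = conflict->sxactOut->pid;
		}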

Greetings,

Andres Freund
