Re: failures in t/031_recovery_conflict.pl on CI

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI
Date: 2022-04-29 20:08:09
Message-ID: 20220429200809.uuuseakmhb57sppk@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached are patches for this issue.

It adds a test case for deadlock conflicts to make sure that case isn't
broken. I also tested the recovery conflict tests in the back branches, and
they work there with a reasonably small set of changes.

Questions:
- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
preceding numbers are unused. It seems way more problematic to use a
different number in the backbranches than have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the
backbranches. For now I've just inlined the implementation, but I guess we
could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
- older branches don't have allow_in_place_tablespaces - I think just
skipping tablespace conflicts is fine, they're comparatively
simple.

Eventually it might make sense to backpatch allow_in_place_tablespaces,
our test coverage in the area is quite poor.

- the stats tests can't easily made reliably in the backbranches - which is
ok, as the conflict itself is verified via the log

- some branches don't have log_recovery_conflict_waits, since it's not
critical to the test, it's ok to just not include it there

I played with the idea of handling the differences using version comparisons
in the code, and have the test be identically across branches. Since it's
something we don't do so far, I'm leaning against it, but ...

> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
> replay progress reporting. Is there a reason to keep them in the
> backbranches? Hard to see how an extension or something could rely on it,
> but ...?

I've left it as is for now, will start a separate thread.

> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
> "We assume that only UnpinBuffer() and the timeout requests established
> above can wake us up here."
>
> That bogus afaict? There's plenty other things that can cause MyProc->latch
> to be set. Is it worth doing something about this at the same time? Right
> now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
> succession initially.

The comment is more recent than I had realized. I raised this separately in
https://postgr.es/m/20220429191815.xewxjlpmq7mxhsr2%40alap3.anarazel.de

pgindent uses some crazy formatting nearby:
SendRecoveryConflictWithBufferPin(
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one
SendRecoveryConflictWithBufferPin() call instead of two, storing the type of
conflict in a local variable? Doesn't look entirely pretty either, but ...

I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(),
btw. But that'd be a non-backpatchable cleanup, I think:
* The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
* (when not InHotStandby) is performed here, for code clarity.

Greetings,

Andres Freund

Attachment Content-Type Size
vHEAD-0001-Fix-possibility-of-self-deadlock-in-ResolveRec.patch text/x-diff 2.9 KB
vHEAD-0002-Add-tests-for-recovery-deadlock-conflicts.patch text/x-diff 4.8 KB
vREL_10_STABLE-0001-Fix-possibility-of-self-deadlock-in-R.patch text/x-diff 2.7 KB
vREL_10_STABLE-0002-Backpatch-031_recovery_conflict.pl.patch text/x-diff 13.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-04-29 20:11:45 Re: [RFC] building postgres with meson -v8
Previous Message Tom Lane 2022-04-29 19:38:03 Re: fix cost subqueryscan wrong parallel cost