Re: failures in t/031_recovery_conflict.pl on CI

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI
Date: 2022-04-13 00:26:26
Message-ID: 20220413002626.udl7lll7f3o7nre7@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-04-12 15:05:22 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
> >> +1. This is probably more feasible given the latch infrastructure
> >> than it was when that code was first written.
>
> > What do you think about just reordering the disable_all_timeouts() to be
> > before the got_standby_deadlock_timeout check in the back branches? I think
> > that should close at least the most obvious hole. And fix it properly in
> > HEAD?
>
> I don't have much faith in that, and I don't see why we can't fix it
> properly.

It's not too hard, agreed.

It's somewhat hard to test that it works though. The new recovery conflict
tests test all recovery conflicts except for deadlock ones, because they're
not easy to trigger... But I think I nearly got it working reliably.

It's probably worth backpatching the tests, after stripping them of the stats
checks?

Three questions:

- For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
replay progress reporting. Is there a reason to keep them in the
backbranches? Hard to see how an extension or something could rely on it,
but ...?

- I named the variable set by the signal handler got_standby_delay_timeout,
because just got_standby_timeout looked weird besides the _deadlock_. Which
makes me think that we should rename STANDBY_TIMEOUT to
STANDBY_DELAY_TIMEOUT too?

- There's the following comment in ResolveRecoveryConflictWithBufferPin():

"We assume that only UnpinBuffer() and the timeout requests established
above can wake us up here."

That bogus afaict? There's plenty other things that can cause MyProc->latch
to be set. Is it worth doing something about this at the same time? Right
now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
succession initially.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2022-04-13 02:20:59 Re: Improving the "Routine Vacuuming" docs
Previous Message Peter Geoghegan 2022-04-13 00:22:04 Re: Improving the "Routine Vacuuming" docs