Re: failures in t/031_recovery_conflict.pl on CI

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: failures in t/031_recovery_conflict.pl on CI
Date: 2022-05-03 18:20:25
Message-ID: 20220503182025.wvbebs2ojk6vpi5f@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
> >> I can poke into that tomorrow, but are you sure that that isn't an
> >> expectable result?
>
> > It's not expected. But I think I might see what the problem is:
> > We wait for the FETCH (and thus the buffer pin to be acquired). But that
> > doesn't guarantee that the lock has been acquired. We can't check that with
> > pump_until() afaics, because there'll not be any output. But a query_until()
> > checking pg_locks should do the trick?
>
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.

Odd, given how readily it seems to reproduce on the bf. I assume you built with
> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

> So it's almost surely a timing issue, and your theory here seems plausible.

Unfortunately I don't think my theory holds, because I actually had added a
defense against this into the test that I forgot about momentarily...

    # just to make sure we're waiting for lock already
    ok( $node_standby->poll_query_until(
            'postgres', qq[
    SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
    ], 'waiting'),
        "$sect: lock acquisition is waiting");

and on longfin that step completes successfully.

I think what happens is that we get a buffer pin conflict, because these days
we can actually process buffer pin conflicts while waiting for a lock. The
easiest way to get around that is to increase the replay timeout for that
test, I think?

I think we need a restart, not a reload, because reloads aren't guaranteed to
be processed at any certain point in time :/.
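
Purely as an illustration, a minimal sketch of that approach might look like the
following. It assumes $node_standby is the test's existing
PostgreSQL::Test::Cluster standby and that the replay timeout in question is
max_standby_streaming_delay; the 10s value is a made-up placeholder, not the
value from the fix being tested.

```perl
# Sketch only: raise the replay timeout for this one test section.
# Assumption: $node_standby is an already-running PostgreSQL::Test::Cluster node.
# Assumption: '10s' is an illustrative placeholder value.
$node_standby->append_conf('postgresql.conf',
	"max_standby_streaming_delay = '10s'\n");

# Restart rather than reload, so the new setting is certain to be in
# effect before the test continues.
$node_standby->restart;
```

Later sections that rely on a short timeout would presumably need it lowered
again the same way.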

Testing a fix in a variety of timing circumstances now...

Greetings,

Andres Freund
