Re: Windows buildfarm members vs. new async-notify isolation test

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Windows buildfarm members vs. new async-notify isolation test
Date: 2019-12-03 15:11:47
Message-ID: 362bca0b-1c1c-c760-ab19-b5d9a14c69ea@gmail.com
Lists: pgsql-hackers

On 12/2/19 11:42 AM, Andrew Dunstan wrote:
>
> On 12/2/19 11:23 AM, Tom Lane wrote:
>> I see from the buildfarm status page that since commits 6b802cfc7
>> et al went in a week ago, frogmouth and currawong have failed that
>> new test case every time, with the symptom
>>
>> ================== pgsql.build/src/test/isolation/regression.diffs ===================
>> *** c:/prog/bf/root/REL_10_STABLE/pgsql.build/src/test/isolation/expected/async-notify.out Mon Nov 25 00:30:49 2019
>> --- c:/prog/bf/root/REL_10_STABLE/pgsql.build/src/test/isolation/results/async-notify.out Mon Dec 2 00:54:26 2019
>> ***************
>> *** 93,99 ****
>> step llisten: LISTEN c1; LISTEN c2;
>> step lcommit: COMMIT;
>> step l2commit: COMMIT;
>> - listener2: NOTIFY "c1" with payload "" from notifier
>> step l2stop: UNLISTEN *;
>>
>> starting permutation: llisten lbegin usage bignotify usage
>> --- 93,98 ----
>>
>> (Note that these two critters don't run branches v11 and up, which
>> is why they're only showing this failure in 10 and 9.6.)
>>
>> drongo showed the same failure once in v10, and fairywren showed
>> it once in v12. Every other buildfarm animal seems happy.
>>
>> I'm a little baffled as to what this might be --- some sort of
>> timing problem in our Windows signal emulation, perhaps? But
>> if so, why haven't we found it years ago?
>>
>> I don't have any ability to test this myself, so would appreciate
>> help or ideas.
>
>
>
> I can test things, but I don't really know what to test. FYI frogmouth
> and currawong run on virtualized XP. drongo and fairywren run on
> virtualized WS2019. Neither VM is heavily resourced.

Hi Andrew, if you have time, you could perhaps check the
isolation test structure itself. Like Tom, I don't have a
Windows box to test this on.

I would be curious to see if there is a race condition in
src/test/isolation/isolationtester.c between the loop starting
on line 820:

    while ((res = PQgetResult(conn)))
    {
        ...
    }

and the attempt to consume input that might include NOTIFY
messages on line 861:

    PQconsumeInput(conn);

If the first loop consumes the commit message, gets no
further PGresult from PQgetResult, and finishes, and execution
then proceeds to PQconsumeInput before the NOTIFY has arrived
over the socket, there won't be anything for PQnotifies to
return, and hence nothing for try_complete_step to print
before returning.

I'm not sure if it is possible for the commit message to
arrive before the notify message in the fashion I am describing,
but that's something you might easily check by having
isolationtester sleep before PQconsumeInput on line 861.

--
Mark Dilger
