hyrax versus isolationtester.c's hard-wired timeouts

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: schmiddy(at)gmail(dot)com
Subject: hyrax versus isolationtester.c's hard-wired timeouts
Date: 2019-12-08 22:08:55
Message-ID: 22964.1575842935@sss.pgh.pa.us
Lists: pgsql-hackers

Buildfarm member hyrax has been intermittently failing the
deadlock-parallel isolation test ever since that test went in.
I finally got around to looking at this closely, and what
seems to be happening is simply that isolationtester.c's
hard-wired three-minute timeout for the completion of any
one test step is triggering. hyrax uses CLOBBER_CACHE_ALWAYS
and it seems to be a little slower than other animals using
CLOBBER_CACHE_ALWAYS, so it's unsurprising that it's showing
the symptom and nobody else is.

There are two things we could do about this:

1. Knock the hard-wired setting up a tad, maybe to 5 minutes.
Easy but doesn't seem terribly future-proof.

2. Make the limit configurable somehow, probably from an
environment variable. There's precedent for that (PGCTLTIMEOUT),
and it would provide a way for owners of especially slow buildfarm
members to adjust things ... but it would require owners of
especially slow buildfarm animals to adjust things.

Any preferences? (Actually, it wouldn't be unreasonable to do
both things, I suppose.)
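
For illustration, option 2 could look roughly like the sketch below. It
is not code from isolationtester.c. The helper name
get_step_timeout_secs and the environment variable name used in the
usage note are assumptions made up for this example. The only value
taken from the message above is the existing three-minute default.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Existing hard-wired limit described above: three minutes. */
#define DEFAULT_STEP_TIMEOUT_SECS 180

/*
 * Hypothetical helper: return the per-step timeout in seconds, taken from
 * the environment variable named by envname if it holds a positive integer,
 * else the built-in default.
 */
static int
get_step_timeout_secs(const char *envname)
{
	const char *val = getenv(envname);
	char	   *end;
	long		secs;

	/* Unset or empty: keep the default. */
	if (val == NULL || *val == '\0')
		return DEFAULT_STEP_TIMEOUT_SECS;

	errno = 0;
	secs = strtol(val, &end, 10);

	/* Reject trailing junk, overflow, and non-positive values. */
	if (errno != 0 || end == val || *end != '\0' ||
		secs <= 0 || secs > INT_MAX)
	{
		fprintf(stderr, "ignoring invalid %s value \"%s\"\n", envname, val);
		return DEFAULT_STEP_TIMEOUT_SECS;
	}

	return (int) secs;
}
```

A caller would evaluate something like
get_step_timeout_secs("PGISOLATIONTIMEOUT") once at startup (the
variable name is hypothetical here), so the owner of a slow animal only
has to export one setting. Falling back to the default on a malformed
value keeps a typo in the environment from disabling the timeout
altogether.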

BTW, I notice that isolationtester.c fails to print any sort of warning
notice when it decides it's waited too long. This seems like a
spectacularly bad idea in hindsight: it's not that obvious why the test
case failed. Plus there's no way to tell exactly which connection it
decided to send a PQcancel to. So independently of the timeout-length
issue, I think we ought to also make it print something like
"isolationtester: waited too long for something to happen, canceling
step thus-and-so".
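
A minimal sketch of such a notice follows. It is not the actual
isolationtester.c logic. The function names, the message layout, and the
idea of identifying the connection by an integer index are all
assumptions for this example. The real program would follow the notice
with its PQcancel call, which is left out here so the sketch stays
self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical check: has this step waited longer than the limit? */
static bool
step_timed_out(long waited_secs, long limit_secs)
{
	return waited_secs > limit_secs;
}

/*
 * Hypothetical formatter: build a notice naming the step being canceled,
 * the connection the cancel goes to, and how long the tester waited.
 * Returns the value of snprintf(), so a result >= buflen means the
 * message was truncated.
 */
static int
format_timeout_notice(char *buf, size_t buflen, const char *stepname,
					  int conn_index, long waited_secs)
{
	return snprintf(buf, buflen,
					"isolationtester: waited too long (%ld s) for step %s "
					"on connection %d, canceling it\n",
					waited_secs, stepname, conn_index);
}
```

The caller would test step_timed_out() inside its wait loop and, when it
returns true, write the formatted notice to stderr before sending the
cancel. That way a failure in the buildfarm log shows both that the
timeout fired and which session was canceled.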

regards, tom lane
