Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Mark Wong <mark(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-11 21:13:04
Message-ID: 21051.1570828384@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> It's not very clear how those things would lead to an intermittent
> failure though. In the case of the postmaster crashes, we now see
> that timing of signal receipts is relevant. For infinite_recurse,
> maybe it only fails if an sinval interrupt happens at the wrong time?
> (This theory would predict that commit 798070ec0 made the problem
> way more prevalent than it had been ... need to go see if the
> buildfarm history supports that.)

That seems to fit, roughly: commit 798070ec0 moved errors.sql to execute
as part of a parallel group on 2019-04-11, and the first failure of the
infinite_recurse test happened on 2019-04-27. Since then we've averaged
about one such failure every four days, which makes a sixteen-day gap a
little more than you'd expect, but not a huge amount more. Anyway,
I do not see any other commits in between that would plausibly have
affected this.
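
The date arithmetic above can be checked with a short sketch. The two dates and the four-day average come from the message. The exponential waiting-time model is an assumption added here for illustration, not something the message claims, and the result is only as good as that assumption (for instance, the failure rate need not have been constant over the whole period).

```python
from datetime import date
import math

# Figures quoted in the message.
commit_date = date(2019, 4, 11)     # 798070ec0 moved errors.sql into a parallel group
first_failure = date(2019, 4, 27)   # first infinite_recurse failure
mean_interval_days = 4.0            # "about one such failure every four days"

# Gap between the commit and the first observed failure.
gap_days = (first_failure - commit_date).days
gap_in_mean_intervals = gap_days / mean_interval_days

# ASSUMPTION (not from the message): failures arrive as a Poisson process
# with a constant rate, so waiting times are exponentially distributed.
# Under that model, P(wait >= t) = exp(-t / mean_interval).
p_gap_at_least_this_long = math.exp(-gap_days / mean_interval_days)

print(f"gap: {gap_days} days ({gap_in_mean_intervals:.1f} mean intervals)")
print(f"P(gap >= {gap_days} days) under the assumed model: "
      f"{p_gap_at_least_this_long:.4f}")
```

The sketch only quantifies how long the initial gap is relative to the later average; it says nothing about which commit is responsible.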

In other news, I reproduced the problem with gcc on wobbegong's host,
and confirmed that the gcc build uses less stack space: one recursive
cycle of reaper() and sigusr1_handler() consumes 14768 bytes with clang,
but just 9296 bytes with gcc. So the evident difference in failure rate
between wobbegong and vulpes is nicely explained by that. Still no
theory about pg_upgrade versus vanilla "make check" though. I did manage
to make it happen during "make check" by dint of reducing the "ulimit -s"
setting, so it's *possible* for it to happen there, it just doesn't.
Weird.
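
To put the per-cycle stack figures in perspective, here is a rough sketch of how many reaper()/sigusr1_handler() cycles would fit under a given "ulimit -s" value. The byte counts are the ones measured above. The 8192 kB limit is a hypothetical value chosen for illustration (the message does not say what limit the hosts use), and the helper name cycles_that_fit is likewise made up for this example. The counts are upper bounds, since the rest of the call stack also consumes space.

```python
# Stack consumed by one recursive cycle of reaper() and sigusr1_handler(),
# as measured in the message.
BYTES_PER_CYCLE = {"clang": 14768, "gcc": 9296}

# ASSUMPTION: hypothetical stack limit, not taken from the message.
# "ulimit -s" reports the limit in units of 1024 bytes.
ASSUMED_STACK_LIMIT_KB = 8192


def cycles_that_fit(stack_limit_kb: int, bytes_per_cycle: int) -> int:
    """Upper bound on the number of whole cycles that fit in the stack limit."""
    return (stack_limit_kb * 1024) // bytes_per_cycle


for compiler, per_cycle in BYTES_PER_CYCLE.items():
    depth = cycles_that_fit(ASSUMED_STACK_LIMIT_KB, per_cycle)
    print(f"{compiler}: at most {depth} cycles in {ASSUMED_STACK_LIMIT_KB} kB")

# How much more stack the clang build uses per cycle than the gcc build.
clang_to_gcc_ratio = BYTES_PER_CYCLE["clang"] / BYTES_PER_CYCLE["gcc"]
print(f"clang/gcc stack use per cycle: {clang_to_gcc_ratio:.2f}x")
```

Whatever the real limit is, the gcc build can nest roughly 1.6 times as many cycles before running out of stack, which is consistent with the lower failure rate seen there. Lowering the limit shrinks both depths proportionally, in line with the failure becoming reproducible under a reduced "ulimit -s".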

regards, tom lane
