Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Mark Wong <mark(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-11 20:13:46
Message-ID: 18266.1570824826@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perhaps
> Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
> boxes that failed or if it could be relevant here). Perhaps the
> arbitrary tolerances mentioned in that comment are relevant.
> [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244

Hm, the bit about "we'll allow up to 1MB unconditionally" sure seems
to match up with the observations here. I also wonder about the
arbitrary definition of "a long way" as 2KB. Could it be that that
misbehaves in the face of a userland function with more than 2KB of
local variables?

It's not very clear how those things would lead to an intermittent
failure though. In the case of the postmaster crashes, we now see
that timing of signal receipts is relevant. For infinite_recurse,
maybe it only fails if an sinval interrupt happens at the wrong time?
(This theory would predict that commit 798070ec0 made the problem
way more prevalent than it had been ... need to go see if the
buildfarm history supports that.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Wong 2019-10-11 20:28:53 Re: stress test for parallel workers
Previous Message Tom Lane 2019-10-11 19:48:52 Re: Connect as multiple users using single client certificate