Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Mark Wong <mark(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-14 15:50:31
Message-ID: 27924.1571068231@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...

Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform. There were only three:

mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL: Failed process was running: select infinite_recurse();
mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL: Failed process was running: select infinite_recurse();
mereswine | ARMv7 | Linux debian-armhf | Clarence Ho | HEAD | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL: Failed process was running: select infinite_recurse();

Looking closer at these, though, they were *not* SIGSEGV failures,
but SIGKILLs. Seeing that they were all on the same machine on the
same day, I'm thinking we can write them off as a transiently
misconfigured OOM killer.

So, pending some other theory emerging from the kernel hackers, we're
down to it's-a-PPC64-kernel-bug. That leaves me wondering what if
anything we want to do about it. Even if it's fixed reasonably promptly
in Linux upstream, and then we successfully nag assorted vendors to
incorporate the fix quickly, that's still going to leave us with frequent
buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux
versions.

Should we move the infinite_recurse test to happen alone in a parallel
group just to stop these failures? That's annoying from a parallelism
standpoint, but I don't see any other way to avoid these failures.
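
A minimal sketch of what "alone in a parallel group" would mean, assuming the
regression schedule format where each "test:" line lists the tests that run
concurrently. The test names and the isolate_test helper below are
hypothetical and for illustration only; this is not a proposed patch.

```python
def isolate_test(schedule_text: str, test_name: str) -> str:
    """Return schedule text with test_name moved into a group of its own."""
    out = []
    for line in schedule_text.splitlines():
        if not line.startswith("test:"):
            out.append(line)  # comments and blank lines pass through unchanged
            continue
        tests = line[len("test:"):].split()
        if test_name not in tests or len(tests) == 1:
            out.append(line)  # group does not contain it, or it is already alone
            continue
        rest = [t for t in tests if t != test_name]
        out.append("test: " + " ".join(rest))
        # The isolated test follows its old group, so it runs with no
        # concurrent sessions competing with it.
        out.append("test: " + test_name)
    return "\n".join(out) + "\n"


# Hypothetical excerpt; these are not the real schedule contents.
example = """# hypothetical schedule excerpt
test: test_a some_recursion_test test_b
test: test_c test_d
"""
print(isolate_test(example, "some_recursion_test"), end="")
```

The cost is the one mentioned above: the isolated group runs serially, so
total test time goes up by however long that test takes alone.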

regards, tom lane
