Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-07-24 05:15:14
Message-ID: 17389.1563945314@sss.pgh.pa.us
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection. We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of postmaster failures not involving that is epsilon.

> I don't have access to the build farm history in searchable format
> (I'll go and ask for that).

Yeah, it's definitely handy to be able to do SQL searches in the
history. I forget whether Dunstan or Frost is the person to ask
for access, but there's no reason you shouldn't have it.

> Do you have an example to hand? Is this
> failure always happening on Linux?

I dug around a bit further, and while my recollection of a lot of
"postmaster exited during a parallel transaction" failures is accurate,
there is a very strong correlation I'd not noticed: it's just a few
buildfarm critters that are producing those. To wit, I find that
string in these recent failures (checked all runs in the past 3 months):

  sysname  |    branch     |      snapshot
-----------+---------------+---------------------
 lorikeet  | HEAD          | 2019-06-16 20:28:25
 lorikeet  | HEAD          | 2019-07-07 14:58:38
 lorikeet  | HEAD          | 2019-07-02 10:38:08
 lorikeet  | HEAD          | 2019-06-14 14:58:24
 lorikeet  | HEAD          | 2019-07-04 20:28:44
 lorikeet  | HEAD          | 2019-04-30 11:00:49
 lorikeet  | HEAD          | 2019-06-19 20:29:27
 lorikeet  | HEAD          | 2019-05-21 08:28:26
 lorikeet  | REL_11_STABLE | 2019-07-11 08:29:08
 lorikeet  | REL_11_STABLE | 2019-07-09 08:28:41
 lorikeet  | REL_12_STABLE | 2019-07-16 08:28:37
 lorikeet  | REL_12_STABLE | 2019-07-02 21:46:47
 lorikeet  | REL9_6_STABLE | 2019-07-02 20:28:14
 vulpes    | HEAD          | 2019-06-14 09:18:18
 vulpes    | HEAD          | 2019-06-27 09:17:19
 vulpes    | HEAD          | 2019-07-21 09:01:45
 vulpes    | HEAD          | 2019-06-12 09:11:02
 vulpes    | HEAD          | 2019-07-05 08:43:29
 vulpes    | HEAD          | 2019-07-15 08:43:28
 vulpes    | HEAD          | 2019-07-19 09:28:12
 wobbegong | HEAD          | 2019-06-09 20:43:22
 wobbegong | HEAD          | 2019-07-02 21:17:41
 wobbegong | HEAD          | 2019-06-04 21:06:07
 wobbegong | HEAD          | 2019-07-14 20:43:54
 wobbegong | HEAD          | 2019-06-19 21:05:04
 wobbegong | HEAD          | 2019-07-08 20:55:18
 wobbegong | HEAD          | 2019-06-28 21:18:46
 wobbegong | HEAD          | 2019-06-02 20:43:20
 wobbegong | HEAD          | 2019-07-04 21:01:37
 wobbegong | HEAD          | 2019-06-14 21:20:59
 wobbegong | HEAD          | 2019-06-23 21:36:51
 wobbegong | HEAD          | 2019-07-18 21:31:36
(32 rows)

We already knew that lorikeet has its own peculiar stability
problems, and these other two critters run different compilers
on the same Fedora 27 ppc64le platform.

So I think I've got to take back the assertion that we've got
some lurking generic problem. This pattern looks way more
like a platform-specific issue. Overaggressive OOM killer
would fit the facts on vulpes/wobbegong, perhaps, though
it's odd that it only happens on HEAD runs.

regards, tom lane
