Re: intermittent failures in Cygwin from select_parallel tests

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: intermittent failures in Cygwin from select_parallel tests
Date: 2017-06-06 20:25:37
Message-ID: 31674.1496780737@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hmm. With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?

> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it.

Yeah, me either, but we're really theorizing in advance of the data here.
Andrew, could you apply the attached patch on lorikeet and run the
regression tests enough times to get a couple of failures? Then grepping
the postmaster log for 'parallel worker' should give you results like

2017-06-06 16:20:12.393 EDT [31216] LOG: starting PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.400 EDT [31216] LOG: stopping PID 31216, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.406 EDT [31217] LOG: starting PID 31217, parallel worker for PID 31215, worker number 3
2017-06-06 16:20:12.406 EDT [31218] LOG: starting PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.406 EDT [31219] LOG: starting PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.406 EDT [31220] LOG: starting PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31218] LOG: stopping PID 31218, parallel worker for PID 31215, worker number 2
2017-06-06 16:20:12.412 EDT [31219] LOG: stopping PID 31219, parallel worker for PID 31215, worker number 1
2017-06-06 16:20:12.412 EDT [31220] LOG: stopping PID 31220, parallel worker for PID 31215, worker number 0
2017-06-06 16:20:12.412 EDT [31217] LOG: stopping PID 31217, parallel worker for PID 31215, worker number 3
... etc etc ...

If it looks different from that in a crash case, we'll have something
to go on.

(I'm tempted to add something like this permanently, at DEBUG1 or DEBUG2
or so.)

regards, tom lane

Attachment Content-Type Size
log-parallel-worker-start-stop.patch text/x-diff 1.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2017-06-06 20:25:46 Re: PG10 transition tables, wCTEs and multiple operations on the same table
Previous Message Joe Conway 2017-06-06 20:07:13 Re: BUG #14682: row level security not work with partitioned table