Re: intermittent failures in Cygwin from select_parallel tests

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: intermittent failures in Cygwin from select_parallel tests
Date: 2017-06-07 10:36:31
Message-ID: CAA4eK1+KK0kBM0OOTmUpbqbZPFBb0Um_2HxRE0wKDgLKwwYRAw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 7, 2017 at 12:37 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> One thought is that the only places where shm_mq_set_sender() should
>>> be getting invoked during the main regression tests are
>>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>>> places use ParallelWorkerNumber to figure out what address to pass.
>>> So if ParallelWorkerNumber were getting set to the same value in two
>>> different parallel workers - e.g. because the postmaster went nuts and
>>> launched two processes instead of only one - or if
>>> ParallelWorkerNumber were not getting initialized at all or were
>>> getting initialized to some completely bogus value, it could cause
>>> this symptom.
>>
>> Hmm. With some generous assumptions it'd be possible to think that
>> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
>> present in 20 successful lorikeet runs before the first of these failures,
>> which is a bit more than the MTBF after that, but not a huge amount more.
>>
>> That commit in itself looks innocent enough, but could it have exposed
>> some latent bug in bgworker launching?
>
> Hmm, that's a really interesting idea, but I can't quite put together
> a plausible theory around it. I mean, it seems like that commit could
> make launching bgworkers faster, which could conceivably tickle some
> heretofore-latent timing-related bug. But it wouldn't, IIUC, make the
> first worker start any faster than before - it would just make them
> more closely-spaced thereafter, and it's not very obvious how that
> would cause a problem.
>
> Another idea is that the commit in question is managing to corrupt
> BackgroundWorkerList somehow.
>

I don't think so, because this problem was already reported [1][2]
before the commit in question went in.

[1] - https://www.postgresql.org/message-id/1ce5a19f-3b1d-bb1c-4561-0158176f65f1%40dunslane.net
[2] - https://www.postgresql.org/message-id/25861.1472215822%40sss.pgh.pa.us
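
To make the hypothesis quoted above concrete, here is a toy Python model of
what could go wrong if two workers ended up with the same worker number, or
with a bogus one. It is only a sketch and not PostgreSQL code: ToyQueue,
attach_workers, the pids, and the one-sender-per-queue rule as modelled here
are all assumptions made up for illustration.

```python
# Toy model only: every name here is hypothetical, none of it is PostgreSQL code.


class ToyQueue:
    """Stands in for one message queue that has a single sender slot."""

    def __init__(self):
        self.sender = None

    def set_sender(self, pid):
        # Assumption for this sketch: a queue accepts exactly one sender.
        if self.sender is not None:
            raise RuntimeError(
                f"sender already set to pid {self.sender}, rejected pid {pid}"
            )
        self.sender = pid


def attach_workers(workers, nqueues):
    """Attach each (pid, worker_number) pair to the queue its number selects.

    Returns the pids whose attach attempt failed, in order.
    """
    queues = [ToyQueue() for _ in range(nqueues)]
    failed = []
    for pid, number in workers:
        # A bogus worker number points at no valid queue at all.
        if not 0 <= number < nqueues:
            failed.append(pid)
            continue
        try:
            queues[number].set_sender(pid)
        except RuntimeError:
            # Another worker already claimed this queue's sender slot.
            failed.append(pid)
    return failed


# Distinct worker numbers: every worker gets its own queue.
print(attach_workers([(101, 0), (102, 1)], nqueues=2))

# Duplicate worker number: the second worker collides on queue 0.
print(attach_workers([(101, 0), (102, 0)], nqueues=2))
```

In this model the duplicate-number case fails only for the second worker to
arrive, which is the kind of timing-dependent behaviour that would show up as
an intermittent failure rather than a consistent one.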

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
