Re: intermittent failures in Cygwin from select_parallel tests

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: intermittent failures in Cygwin from select_parallel tests
Date: 2017-06-06 19:07:48
Message-ID: CA+TgmoYGQViFsVPeMQM+9KvDAiPCEY1SmuH4=UrbfVjUswQ9ig@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> One thought is that the only places where shm_mq_set_sender() should
>> be getting invoked during the main regression tests are
>> ParallelWorkerMain() and ExecParallelGetReceiver(), and both of those
>> places use ParallelWorkerNumber to figure out what address to pass.
>> So if ParallelWorkerNumber were getting set to the same value in two
>> different parallel workers - e.g. because the postmaster went nuts and
>> launched two processes instead of only one - or if
>> ParallelWorkerNumber were not getting initialized at all or were
>> getting initialized to some completely bogus value, it could cause
>> this symptom.
>
> Hmm. With some generous assumptions it'd be possible to think that
> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
> present in 20 successful lorikeet runs before the first of these failures,
> which is a bit more than the MTBF after that, but not a huge amount more.
>
> That commit in itself looks innocent enough, but could it have exposed
> some latent bug in bgworker launching?
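
For illustration only, the quoted hypothesis about ParallelWorkerNumber
can be sketched in plain C. Nothing here is PostgreSQL code: DemoQueue,
demo_queue_for_worker() and demo_set_sender() are hypothetical stand-ins,
and the assumption is simply that each worker picks its queue by indexing
with its worker number and that a queue accepts only one sender.

```c
#include <assert.h>

#define DEMO_NWORKERS   2
#define DEMO_NO_SENDER  (-1)

/* Hypothetical stand-in for a per-worker queue with one sender slot. */
typedef struct DemoQueue
{
    int sender_pid;             /* DEMO_NO_SENDER until a sender attaches */
} DemoQueue;

/* Pick the queue address from the worker number, as the theory assumes. */
static DemoQueue *
demo_queue_for_worker(DemoQueue *queues, int worker_number)
{
    assert(worker_number >= 0 && worker_number < DEMO_NWORKERS);
    return &queues[worker_number];
}

/* Attach a sender; returns 0 on success, -1 if one is already attached. */
static int
demo_set_sender(DemoQueue *q, int pid)
{
    if (q->sender_pid != DEMO_NO_SENDER)
        return -1;
    q->sender_pid = pid;
    return 0;
}

/*
 * Two workers attach in turn using the worker numbers they were given.
 * Returns the result of the second attach (or the first, if that failed).
 */
static int
demo_two_workers(int first_number, int second_number)
{
    DemoQueue queues[DEMO_NWORKERS] = {{DEMO_NO_SENDER}, {DEMO_NO_SENDER}};
    int first = demo_set_sender(demo_queue_for_worker(queues, first_number), 1001);
    int second = demo_set_sender(demo_queue_for_worker(queues, second_number), 1002);

    return (first == 0) ? second : first;
}
```

With distinct numbers (0 and 1) both attaches succeed; with the same
number given to both workers, the second attach lands on a queue that
already has a sender and is refused.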

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it. I mean, it seems like that commit could
make launching bgworkers faster, which could conceivably tickle some
heretofore-latent timing-related bug. But it wouldn't, IIUC, make the
first worker start any faster than before - it would just make the
subsequent launches more closely spaced, and it's not very obvious how
that would cause a problem.

Another idea is that the commit in question is managing to corrupt
BackgroundWorkerList somehow. maybe_start_bgworkers() is using
slist_foreach_modify(), but previously it always returned after
calling do_start_bgworker, and now it doesn't. So if
do_start_bgworker() did something that could modify the list
structure, then perhaps maybe_start_bgworkers() would get confused. I
don't really think that this theory has any legs, though.
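
To make the concern concrete, here is a rough sketch of the general
hazard, not of the actual postmaster code. It uses a hand-rolled list
rather than ilist.h; Node, unlink_after() and walk_caching_next() are
made-up names. The assumption is that a "modify-safe" walk saves the
successor before running the loop body, which protects against removing
the current node but not against the body changing the list elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly linked list node. */
typedef struct Node
{
    int          id;
    struct Node *next;
} Node;

/* Unlink (without freeing) the node that follows n, if any. */
static void
unlink_after(Node *n)
{
    if (n->next != NULL)
        n->next = n->next->next;
}

/*
 * Walk the list, saving the successor before the loop body runs.  While
 * visiting the node whose id is remove_after_id, the body unlinks that
 * node's successor.  Records visited ids and returns how many were seen.
 */
static int
walk_caching_next(Node *head, int remove_after_id, int *visited, int max)
{
    int   n = 0;
    Node *cur = head;

    assert(visited != NULL && max > 0);
    while (cur != NULL && n < max)
    {
        Node *next = cur->next;     /* saved before the body runs */

        visited[n++] = cur->id;
        if (cur->id == remove_after_id)
            unlink_after(cur);      /* body changes the list past cur */
        cur = next;                 /* may now be a node no longer linked */
    }
    return n;
}

/* Returns 1 if the unlinked node was still visited by the walk. */
static int
demo_stale_visit(void)
{
    Node c = {3, NULL};
    Node b = {2, &c};
    Node a = {1, &b};
    int  visited[8];
    int  n = walk_caching_next(&a, 1, visited, 8);

    return n == 3 && visited[1] == 2 && a.next == &c;
}
```

In this toy version the walk still visits node 2 even though it was
unlinked while node 1 was being processed. Whether do_start_bgworker()
can actually change the list in any such way is exactly the part I
don't have evidence for.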

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
