Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dave Page <dave(dot)page(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, CM Team <cm(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, bernd(dot)helmle(at)credativ(dot)de
Subject: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date: 2014-09-29 18:46:20
Message-ID: CA+TgmoZJ7z9_1Jwvq8GeANBfjEjUcmNCgNzeHKVp42dSWd2SWA@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave(dot)page(at)enterprisedb(dot)com> wrote:
>> Hamid(at)EDB; Can you please have someone configure anole to build git
>> head as well as the other branches? Thanks.
>
> The test_shm_mq regression tests hung on this machine this morning.
> Hamid was able to give me access to log in and troubleshoot.
> Unfortunately, I wasn't able to completely track down the problem
> before accidentally killing off the running cluster, but it looks like
> test_shm_mq_pipelined() tried to start 3 background workers and the
> postmaster only ever launched one of them, so the test just sat there
> and waited for the other two workers to start. At this point, I have
> no idea what could cause the postmaster to be asleep at the switch
> like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true. But it only
starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop(). I think that's
just where it was when the backend that wanted to run test_shm_mq
requested that a few background workers get launched. Each
registration would have sent the postmaster a separate SIGUSR1, but
for some reason the postmaster only received one, which I think is
legit behavior, though possibly not typical on modern Linux systems.
When the SIGUSR1 arrived, the postmaster jumped into
sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
which launched the first background worker. Then it returned, and the
arrival of the signal did NOT interrupt the pending select().
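
To make the "three requests, one signal" part of this theory concrete, here is a minimal standalone sketch. It is not postmaster code: start_one_worker() and workers_started_after() are hypothetical names, and the handler only mimics the "start one worker per delivery" pattern described above. SIGUSR1 is blocked while the requests are sent purely to force the signals to coalesce deterministically; on systems that do not queue standard signals (Linux among them), the handler should then run only once.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t pending_requests = 0;
static volatile sig_atomic_t workers_started = 0;

/* Hypothetical stand-in for the handler: start at most one worker per delivery. */
static void
start_one_worker(int signo)
{
    (void) signo;
    if (pending_requests > 0)
    {
        pending_requests--;
        workers_started++;
    }
}

/*
 * Register nrequests worker requests, sending one SIGUSR1 per request while
 * the signal is blocked, then unblock it and report how many workers the
 * handler actually started.
 */
int
workers_started_after(int nrequests)
{
    struct sigaction sa;
    sigset_t    usr1;
    int         rc;
    int         i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = start_one_worker;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;

    pending_requests = 0;
    workers_started = 0;

    /* Hold SIGUSR1 pending so the repeated signals can merge (assumption:
     * this stands in for signals arriving close together). */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, NULL);

    for (i = 0; i < nrequests; i++)
    {
        pending_requests++;
        kill(getpid(), SIGUSR1);
    }

    /* Any pending SIGUSR1 is delivered before this call returns. */
    sigprocmask(SIG_UNBLOCK, &usr1, NULL);

    return workers_started;
}
```

Calling workers_started_after(3) on a system that merges pending standard signals leaves two requests unserved, which matches the symptom of one worker launched and two never started. Nothing else in the sketch prompts another pass over the pending requests, so they stay stranded until some later event happens to arrive.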

This chain of events can't occur if an arriving SIGUSR1 causes
select() to fail with EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
Between them, those two conditions seem likely sufficient to explain
why this doesn't occur on other machines.
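
Whether a handled signal interrupts a pending select() can be probed directly. The sketch below is a standalone test program fragment, not PostgreSQL code; select_was_interrupted() and note_sigusr1() are hypothetical names. It forks a child that sends SIGUSR1 to the parent after a short delay and reports whether the parent's select() failed with EINTR. Without SA_RESTART, POSIX requires the EINTR result; with SA_RESTART, the behavior is implementation-defined, which is the case that matters for the theory above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigusr1 = 0;

static void
note_sigusr1(int signo)
{
    (void) signo;
    got_sigusr1 = 1;
}

/*
 * Returns 1 if a SIGUSR1 arriving during select() made it fail with EINTR,
 * 0 if select() was not interrupted (it was restarted or simply timed out).
 */
int
select_was_interrupted(int use_sa_restart)
{
    struct sigaction sa;
    struct timeval timeout = {2, 0};    /* upper bound on the wait */
    pid_t       child;
    int         rc;
    int         saved_errno;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = note_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    got_sigusr1 = 0;
    child = fork();
    assert(child >= 0);
    if (child == 0)
    {
        /* Child: give the parent time to enter select(), then signal it. */
        struct timespec delay = {0, 200 * 1000 * 1000};

        nanosleep(&delay, NULL);
        kill(getppid(), SIGUSR1);
        _exit(0);
    }

    /* Parent: sleep in select() with no file descriptors, as a bare wait. */
    rc = select(0, NULL, NULL, NULL, &timeout);
    saved_errno = errno;

    /* Reap the child, retrying if waitpid() itself is interrupted. */
    while (waitpid(child, NULL, 0) < 0 && errno == EINTR)
        ;

    return got_sigusr1 && rc == -1 && saved_errno == EINTR;
}
```

On Linux, select() is documented to fail with EINTR even when the handler was installed with SA_RESTART, so both select_was_interrupted(0) and select_was_interrupted(1) should report 1 there. A platform where the second call reports 0 would be one where the sequence described above is possible.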

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
