From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dave Page <dave(dot)page(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, CM Team <cm(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, bernd(dot)helmle(at)credativ(dot)de |
Subject: | Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) |
Date: | 2014-09-29 18:52:29 |
Message-ID: | 20140929185229.GP16581@awork2.anarazel.de |
Lists: | pgsql-hackers |
On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave(dot)page(at)enterprisedb(dot)com> wrote:
> >> Hamid(at)EDB; Can you please have someone configure anole to build git
> >> head as well as the other branches? Thanks.
> >
> > The test_shm_mq regression tests hung on this machine this morning.
> > Hamid was able to give me access to log in and troubleshoot.
> > Unfortunately, I wasn't able to completely track down the problem
> > before accidentally killing off the running cluster, but it looks like
> > test_shm_mq_pipelined() tried to start 3 background workers and the
> > postmaster only ever launched one of them, so the test just sat there
> > and waited for the other two workers to start. At this point, I have
> > no idea what could cause the postmaster to be asleep at the switch
> > like this, but it seems clear that's what happened.
>
> This happened again, and I investigated further. It looks like the
> postmaster knows full well that it's supposed to start more bgworkers:
> the ones that never get started are in the postmaster's
> BackgroundWorkerList, and StartWorkerNeeded is true. But it only
> starts the first one, not all three. Why?
>
> Here's my theory. When I did a backtrace inside the postmaster, it
> was stuck inside select(), within ServerLoop(). I think that's
> just where it was when the backend that wanted to run test_shm_mq
> requested that a few background workers get launched. Each
> registration would have sent the postmaster a separate SIGUSR1, but
> for some reason the postmaster only received one, which I think is
> legit behavior, though possibly not typical on modern Linux systems.
> When the SIGUSR1 arrived, the postmaster jumped into
> sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
> which launched the first background worker. Then it returned, and the
> arrival of the signal did NOT interrupt the pending select().
>
> This chain of events can't occur if an arriving SIGUSR1 causes
> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
> signal handler is entered three separate times, once for each SIGUSR1.
> That combination of explanations seems likely sufficient to explain
> why this doesn't occur on other machines.
>
> The code seems to have been this way since the commit that introduced
> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
> although the function was called StartOneBackgroundWorker back then.
If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or once 60 seconds have passed? That's not to say
this isn't wrong, but still?
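
The quoted theory depends on several SIGUSR1s collapsing into a single delivery. Below is a minimal sketch of that behavior, not PostgreSQL code: it assumes a POSIX system and a single-threaded process, and the function name count_coalesced_deliveries is hypothetical. It blocks SIGUSR1, sends it several times, then unblocks it and counts how often the handler ran. Standard (non-realtime) signals are not queued, so the pending signals should be delivered only once.

```python
import os
import signal


def count_coalesced_deliveries(n_signals: int = 3) -> int:
    """Send SIGUSR1 to ourselves n_signals times while it is blocked,
    then unblock it and return how many times the handler actually ran.

    Hypothetical helper for illustration only; must run in the main thread.
    """
    deliveries = 0

    def handler(signum, frame):
        nonlocal deliveries
        deliveries += 1

    old_handler = signal.signal(signal.SIGUSR1, handler)
    # Block SIGUSR1 so that every signal sent below stays pending.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        for _ in range(n_signals):
            os.kill(os.getpid(), signal.SIGUSR1)
    finally:
        # Restoring the mask releases the pending signal. The kernel keeps
        # only one pending instance of a standard signal, so the handler
        # runs once no matter how many times the signal was sent.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        if old_handler is not None:
            signal.signal(signal.SIGUSR1, old_handler)
    return deliveries


if __name__ == "__main__":
    print(count_coalesced_deliveries(3))
```

This only illustrates the coalescing half of the theory. Whether a handled signal also interrupts a pending select() depends on the platform and on how the handler was installed, which the sketch does not try to reproduce.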
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services