Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Dave Page <dave(dot)page(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, CM Team <cm(at)enterprisedb(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, bernd(dot)helmle(at)credativ(dot)de
Subject: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date: 2014-10-03 18:38:10
Message-ID: CA+Tgmobr+Q2WgWeasdbDNefVwJkAGALxA=-VtEGNtQgL1V2Ryw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Further debugging reveals that sigusr1_handler() gets called
> repeatedly, to start autovacuum workers, and it keeps waking up and
> starting them. But that doesn't cause the background workers to get
> started either, because although sigusr1_handler() contains a call to
> maybe_start_bgworker, it only does that if start_bgworker = true,
> which only happens if
> CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> which on these calls it isn't.
> And I think this might also be the missing ingredient answering
> Andres's question from before: why doesn't the 60-second
> select()-timeout cause the background worker to eventually start even
> if the SELECT doesn't get interrupted? There seems to be a SIGUSR1
> arriving about every 3 seconds, and I bet that's resetting the timer
> every time.

For now I propose to address this by committing the attached patch,
which gets rid of the separate start_bgworker flag inside
sigusr1_handler() and instead uses the same StartWorkerNeeded ||
HaveCrashedWorker test that we use inside ServerLoop() to decide
whether to call maybe_start_bgworker(). Either more signals will
arrive (in which case the signal handler will launch an additional
background worker every time a signal arrives) or they won't (in which
case the 60-second timeout will eventually expire, and ServerLoop()
will kick into high gear and satisfy all outstanding requests). This
isn't really right, because there might still be a quite noticeable
delay waiting for workers to get launched, but at least the delay
would be bounded to at most 60 seconds rather than, as at present,
potentially infinite.

A totally correct fix will require a bit more thought. A search of
the git history reveals that the problem of a signal restarting the
timeout is not new: Tom fixed a similar problem back in 2007 by making
the autovacuum launcher sleep for at most a second at a time. Such a
fix isn't ideal here, because we really don't want an up-to-1-second
delay launching a newly-registered background worker if there's a way
to avoid that -- it's probably OK for launching daemons, but it's not
so hot for parallel query. However, we could:

(1) Use the self-pipe trick. We could not directly use a latch, at
least not without a new API, because we might be waiting on more than
one socket.

(2) Have the postmaster not set SA_RESTART for the sigusr1 handler. I
don't know how platform-independent this approach would be.

(3) Have sigusr1_handler repeatedly call maybe_start_bgworker() until
StartWorkerNeeded becomes false, instead of just calling it once.
ServerLoop() is carefully coded to call maybe_start_bgworker() just
once per iteration, presumably to make sure the server stays
responsive even if the bgworker-starting machinery is quite busy;
looping inside the signal handler would give up that nice property
unless we had some way to break out of the loop if there's activity on
the socket.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
prevent-infinite-hang-on-anole.patch text/x-patch 926 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2014-10-03 18:39:11 Re: Dynamic LWLock tracing via pg_stat_lwlock (proof of concept)
Previous Message Alvaro Herrera 2014-10-03 18:33:46 Re: replicating DROP commands across servers