shm_mq_wait_internal gets stuck forever on fast shutdown

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: shm_mq_wait_internal gets stuck forever on fast shutdown
Date: 2017-08-21 02:57:40
Message-ID: CAMsr+YHmm=01LsuEYR6YdZ8CLGfNK_fgdgi+QXUjF+JeLPvZQg@mail.gmail.com
Lists: pgsql-hackers

Hi all

I've noticed a possible bug / design limitation where shm_mq_wait_internal
can sleep in a latch wait forever, leaving the postmaster stuck waiting for
the bgworker that is running the wait to exit.

This happens when the shm_mq does not have an associated bgworker handle
registered, either because the other end is not known at mq creation time
or because it is a normal backend rather than a bgworker. In both cases no
BGW handle can be passed.

shm_mq_wait_internal() will CHECK_FOR_INTERRUPTS() when its latch wait is
interrupted by a SIGTERM. But it doesn't actually respond to SIGTERM in any
way; it just merrily resets its latch and keeps looping.

It will bail out correctly on SIGQUIT.

If the proc waiting to attach was known at queue creation time and was a
bgworker, we'd pass a bgworker handle and the mq would notice it failed to
start and stop waiting. There's only a problem if no bgworker handle can be
supplied.

The underlying problem is that CHECK_FOR_INTERRUPTS() doesn't care about
SIGTERM or have any way to test for it. And we don't have any global
management of SIGTERM like we do for SIGQUIT, so the shm_mq_wait_internal
loop can't test for it.

The only ways I can see to fix this are:

* Generalize SIGTERM handling across postgres, so there's a global
"got_SIGTERM" flag that shm_mq_wait_internal can test to break out of its
loop, and every backend's signal handler must set it. Lots of churn.

* In a proc's signal handler, use globals set before entry and after exit
from shm_mq operations to detect if we're currently in shm_mq and promote
SIGTERM to SIGQUIT by sending a new signal to ourselves. Or set up state so
CHECK_FOR_INTERRUPTS() will notice when the handler returns.

* Allow shm_mq_attach to be passed either a bool * that is tested for
SIGTERM, or a function pointer called on each iteration to test whether
looping should continue. So if you can't supply a bgw handle, you supply
that instead. Provide a shm_mq_set_handle equivalent for it too.
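
A rough sketch of how the last option might look, as a standalone toy model
rather than a patch. Everything here is hypothetical: ToyQueue, ToyWaitResult,
toy_continue_cb, toy_mq_set_stop_flag, toy_mq_set_continue_cb, toy_mq_wait and
toy_countdown_cb do not exist in PostgreSQL, and the bounded loop only stands
in for the real latch wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true while waiting should continue. */
typedef bool (*toy_continue_cb) (void *arg);

typedef enum ToyWaitResult
{
    TOY_ATTACHED,               /* the other end showed up */
    TOY_STOPPED,                /* stop flag set or callback said stop */
    TOY_GAVE_UP                 /* iteration bound reached (toy only) */
} ToyWaitResult;

typedef struct ToyQueue
{
    bool        peer_attached;
    const volatile bool *stop_flag;     /* optional, may be NULL */
    toy_continue_cb continue_cb;        /* optional, may be NULL */
    void       *continue_arg;
} ToyQueue;

/* Analogue of a "set handle" call, for the bool-pointer variant. */
static void
toy_mq_set_stop_flag(ToyQueue *q, const volatile bool *flag)
{
    q->stop_flag = flag;
}

/* Analogue of a "set handle" call, for the function-pointer variant. */
static void
toy_mq_set_continue_cb(ToyQueue *q, toy_continue_cb cb, void *arg)
{
    q->continue_cb = cb;
    q->continue_arg = arg;
}

/* Wait loop that consults the caller-supplied test on every wakeup. */
static ToyWaitResult
toy_mq_wait(ToyQueue *q, int max_wakeups)
{
    assert(q != NULL);

    for (int i = 0; i < max_wakeups; i++)
    {
        if (q->peer_attached)
            return TOY_ATTACHED;
        if (q->stop_flag != NULL && *q->stop_flag)
            return TOY_STOPPED;
        if (q->continue_cb != NULL && !q->continue_cb(q->continue_arg))
            return TOY_STOPPED;
        /* a real implementation would sleep on a latch here */
    }
    return TOY_GAVE_UP;
}

/* Example callback: allow as many more wakeups as *arg says, then stop. */
static bool
toy_countdown_cb(void *arg)
{
    int        *remaining = (int *) arg;

    if (*remaining <= 0)
        return false;
    (*remaining)--;
    return true;
}
```

The bool-pointer form is the cheaper one when the caller's SIGTERM handler
already sets a flag; the callback form covers callers that need to check
something more involved on each iteration.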

Any objections to the last approach?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
