Re: Parallel query hangs after a smart shutdown is issued

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
Subject: Re: Parallel query hangs after a smart shutdown is issued
Date: 2020-08-12 18:54:38
Message-ID: CA+hUKGJZ9d2vUTZMPnNkZP_pJ-7++XyHP48K+OYEi6S4feUD4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 13, 2020 at 6:00 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > On Thu, Aug 13, 2020 at 3:32 AM Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> >> After a smart shutdown is issued(with pg_ctl), run a parallel query,
> >> then the query hangs.
>
> > Yeah, the current situation is not good. I think your option 2 sounds
> > better, because the documented behaviour of smart shutdown is that it
> > "lets existing sessions end their work normally". I think that means
> > that a query that is already running or allowed to start should be
> > able to start new workers and not have its existing workers
> > terminated. Arseny Sher wrote a couple of different patches to try
> > that last year, but they fell through the cracks:
> > https://www.postgresql.org/message-id/flat/CA%2BhUKGLrJij0BuFtHsMHT4QnLP54Z3S6vGVBCWR8A49%2BNzctCw%40mail.gmail.com
>
> I already commented on this in the other thread that Bharath started [1].
> I think the real issue here is why is the postmaster's SIGTERM handler
> doing *anything* other than disallowing new connections? It seems quite
> premature to kill support processes of any sort, not only parallel
> workers. The documentation says existing clients are allowed to end
> their work, not that their performance is going to be crippled until
> they end.

Right. It's pretty strange that during smart shutdown, you could run
for hours with no autovacuum, walwriter, bgwriter. I guess Arseny and
I were looking for a minimal change to fix a bug, but clearly there's a
more general problem and this change works out cleaner anyway.

> So I looked at moving the kills of all the support processes to happen
> after we detect that there are no remaining regular backends, and it
> seems to not be too hard. I realized that the existing PM_WAIT_READONLY
> state is doing that already, but just for a subset of support processes
> that it thinks might be active in hot standby mode. So what I did in the
> attached was to repurpose that state as "PM_WAIT_CLIENTS", which does the
> right thing in either regular or hot standby mode.

Make sense, works as expected and passes check-world.

> One other thing I changed here was to remove PM_WAIT_READONLY from the
> set of states in which we'll allow promotion to occur or a new walreceiver
> to start. I'm not convinced that either of those behaviors aren't
> bugs; although if someone thinks they're right, we can certainly put
> back PM_WAIT_CLIENTS in those checks. (But, for example, it does not
> appear possible to reach PM_WAIT_READONLY/PM_WAIT_CLIENTS state with
> Shutdown == NoShutdown, so the test in MaybeStartWalReceiver sure looks
> like confusingly dead code to me. If we do want to allow restarting
> the walreceiver in this state, the Shutdown condition needs fixed.)

If a walreceiver is allowed to run, why should it not be allowed to
restart? Yeah, I suppose that other test'd need to be Shutdown <=
SmartShutdown, just like we do in SIGHUP_handler(). Looking at other
places where we test Shutdown == NoShutdown, one that jumps out is the
autovacuum wraparound defence stuff and the nearby
PMSIGNAL_START_AUTOVAC_WORKER code.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-08-12 19:28:26 Re: Parallel query hangs after a smart shutdown is issued
Previous Message Marco Atzeri 2020-08-12 18:31:06 ltree_plpython failure test on Cygwin for 12.4 test