Re: Query running for very long time (server hanged) with parallel append

From: David Kohn <djk447(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Query running for very long time (server hanged) with parallel append
Date: 2018-02-02 23:17:49
Message-ID: CAJhMaBh4uUh--XvaPtiE9OPPWC3E-aXgcnysz38sSGOLRuyT5w@mail.gmail.com
Lists: pgsql-hackers

Please forgive my inexperience with the codebase, but as the guy who
reported this bugger:
https://www.postgresql.org/message-id/flat/151724453314(dot)1238(dot)409882538067070269%40wrigleys(dot)postgresql(dot)org#151724453314(dot)1238(dot)409882538067070269(at)wrigleys(dot)postgresql(dot)org,
I thought I'd follow your hints, as it's causing some major issues for me.
So some notes on what is happening for me and some (possibly silly)
thoughts on why:

On Fri, Feb 2, 2018 at 10:16 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Is it getting stuck here?
>
> /*
>  * We can't finish transaction commit or abort until all of the workers
>  * have exited. This means, in particular, that we can't respond to
>  * interrupts at this stage.
>  */
> HOLD_INTERRUPTS();
> WaitForParallelWorkersToExit(pcxt);
> RESUME_INTERRUPTS();
>
I am seeing unkillable queries with the client backend in the
IPC-BgWorkerShutdown wait event. As far as I can tell, that can only happen
in bgworker.c at WaitForBackgroundWorkerShutdown, which is called from
parallel.c at WaitForParallelWorkersToExit inside DestroyParallelContext.
That in turn looks like it should be called when there is a statement
timeout (which I think is happening in at least some of my cases), so it
would make sense that this is where the problem is.

My background workers are in the IPC-MessageQueuePutMessage wait event,
which appears to be possible only from pqmq.c at mq_putmessage. Directly
following the WaitLatch there is a CHECK_FOR_INTERRUPTS(), so if the worker
is waiting on that latch and never gets to the interrupt check, that would
explain things. It also appears that the worker sends a signal to the leader
process a few lines before it starts to wait, which is supposed to tell the
leader to come read messages off the queue. If the leader gets to
WaitForParallelWorkersToExit at the wrong time and ends up waiting on that
event, I could see how they would both end up waiting for the other and
never finishing.
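
To make the suspected cycle concrete, here is a toy model in C. It is only a
sketch of the reasoning above, not PostgreSQL code: the ParallelState struct,
the WaitState enum and both functions are hypothetical names invented for
illustration, and the model assumes a single worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not PostgreSQL code: what each process is blocked on. */
typedef enum
{
    WAIT_NONE,          /* running normally */
    WAIT_WORKER_EXIT,   /* leader: blocked until the worker has exited */
    WAIT_QUEUE_READ     /* worker: blocked until the leader reads the queue */
} WaitState;

typedef struct
{
    WaitState leader;
    WaitState worker;
    bool      leader_interrupts_held;   /* models HOLD_INTERRUPTS() */
} ParallelState;

/* Each side waits for something that only the other side can provide. */
bool
is_mutual_wait(const ParallelState *s)
{
    return s->leader == WAIT_WORKER_EXIT && s->worker == WAIT_QUEUE_READ;
}

/*
 * The mutual wait becomes unkillable when the leader also cannot respond
 * to interrupts, which is the situation the quoted code comment describes.
 */
bool
is_unkillable(const ParallelState *s)
{
    return is_mutual_wait(s) && s->leader_interrupts_held;
}
```

In this model the state I think I am seeing is leader = WAIT_WORKER_EXIT,
worker = WAIT_QUEUE_READ, with interrupts held in the leader.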

The thing is that DestroyParallelContext seems to be detaching from the
queues, but if the worker hit the wait step before the leader detached from
the queue, does it have any way of knowing that?
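
Put differently, a sender that is already asleep can only notice a detach if
the detaching side also wakes it up, and the sender then re-checks the queue.
The toy C model below illustrates just that condition. ToyQueue and the
toy_* functions are hypothetical and say nothing about what the real queue
code does; that is exactly the part I am unsure of.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy queue, not the PostgreSQL shm_mq implementation. */
typedef struct
{
    bool detached;      /* receiver has detached from the queue */
    bool latch_set;     /* sender's wakeup flag */
    int  free_slots;    /* space left for messages */
} ToyQueue;

typedef enum { SEND_OK, SEND_DETACHED, SEND_WOULD_BLOCK } SendResult;

/* Non-blocking send attempt: reports detach, success, or "queue full". */
SendResult
toy_try_send(ToyQueue *q)
{
    if (q->detached)
        return SEND_DETACHED;
    if (q->free_slots == 0)
        return SEND_WOULD_BLOCK;
    q->free_slots--;
    return SEND_OK;
}

/* Receiver detaches; wake_peer models whether it also wakes the sender. */
void
toy_detach(ToyQueue *q, bool wake_peer)
{
    q->detached = true;
    if (wake_peer)
        q->latch_set = true;
}

/*
 * Returns true if a sender sleeping on its latch would leave the wait loop.
 * Without a wakeup it never re-checks the queue, so it never sees the detach.
 */
bool
toy_sender_leaves_wait(ToyQueue *q)
{
    if (!q->latch_set)
        return false;               /* still asleep */
    q->latch_set = false;           /* consume the wakeup */
    return toy_try_send(q) != SEND_WOULD_BLOCK;
}
```

With a full queue, toy_detach(&q, false) leaves the sender asleep forever,
while toy_detach(&q, true) lets it see SEND_DETACHED and exit the loop.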

Anyway, I'm entirely unsure of my analysis here, but thought I'd offer
something to help speed this along.

Best,
David Kohn
