Re: Query running for very long time (server hanged) with parallel append

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Query running for very long time (server hanged) with parallel append
Date: 2018-02-05 09:59:27
Message-ID: CAJ3gD9eFR8=kxjPYBEHe34uT9+RYET0VbhGEfSt79eZx3L9E1Q@mail.gmail.com
Lists: pgsql-hackers

On 2 February 2018 at 20:46, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>> The query is actually hanging because one of the workers is in a small
>> loop where it iterates over the subplans searching for unfinished
>> plans, and it never comes out of the loop (it's a bug which I am yet
>> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
>> each iteration; it's a small loop that does not pass control to any
>> other functions .
>
> Uh, sounds like we'd better fix that bug.

The scenario is this: one of the workers, w1, hasn't yet chosen its
first plan, and all the plans are already finished. So w1 has its
node->as_whichplan equal to -1, and the condition below in
choose_next_subplan_for_worker() never becomes true for this worker:

    if (pstate->pa_next_plan == node->as_whichplan)
    {
        /* We've tried everything! */
        pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
        LWLockRelease(&pstate->pa_lock);
        return false;
    }
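
The failure mode can be reproduced outside the server with a small
model. The following is only a sketch, not PostgreSQL source: the
function name iterations_until_exit and the max_iters cap are
hypothetical, and it assumes every subplan is already finished and that
at least one partial plan exists. It advances the shared index the way
the loop is described above and reports when the exit test fires.

```c
#include <assert.h>

#define INVALID_SUBPLAN_INDEX (-1)

/*
 * Simplified model of the search loop when every subplan is finished.
 * Returns the iteration at which (next_plan == whichplan) first holds,
 * or -1 if it does not hold within max_iters iterations.
 * Hypothetical helper for illustration only.
 */
static int
iterations_until_exit(int nplans, int first_partial_plan,
                      int next_plan, int whichplan, int max_iters)
{
    assert(nplans > 0 && next_plan >= 0 && next_plan < nplans);
    assert(first_partial_plan >= 0 && first_partial_plan < nplans);

    for (int i = 1; i <= max_iters; i++)
    {
        if (next_plan < nplans - 1)
            next_plan++;                    /* advance to the next plan */
        else
            next_plan = first_partial_plan; /* wrap to first partial plan */

        if (next_plan == whichplan)
            return i;                       /* "We've tried everything!" */
    }
    return -1;                              /* exit test never fired */
}
```

With whichplan set to INVALID_SUBPLAN_INDEX, as for w1, the index only
ever takes values from 0 to nplans - 1, so the comparison cannot
succeed however many iterations are allowed.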

I think we should record which plan the search started with, before
the loop begins:

    initial_plan = pstate->pa_next_plan;

and then break out of the loop when we come back to this initial plan:

    if (initial_plan == pstate->pa_next_plan)
        break;

Now suppose initial_plan is a non-partial plan. When we wrap around to
the first partial plan, the check
(initial_plan == pstate->pa_next_plan) will never become true, because
initial_plan is less than first_partial_plan.

So what I have done in the patch is this: add a flag
'should_wrap_around' indicating whether we should wrap around. This
flag is false when initial_plan is a non-partial plan, in which case we
know that we will have covered all plans by the time we reach the last
plan. So on reaching the last plan we break from the loop if this flag
is false; otherwise we wrap around and break once we have reached the
initial plan again.
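
As a rough illustration of this termination rule, here is a standalone
sketch. It is not the attached patch and not PostgreSQL source: the
function find_unfinished_plan and its finished array (standing in for
the shared per-plan "finished" state) are hypothetical, and locking is
omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define INVALID_SUBPLAN_INDEX (-1)

/*
 * Search for an unfinished subplan starting at next_plan.
 * Returns its index, or INVALID_SUBPLAN_INDEX if none is found.
 * Hypothetical helper for illustration only.
 */
static int
find_unfinished_plan(const bool *finished, int nplans,
                     int first_partial_plan, int next_plan)
{
    assert(nplans > 0 && next_plan >= 0 && next_plan < nplans);
    assert(first_partial_plan >= 0 && first_partial_plan <= nplans);

    /* Remember where the search started. */
    const int   initial_plan = next_plan;

    /*
     * Wrapping goes back only to first_partial_plan, so the scan can
     * return to initial_plan only when initial_plan is itself partial.
     */
    const bool  should_wrap_around = (initial_plan >= first_partial_plan);

    while (finished[next_plan])
    {
        if (next_plan < nplans - 1)
            next_plan++;                      /* advance to next plan */
        else if (should_wrap_around)
            next_plan = first_partial_plan;   /* wrap to first partial */
        else
            return INVALID_SUBPLAN_INDEX;     /* last plan, no wrap */

        if (next_plan == initial_plan)
            return INVALID_SUBPLAN_INDEX;     /* came full circle */
    }
    return next_plan;
}
```

Because the exit test depends only on where this search began, it does
not matter whether the caller has ever run a subplan, which is the case
that as_whichplan == -1 could not handle.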

Attached is a patch that fixes this issue along the above lines.

>
>> But I am not sure about this : while the workers are at it, why the
>> backend that is waiting for the workers does not come out of the wait
>> state with a SIGINT. I guess the same issue has been discussed in the
>> mail thread that you pointed.
>
> Is it getting stuck here?
>
> /*
> * We can't finish transaction commit or abort until all of the workers
> * have exited. This means, in particular, that we can't respond to
> * interrupts at this stage.
> */
> HOLD_INTERRUPTS();
> WaitForParallelWorkersToExit(pcxt);
> RESUME_INTERRUPTS();

Actually the backend is getting stuck in
choose_next_subplan_for_leader(), in LWLockAcquire(), waiting for the
hanging worker to release pstate->pa_lock. I think there is nothing
wrong with this in itself, because an LWLock wait is assumed to be
very short; it is only because of this bug that the LWLock wait lasts
forever.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
fix_hang_issue.patch application/octet-stream 1.9 KB
