Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, James Coleman <jtc331(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-12-17 23:21:24
Message-ID: 20191217232124.3dtrycatgfm6oxxb@development
Lists: pgsql-bugs

On Wed, Dec 18, 2019 at 11:21:11AM +1300, Thomas Munro wrote:
>On Tue, Dec 17, 2019 at 10:02 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On Mon, Dec 16, 2019 at 03:09:22PM +0100, Tomas Vondra wrote:
>> >On Sat, Dec 14, 2019 at 02:39:12PM +0100, Tomas Vondra wrote:
>> >>On Sat, Dec 14, 2019 at 05:01:05PM +1300, Thomas Munro wrote:
>> >>>On Wed, Dec 11, 2019 at 6:32 AM Tomas Vondra
>> >>><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> >>>>The annoying thing is that it's the workers that crash, and the leader
>> >>>>failed to notice that, so it was waiting in WaitForParallelWorkersToExit
>> >>>>forever. Not sure what the issue is.
>> >>>
>> >>>That's bad. Some edge case in
>> >>>WaitForBackgroundWorkerShutdown()/GetBackgroundWorkerPid() that is
>> >>>b0rked perhaps? Can you reproduce it, and tell us what the stack
>> >>>looks like, maybe on a new thread?
>> >>
>> >>Sure. I'll try reproducing it and get this extra info. I was planning to
>> >>do that, but then I lost the data for the report because of a silly
>> >>mistake. Early next week, hopefully.
>> >>
>> >
>> >I've tried reproducing this, but so far no luck. I've only managed to do
>> >two runs, though, because this particular machine only has space for the
>> >1TB of temp files on spinning rust. I wonder if it's possible to
>> >reproduce the issue with a smaller data set? I've tried using smaller
>> >work_mem, but then it does not trigger the issue :-(
>> >
>>
>> And of course, the very next run after I send the "can't reproduce"
>> message actually reproduces the issue. How typical ...
>
>Thanks for reproducing this.
>
>> 2019-12-16 20:25:00.617 CET [14081] ERROR: invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [14081] STATEMENT: explain analyze select count(*) from sevenb t1 join sevenb t2 using (i);
>> 2019-12-16 20:25:00.617 CET [29417] ERROR: invalid DSA memory alloc request size 1711276032
>> 2019-12-16 20:25:00.617 CET [29417] CONTEXT: parallel worker
>
>Ok, so 14081 raised an error and exited, ...
>
>> 2019-12-16 20:25:00.746 CET [29408] LOG: background worker "parallel worker" (PID 14081) exited with exit code 1
>
>... and then the leader cancelled the query on 5 other workers,
>
>> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14079) exited with exit code 1
>> 2019-12-16 20:25:00.816 CET [29408] LOG: background worker "parallel worker" (PID 14085) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14084) exited with exit code 1
>> 2019-12-16 20:25:00.817 CET [29408] LOG: background worker "parallel worker" (PID 14080) exited with exit code 1
>> 2019-12-16 20:25:00.818 CET [29408] LOG: background worker "parallel worker" (PID 14078) exited with exit code 1
>
>... but somehow there were 2 more workers that didn't get the message
>and are waiting:
>
>> #0 0x00007f64684d2bfb in epoll_wait (epfd=6, events=0x55d8873d53d8, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>> #1 0x000055d8864d4e3c in WaitEventSetWaitBlock (set=0x55d8873d5378, cur_timeout=-1, occurred_events=0x7ffc6e805130, nevents=1) at latch.c:1080
>> #2 0x000055d8864d4d00 in WaitEventSetWait (set=0x55d8873d5378, timeout=-1, occurred_events=0x7ffc6e805130, nevents=1, wait_event_info=134217742) at latch.c:1032
>> #3 0x000055d8864e1848 in ConditionVariableTimedSleep (cv=0x7f63602a060c, timeout=-1, wait_event_info=134217742) at condition_variable.c:186
>> #4 0x000055d8864e17b2 in ConditionVariableSleep (cv=0x7f63602a060c, wait_event_info=134217742) at condition_variable.c:126
>> #5 0x000055d8864d05f2 in BarrierArriveAndWait (barrier=0x7f63602a05f4, wait_event_info=134217742) at barrier.c:191
>> #6 0x000055d886328d4e in ExecParallelHashIncreaseNumBatches (hashtable=0x55d887481e70) at nodeHash.c:1189
>
>They're waiting on a condition variable, in a barrier, while trying to
>coordinate an increase in the number of batches, while hashing. I
>suspect the details of what exactly it's waiting for are not relevant
>here though. ConditionVariableTimedSleep() contains a standard latch
>wait loop with CHECK_FOR_INTERRUPTS(), and parallel workers run with
>the standard SIGTERM handler die(). So what I expect to happen here
>is:
>
>1. DestroyParallelContext() calls TerminateBackgroundWorker() for all workers.
>2. TerminateBackgroundWorker() sets slot->terminate, and signals the
>postmaster.
>3. The postmaster sees the terminate flag and does kill(worker, SIGTERM).
>4. The worker handles SIGTERM by setting ProcDiePending and setting
>its own latch.
>5. ConditionVariableTimedSleep()'s syscall returns,
>CHECK_FOR_INTERRUPTS() runs and sees ProcDiePending.
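
For reference, steps 4 and 5 correspond roughly to this pattern (a
simplified sketch of the usual latch idiom, not the exact source):

    /* step 4: SIGTERM handler, die(), simplified */
    static void
    die(SIGNAL_ARGS)
    {
        InterruptPending = true;
        ProcDiePending = true;
        SetLatch(MyLatch);          /* wake any latch wait */
    }

    /* step 5: standard latch wait loop, simplified */
    for (;;)
    {
        (void) WaitEventSetWait(set, -1, &event, 1, wait_event_info);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();     /* sees ProcDiePending, errors out */

        if (condition_satisfied)
            break;
    }
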
>
>So how did it escape this fate? Ooooh. Now I'm wondering if the
>problem is that ConditionVariablePrepareToSleep() contains a call to
>ResetLatch(), but doesn't then run CHECK_FOR_INTERRUPTS(). Perhaps
>that ate the latch that was set by die(). There isn't another
>CHECK_FOR_INTERRUPTS() until *after* the WaitEventSetWait() in
>ConditionVariableTimedSleep().
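
To spell out the interleaving that theory implies (a sketch, assuming
it's right):

    /* die() runs first: sets ProcDiePending and sets the latch */
    ConditionVariablePrepareToSleep(cv);
        /* -> ResetLatch(MyLatch): eats the latch die() just set,
         *    with no CHECK_FOR_INTERRUPTS() before the wait */
    ConditionVariableSleep(cv, wait_event_info);
        /* -> WaitEventSetWait() blocks forever: nothing is left to
         *    set the latch, and the loop's CHECK_FOR_INTERRUPTS()
         *    only runs after the wait returns */

That would match a worker stuck in the barrier wait even though
ProcDiePending is already set.
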
>
>> I'll leave the system in this "stuck" space in case we need to do
>> further investigation.
>
>What are the values of InterruptHoldoffCount, CritSectionCount,
>InterruptPending and ProcDiePending in the stuck workers?

In both processes it's like this:

(gdb) p InterruptHoldoffCount
$1 = 0
(gdb) p CritSectionCount
$2 = 0
(gdb) p InterruptPending
$3 = 1
(gdb) p ProcDiePending
$5 = 1

So yeah, that probably confirms the theory that a CHECK_FOR_INTERRUPTS()
call is missing right after the latch reset.
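
If so, presumably the fix is either to run CHECK_FOR_INTERRUPTS() right
after that ResetLatch(), or to drop the ResetLatch() from
ConditionVariablePrepareToSleep() entirely and rely on the loop in
ConditionVariableTimedSleep(). Just to sketch the first option (untested,
and the real fix may well end up looking different):

    void
    ConditionVariablePrepareToSleep(ConditionVariable *cv)
    {
        ...
        ResetLatch(MyLatch);

        /*
         * The reset may have just consumed a latch set by die(); make
         * sure a pending interrupt is still serviced before we sleep.
         */
        CHECK_FOR_INTERRUPTS();
    }
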

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
