Re: [sqlsmith] Unpinning error in parallel worker

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Jonathan Rudenberg <jonathan(at)titanous(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Date: 2018-04-24 20:06:43
Message-ID: CAEepm=0Y_nw=kp-YZtnkvhySYxu4PPONWeF4ap=M05g-STgKbg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
<jonathan(at)titanous(dot)com> wrote:
> This issue happened again in production, here are the stack traces from three we grabbed before nuking the >400 hanging backends.
>
> [...]
> #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> [...]

Thank you. These stacks are all blocked trying to acquire
DynamicSharedMemoryControlLock. My theory is that they can't because
one backend -- the one that emitted the error "FATAL: cannot unpin a
segment that is not pinned" -- is deadlocked against itself. After
emitting that error, you can see from Andreas's "seabisquit" stack that
shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
which tries to acquire DynamicSharedMemoryControlLock again, even
though we already hold it at that point.

I'll write a patch to fix that unpleasant symptom. While holding
DynamicSharedMemoryControlLock we shouldn't raise any errors without
releasing it first, because the error handling path will try to
acquire it again. That's a horrible failure mode, as you have
discovered.
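
To make the rule concrete, here is a minimal sketch in plain C. It is
not PostgreSQL code: ToyLock and every toy_* name are hypothetical
stand-ins, and the lock is modelled as a simple flag so that the place
where a real non-reentrant lock would block forever shows up as a
return value instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-reentrant lock (not an LWLock). */
typedef struct { bool held; } ToyLock;

typedef enum {
    UNPIN_OK,                  /* no error was raised */
    UNPIN_ERROR_CLEAN,         /* error raised, exit cleanup completed */
    UNPIN_ERROR_SELF_DEADLOCK  /* error raised, cleanup would block forever */
} UnpinResult;

/* Returns false where a real non-reentrant lock would block. */
static bool toy_try_acquire(ToyLock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

static void toy_release(ToyLock *lock)
{
    assert(lock->held);
    lock->held = false;
}

/* Models the exit-time cleanup path, which needs the same lock. */
static bool toy_exit_cleanup(ToyLock *lock)
{
    if (!toy_try_acquire(lock))
        return false;          /* the caller already holds the lock */
    toy_release(lock);
    return true;
}

/* Models raising a fatal error: cleanup runs immediately afterwards. */
static UnpinResult toy_raise_fatal(ToyLock *lock)
{
    return toy_exit_cleanup(lock) ? UNPIN_ERROR_CLEAN
                                  : UNPIN_ERROR_SELF_DEADLOCK;
}

/* Problematic pattern: the error is raised while the lock is held. */
UnpinResult toy_unpin_unsafe(ToyLock *lock, bool pinned)
{
    if (!toy_try_acquire(lock))
        return UNPIN_ERROR_SELF_DEADLOCK;
    if (!pinned)
        return toy_raise_fatal(lock);   /* lock is still held here */
    toy_release(lock);
    return UNPIN_OK;
}

/* Safer pattern: release the lock first, then raise the error. */
UnpinResult toy_unpin_safe(ToyLock *lock, bool pinned)
{
    if (!toy_try_acquire(lock))
        return UNPIN_ERROR_SELF_DEADLOCK;
    if (!pinned)
    {
        toy_release(lock);
        return toy_raise_fatal(lock);
    }
    toy_release(lock);
    return UNPIN_OK;
}
```

Calling toy_unpin_unsafe() on a segment that is not pinned reports
UNPIN_ERROR_SELF_DEADLOCK, whereas toy_unpin_safe() reports
UNPIN_ERROR_CLEAN and leaves the lock free for other backends.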

But that isn't the root problem: we shouldn't be raising that error,
and I'd love to see the stack of the one process that did that and
then self-deadlocked. I will have another go at trying to reproduce
it here today.

--
Thomas Munro
http://www.enterprisedb.com
