Re: [sqlsmith] Unpinning error in parallel worker

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Date: 2017-03-27 05:53:32
Message-ID: CAEepm=2VvKCVXjg-REHTikoBUOhtfvfCyiJrT=zHDJyvMNd9pw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich(at)gmx(dot)de> wrote:
>> Hi,
>>
>> today's testing with master as of d253b0f6e3 yielded two clusters that
>> stopped processing queries. Symptoms:
>>
>> [...]
>
> Thanks Andreas. Investigating.

First, the hanging stems from reentering dsm_backend_shutdown and
trying to acquire DynamicSharedMemoryControlLock which we already
acquired further down the stack in dsm_unpin_segment when it raised an
error. That's obviously not great, but the real question is how we
reached this this-cannot-happen error condition.

I reproduced this by inserting a sleep before dsa_attach_in_place,
inserting a call to dsa_allocate into ExecInitParallelPlan so that the
executor's DSA area owns at least one segment, and then cancelling a
parallel query before the sleepy worker has managed to attach. The
DSA area is destroyed before the worker attaches, but the worker
doesn't know this, and goes on to destroy it again after it learns
that the query has been cancelled.

In an earlier version of DSA, attaching should have failed in this
scenario because the handle would be invalid. Based on complaints
about creating an extra DSM segment all the time even if we don't turn
out to need it, I implemented "in place" DSA areas where the control
object is in user-supplied shared memory, in this case in the parallel
query main DSM segment. But that created a new hazard: if you try to
attach to a piece of memory that contains the remains of a
already-destroyed DSA area, then we don't do anything to detect that.
Oops.

The attached patch fixes that one way: it detects refcnt == 0 as a
defunct DSA area and raises an error when you try to attach.

Another approach which I could explore would be to "reset" the DSA
area instead of destroying it when the last backend detaches. I'm not
sure if that would ever be useful as a feature in its own right, but
it would at least allow a very late worker to attach and then detach
in an orderly fashion in this query-cancelled case, so you wouldn't
get a different error message in the worker in this rare case.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
detect-late-dsa-attach-in-place.patch application/octet-stream 652 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2017-03-27 06:08:08 Re: Logical decoding on standby
Previous Message Amit Kapila 2017-03-27 05:51:20 Re: [POC] A better way to expand hash indexes.