Re: [sqlsmith] Unpinning error in parallel worker

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Date: 2017-03-29 05:31:41
Message-ID: CAEepm=23=vGz=CgVurPxBGV6SeOJ7YxSaAJKr_aH+f2cV4zrow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 27, 2017 at 6:53 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich(at)gmx(dot)de> wrote:
>>> Hi,
>>>
>>> today's testing with master as of d253b0f6e3 yielded two clusters that
>>> stopped processing queries. Symptoms:
>>>
>>> [...]
>>
>> Thanks Andreas. Investigating.
>
> First, the hanging stems from reentering dsm_backend_shutdown and
> trying to acquire DynamicSharedMemoryControlLock which we already
> acquired further down the stack in dsm_unpin_segment when it raised an
> error. That's obviously not great, but the real question is how we
> reached this this-cannot-happen error condition.
>
> I reproduced this by inserting a sleep before dsa_attach_in_place,
> inserting a call to dsa_allocate into ExecInitParallelPlan so that the
> executor's DSA area owns at least one segment, and then cancelling a
> parallel query before the sleepy worker has managed to attach. The
> DSA area is destroyed before the worker attaches, but the worker
> doesn't know this, and goes on to destroy it again after it learns
> that the query has been cancelled.
>
> In an earlier version of DSA, attaching should have failed in this
> scenario because the handle would be invalid. Based on complaints
> about creating an extra DSM segment all the time even if we don't turn
> out to need it, I implemented "in place" DSA areas where the control
> object is in user-supplied shared memory, in this case in the parallel
> query main DSM segment. But that created a new hazard: if you try to
> attach to a piece of memory that contains the remains of a
> already-destroyed DSA area, then we don't do anything to detect that.
> Oops.
>
> The attached patch fixes that one way: it detects refcnt == 0 as a
> defunct DSA area and raises an error when you try to attach.
>
> Another approach which I could explore would be to "reset" the DSA
> area instead of destroying it when the last backend detaches. I'm not
> sure if that would ever be useful as a feature in its own right, but
> it would at least allow a very late worker to attach and then detach
> in an orderly fashion in this query-cancelled case, so you wouldn't
> get a different error message in the worker in this rare case.
>
> Thoughts?

Added to open items.

I considered whether the error message could be improved but it
matches the message for an existing similar case (where you try to
attach to an unknown handle).

The alternative approach I mentioned above doesn't seem warranted, as
you can already get various different failure messages depending on
timing.

Based on feedback on another thread about how to make reviewers' and
committers' jobs easier, here is a format-patch version with a short
description as raw material for a commit message, in case that is
helpful.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
detect-late-dsa-attach-v2.patch application/octet-stream 1.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2017-03-29 06:20:48 Prologue of set_append_rel_size() and partitioned tables
Previous Message Kyotaro HORIGUCHI 2017-03-29 05:21:12 Re: Multiple false-positive warnings from Valgrind