Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andreas Seltenreich <seltenreich(at)gmx(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date: 2016-06-04 03:13:36
Message-ID: CA+TgmoYtdNMzwiOoAHzFnBPq6iHqurkwDEbhcXtMJh9T-qgihg@mail.gmail.com

On Thu, May 26, 2016 at 5:57 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenreich(at)gmx(dot)de>
> wrote:
>>
>>
>> Each of the sent plans was collected when a worker dumped core due to
>> the failed assertion. More core dumps than plans were actually
>> observed, since with this failed assertion, multiple workers usually
>> trip on and dump core simultaneously.
>>
>> The following query corresponds to plan2:
>>
>> --8<---------------cut here---------------start------------->8---
>> select
>> pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
>> subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
>> (select unique1 from public.bprime limit 1 offset 9) as c5,
>> subq_0.c2 as c6
>> from
>> (select ref_0.tablename as c0, ref_0.inherited as c1,
>> ref_0.histogram_bounds as c2, 100 as c3
>> from
>> pg_catalog.pg_stats as ref_0
>> where 49 is not NULL limit 55) as subq_0
>> where true
>> limit 58;
>> --8<---------------cut here---------------end--------------->8---
>>
>
> I am able to reproduce the assertion you reported upthread with the above
> query (it occurs about once in every two to three runs, but always at the
> same place). It seems to me the issue is that the master backend has
> detached from the tuple queues while the workers are still writing tuples
> into them. The master backend detaches from the tuple queues because of the
> Limit clause: after processing the number of tuples required by the Limit
> clause, it shuts down the nodes in the part of the code below:

I can't reproduce this assertion failure on master. I tried running
'make installcheck' and then running this query repeatedly in the
regression database with and without
parallel_setup_cost=parallel_tuple_cost=0, and got nowhere. Does that
work for you, or do you have some other set of steps?

> I think the workers should stop processing tuples once the tuple queues have
> been detached. This will not only handle the above situation gracefully, but
> will also speed up queries where a Limit clause sits on top of a Gather
> node. A patch for this is attached (it was part of the original parallel
> seq scan patch, but was not applied; as far as I remember, we thought such
> an optimization might not be required for the initial version).
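
To make the idea concrete, here is a rough sketch (not the attached patch) of
a worker-side send loop that stops as soon as the leader has detached from its
tuple queue. produce_next_tuple() is a hypothetical stand-in; in the server
the equivalent logic lives in the executor and tqueue.c:

--8<---------------cut here---------------start------------->8---
#include "postgres.h"
#include "storage/shm_mq.h"

/* Hypothetical helper: returns the next tuple to ship, or NULL when done. */
extern void *produce_next_tuple(Size *len);

static void
send_tuples_until_detached(shm_mq_handle *mqh)
{
	for (;;)
	{
		Size		len;
		void	   *data = produce_next_tuple(&len);

		if (data == NULL)
			break;				/* no more tuples to send */

		/* If the leader has detached, stop doing useless work. */
		if (shm_mq_send(mqh, len, data, false) == SHM_MQ_DETACHED)
			break;
	}
}
--8<---------------cut here---------------end--------------->8---

That is also what makes the Limit-over-Gather case cheaper: once the leader
has the rows it needs and detaches, each worker notices on its next send and
quits instead of scanning to completion.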

This is very likely a good idea, but...

> Another approach to fixing this issue could be to reset mqh_partial_bytes
> and mqh_length_word_complete in shm_mq_sendv in the SHM_MQ_DETACHED case.
> These are currently reset only in case of success.

...I think we should do this too, because it's intended that calling
shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should
again return SHM_MQ_DETACHED, not fail an assertion. Can you see
whether the attached patch fixes this for you?
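
The shape of the fix is roughly the following (only a sketch, not the attached
patch itself; "res" stands for the result of the internal send step inside
shm_mq_sendv):

--8<---------------cut here---------------start------------->8---
	/*
	 * The counterparty has detached: forget any partial-message
	 * bookkeeping before returning, so that a later call simply returns
	 * SHM_MQ_DETACHED again instead of tripping the assertion about a
	 * half-written message.
	 */
	if (res == SHM_MQ_DETACHED)
	{
		mqh->mqh_partial_bytes = 0;
		mqh->mqh_length_word_complete = false;
		return SHM_MQ_DETACHED;
	}
--8<---------------cut here---------------end--------------->8---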

(Status update for Noah: I will provide another update regarding this
issue no later than Monday COB, US time. I assume that Amit will have
responded by then, and it should hopefully be clear what the next step
is at that point.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: dont-fail-mq-assert-v1.patch (binary/octet-stream, 1.8 KB)
