Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date: 2016-05-26 09:57:40
Message-ID: CAA4eK1KOKGqmz9bGu+Z42qhRwMbm4R5rfnqsLCNqFs9j14jzEA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 24, 2016 at 6:36 PM, Andreas Seltenreich <seltenreich(at)gmx(dot)de>
wrote:
>
>
> Each of the sent plans was collected when a worker dumped core due to
> the failed assertion. More core dumps than plans were actually
> observed, since with this failed assertion, multiple workers usually
> trip on and dump core simultaneously.
>
> The following query corresponds to plan2:
>
> --8<---------------cut here---------------start------------->8---
> select
> pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
> subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
> (select unique1 from public.bprime limit 1 offset 9) as c5,
> subq_0.c2 as c6
> from
> (select ref_0.tablename as c0, ref_0.inherited as c1,
> ref_0.histogram_bounds as c2, 100 as c3
> from
> pg_catalog.pg_stats as ref_0
> where 49 is not NULL limit 55) as subq_0
> where true
> limit 58;
> --8<---------------cut here---------------end--------------->8---
>
>
I am able to reproduce the assertion you have reported upthread with the
above query (it occurs once in every two or three runs, but always at the
same place). It seems to me the issue here is that while the workers are
writing tuples into the tuple queue, the master backend has detached from
the queues. The master backend detaches from the tuple queues because of
the Limit clause: after processing the number of tuples specified by the
Limit clause, it shuts down the nodes in the part of the code below:

ExecutePlan()
{
..
    if (TupIsNull(slot))
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }
..
}

Now, it is quite possible that the worker has written part of its data,
after which the queue got detached. The callers of shm_mq
(tqueueReceiveSlot/shm_mq_send) don't handle SHM_MQ_DETACHED, so the
worker keeps sending more data (the next tuple), which leads to the
assertion in the code below:
shm_mq_sendv()
{
..
    /* Write the actual data bytes into the buffer. */
    Assert(mqh->mqh_partial_bytes <= nbytes);
..
}
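
For illustration, below is a minimal sketch of a detach-aware sender.
shm_mq_send() and SHM_MQ_DETACHED are the real shm_mq API (signature as of
9.6); the wrapper function and its use are hypothetical, not the attached
patch:

#include "postgres.h"
#include "access/htup.h"
#include "storage/shm_mq.h"

/* Returns false once the receiver (master) has detached. */
static bool
send_tuple(shm_mq_handle *mqh, HeapTuple tuple)
{
    shm_mq_result res;

    /* Blocking send; reports SHM_MQ_DETACHED if the master is gone. */
    res = shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);

    if (res == SHM_MQ_DETACHED)
        return false;           /* stop producing tuples */

    return true;
}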

I think the workers should stop processing tuples once the tuple queues
have been detached. This will not only handle the above situation
gracefully, but will also speed up queries where a Limit clause sits on
top of a Gather node; see the control-flow sketch below. A patch for the
same is attached to this mail (this was part of the original parallel seq
scan patch, but was not applied; as far as I remember, the reason was that
we thought such an optimization might not be required for the initial
version).
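
To illustrate the intended control flow (this is only a sketch reusing the
hypothetical send_tuple() above, not the attached patch):

/* Worker-side scan loop: quit as soon as the master detaches. */
for (;;)
{
    TupleTableSlot *slot;

    slot = ExecProcNode(planstate);     /* produce the next tuple */
    if (TupIsNull(slot))
        break;                          /* plan is exhausted */

    if (!send_tuple(mqh, ExecMaterializeSlot(slot)))
        break;                          /* master detached; stop early */
}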

Another approach to fix this issue could be to reset mqh_partial_bytes and
mqh_length_word_complete in shm_mq_sendv in case of SHM_MQ_DETACHED; these
are currently reset only in case of success.
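
Roughly along these lines inside shm_mq_sendv(), at the point where the
detach is detected (a sketch only; field names as in shm_mq.c):

if (res == SHM_MQ_DETACHED)
{
    /*
     * Forget any partial write, so that a subsequent call doesn't
     * trip over Assert(mqh->mqh_partial_bytes <= nbytes).
     */
    mqh->mqh_partial_bytes = 0;
    mqh->mqh_length_word_complete = false;
    return SHM_MQ_DETACHED;
}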

I have added this issue to the list of PostgreSQL 9.6 open items at
https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items

The steps to reproduce it on the regression database are:
1. We need enough rows in pg_statistic so that a parallel plan can be
selected.
2. Use the procedure below to create sufficient rows in pg_statistic:
CREATE OR REPLACE FUNCTION populate_pg_stat() RETURNS int AS
$BODY$
BEGIN
    -- Create, populate and analyze many small tables so that
    -- pg_statistic grows enough for a parallel plan on pg_stats.
    FOR i IN 1 .. 50000
    LOOP
        EXECUTE 'CREATE TABLE t' || i || '(c1 int)';
        EXECUTE 'INSERT INTO t' || i || ' VALUES (generate_series(1,10))';
        EXECUTE 'ANALYZE t' || i;
    END LOOP;
    RETURN 1;
END
$BODY$
LANGUAGE plpgsql;
3. Run the procedure once, make parallel plans cheap, and execute the
query:
select populate_pg_stat();
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
Explain Analyze select
pg_catalog.pg_stat_get_bgwriter_requested_checkpoints() as c0,
subq_0.c3 as c1, subq_0.c1 as c2, 31 as c3, 18 as c4,
(select unique1 from public.bprime limit 1 offset 9) as c5,
subq_0.c2 as c6
from
(select ref_0.tablename as c0, ref_0.inherited as c1,
ref_0.histogram_bounds as c2, 100 as c3
from
pg_catalog.pg_stats as ref_0
where 49 is not NULL limit 55) as subq_0
where true
limit 58;

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
stop_processing_tuples_detached_queue_v1.patch application/octet-stream 14.1 KB
