Re: [sqlsmith] Failed assertions on parallel worker shutdown

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andreas Seltenreich <seltenreich(at)gmx(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Failed assertions on parallel worker shutdown
Date: 2016-06-04 06:57:20
Message-ID: CAA4eK1J9hFvF4WyJ9kR1mZK0RcxjZvwvkQgvOPBwP1K0W728zA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jun 4, 2016 at 8:43 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, May 26, 2016 at 5:57 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > I am able to reproduce the assertion (it occurs once in every two to
> > three runs, but always at the same place) that you reported upthread
> > with the above query. It seems to me the issue here is that while the
> > workers are writing tuples into the tuple queue, the master backend
> > has detached from the queues. The reason the master backend has
> > detached from the tuple queues is the Limit clause: after processing
> > the number of tuples specified by the Limit clause, it calls shutdown
> > of the nodes in the below part of code:
>
> I can't reproduce this assertion failure on master. I tried running
> 'make installcheck' and then running this query repeatedly in the
> regression database with and without
> parallel_setup_cost=parallel_tuple_cost=0, and got nowhere. Does that
> work for you, or do you have some other set of steps?
>

I tried it by populating the pg_statistic table after running make
installcheck. The way to populate pg_statistic is to create a lot of
tables and insert a few rows into each, as mentioned at the end of the
mail upthread:
https://www.postgresql.org/message-id/CAA4eK1KOKGqmz9bGu%2BZ42qhRwMbm4R5rfnqsLCNqFs9j14jzEA%40mail.gmail.com

Today I again tried to reproduce it using the same steps, but could not.
This is a timing issue and is difficult to reproduce; last time as well,
I spent quite some time reproducing it. I think we can fix the issue as
per the analysis I did last time, and then let Andreas run his tests to
see whether he still hits it.
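
For reference, the "below part of code" mentioned in the quoted text is
the tuple-count check in ExecutePlan(). This is paraphrased from memory
of 9.6-era src/backend/executor/execMain.c, so the exact code may
differ:

    /*
     * Once the requested number of tuples (e.g. as imposed by a Limit)
     * has been returned, shut the nodes down -- for a parallel plan
     * this detaches the leader from the workers' tuple queues.
     */
    current_tuple_count++;
    if (numberTuples && numberTuples == current_tuple_count)
    {
        /* Allow nodes to release or shut down resources. */
        (void) ExecShutdownNode(planstate);
        break;
    }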

> > I think the workers should stop processing tuples after the tuple
> > queues get detached. This will not only handle the above situation
> > gracefully, but will also speed up queries where a Limit clause sits
> > on top of a Gather node. A patch for the same is attached to this
> > mail (this was part of the original parallel seq scan patch, but was
> > not applied; as far as I remember, the reason was that we thought
> > such an optimization might not be required for the initial version).
>
> This is very likely a good idea, but...
>
> > Another approach to fix this issue could be to reset mqh_partial_bytes
> > and mqh_length_word_complete in shm_mq_sendv in case of
> > SHM_MQ_DETACHED. These are currently reset only in case of success.
>
> ...I think we should do this too, because it's intended that calling
> shm_mq_sendv again after it previously returned SHM_MQ_DETACHED should
> again return SHM_MQ_DETACHED, not fail an assertion.
>
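
To illustrate the first idea concretely, a minimal sketch (hypothetical
code, not necessarily what the attached patch does, and it assumes the
executor honors a false return from receiveSlot, as current sources do)
is for the worker-side tuple receiver in tqueue.c to report the detach
instead of pressing on:

    /*
     * Sketch only; types and signatures paraphrased from tqueue.c.  If
     * the leader has detached from the queue, tell the caller to stop
     * executing the plan rather than keep sending tuples into a dead
     * queue.
     */
    static bool
    tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
    {
        TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
        HeapTuple       tuple = ExecMaterializeSlot(slot);
        shm_mq_result   result;

        result = shm_mq_send(tqueue->queue, tuple->t_len, tuple->t_data,
                             false);
        if (result == SHM_MQ_DETACHED)
            return false;       /* leader is gone; stop producing tuples */
        return true;
    }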

  res = shm_mq_send_bytes(mqh, j, tmpbuf, nowait, &bytes_written);
  mqh->mqh_partial_bytes += bytes_written;
+ if (res == SHM_MQ_DETACHED)
+ {
+     mqh->mqh_partial_bytes = 0;
+     return res;
+ }

In the above change, you first add bytes_written and then do the
SHM_MQ_DETACHED check, whereas in the other place the check is done
first, which seems right. I think it doesn't matter either way, but it
is better to be consistent. Also, isn't it better to set
mqh_length_word_complete to false, so that the next time this API is
called it first tries to write the length word into the buffer?
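
Put concretely, a sketch of the rearrangement I have in mind (untested)
would be:

    res = shm_mq_send_bytes(mqh, j, tmpbuf, nowait, &bytes_written);
    if (res == SHM_MQ_DETACHED)
    {
        /* Reset state so a later call starts over with the length word. */
        mqh->mqh_partial_bytes = 0;
        mqh->mqh_length_word_complete = false;
        return res;
    }
    mqh->mqh_partial_bytes += bytes_written;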

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
