Re: BUG #15821: Parallel Workers with functions and auto_explain: ERROR: could not find key 3 in shm TOC

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ch+pg(at)zeha(dot)at
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15821: Parallel Workers with functions and auto_explain: ERROR: could not find key 3 in shm TOC
Date: 2019-06-03 19:21:01
Message-ID: 17609.1559589661@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> We have enabled auto_explain and see errors on PostgreSQL 11.3 when
> SELECTing from a user defined function. No such crashes have been
> observed on 10.7.

I think that you didn't give a complete dump of relevant settings,
but after some fooling around I was able to reproduce this error,
and the cause is this: auto_explain hasn't a single clue about
parallel query.

1. In the parent process, we have a parallelizable hash join being
executed in a statement inside a function. Since
auto_explain.log_nested_statements is not enabled, auto_explain
does not deem that it should trace the statement, so the query
starts up with estate->es_instrument = 0, and therefore
ExecHashInitializeDSM chooses not to create any SharedHashInfo
area in shared memory.

2. In the worker processes, auto_explain manages to grab execution
control when ParallelQueryMain calls ExecutorStart, thanks to being
in ExecutorStart_hook. Having no clue what's going on, it decides
that this is a new top-level query that it should trace, and it
sets some bits in queryDesc->instrument_options.

3. When the workers get to ExecHashInitializeWorker, they see that
instrumentation is active so they try to look up the SharedHashInfo.
Kaboom.
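
To make the three steps above concrete, here is a toy model in plain C of the mismatch between leader and workers. It is only a sketch: ToyToc and every toy_* function are hypothetical stand-ins invented for illustration, not the PostgreSQL shm_toc API, and the toy lookup returns NULL where the real one raises the reported ERROR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TOC_MAX 8

/* Hypothetical stand-in for a shared-memory table of contents: key -> address. */
typedef struct ToyToc
{
    size_t   nentries;
    uint64_t keys[TOY_TOC_MAX];
    void    *addrs[TOY_TOC_MAX];
} ToyToc;

/* Register an address under a key. */
void
toy_toc_insert(ToyToc *toc, uint64_t key, void *addr)
{
    assert(toc->nentries < TOY_TOC_MAX);
    toc->keys[toc->nentries] = key;
    toc->addrs[toc->nentries] = addr;
    toc->nentries++;
}

/* Return the address stored under key, or NULL if the key is absent. */
void *
toy_toc_lookup(const ToyToc *toc, uint64_t key)
{
    for (size_t i = 0; i < toc->nentries; i++)
    {
        if (toc->keys[i] == key)
            return toc->addrs[i];
    }
    return NULL;
}

/* Step 1: the leader publishes the stats area only if it is instrumenting. */
void
toy_leader_init_dsm(ToyToc *toc, uint64_t node_key,
                    int leader_instrument, void *stats_area)
{
    if (leader_instrument != 0)
        toy_toc_insert(toc, node_key, stats_area);
}

/*
 * Steps 2 and 3: the worker looks up the stats area only if it believes
 * instrumentation is on. Returns false where the real code would fail.
 */
bool
toy_worker_init(const ToyToc *toc, uint64_t node_key, int worker_instrument)
{
    if (worker_instrument == 0)
        return true;            /* nothing to look up */
    return toy_toc_lookup(toc, node_key) != NULL;
}
```

With the leader's instrument flag at 0 and the worker's at nonzero, toy_worker_init() returns false for the key the leader never inserted. Any case where the two sides agree succeeds.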

I'm inclined to think that explain_ExecutorStart should simply
keep its hands off of everything when in a parallel worker;
if instrumentation is required, that'll be indicated by options
passed down from the parent process. It looks like this could
conveniently be merged with the rate-sampling logic by forcing
current_query_sampled to false when IsParallelWorker().
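
A minimal sketch of that idea, written as standalone C so the decision logic can be read in isolation. It is not the auto_explain source and not a patch: SampleInputs, decide_current_query_sampled and apply_instrumentation are hypothetical names, in_parallel_worker stands in for IsParallelWorker(), and the rule that nested statements keep the earlier sampling decision is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the inputs the hook would consult. */
typedef struct SampleInputs
{
    bool   in_parallel_worker;  /* models IsParallelWorker() */
    int    nesting_level;       /* 0 for a top-level statement */
    double sample_rate;         /* fraction of statements to trace, 0..1 */
    double random_draw;         /* caller-supplied value in [0, 1) */
} SampleInputs;

/*
 * Decide whether the current query is sampled. In a parallel worker the
 * answer is always false, so the hook adds nothing of its own there.
 */
bool
decide_current_query_sampled(const SampleInputs *in, bool previous)
{
    assert(in->sample_rate >= 0.0 && in->sample_rate <= 1.0);

    if (in->in_parallel_worker)
        return false;

    /* Assumption: only top-level statements re-roll the sampling decision. */
    if (in->nesting_level == 0)
        return in->random_draw < in->sample_rate;

    return previous;
}

/*
 * Combine the options passed down from the parent with whatever the hook
 * wants to add. When the query is not sampled, the passed-down options
 * come back unchanged.
 */
int
apply_instrumentation(int passed_down, bool sampled, int wanted)
{
    return sampled ? (passed_down | wanted) : passed_down;
}
```

The point of the design is that a worker's instrumentation flags then come only from the parent, so both sides agree on whether the shared instrumentation area exists.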

Likely this should be back-patched all the way to 9.6. I'm
not sure how we managed to avoid noticing it before now,
but there are probably ways to cause visible trouble in
any release that has any parallel query support.

regards, tom lane
