Re: BUG #16213: segfault when running a query

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: matt(dot)jibson(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16213: segfault when running a query
Date: 2020-01-17 04:03:46
Message-ID: 635.1579233826@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> The above query produces an error in the server log:
> LOG: server process (PID 108) was terminated by signal 11: Segmentation
> fault

Yeah, duplicated here. (For anyone following along at home: you
can "create domain string as text", or just s/STRING/TEXT/g in the
given query. That type's not relevant to the problem.) The problem
is probably much more easily reproduced in a debug build, because it
boils down to a dangling-pointer bug. I duplicated it back to 9.4,
and it's probably older than that.

The direct cause of the crash is that by the time we get to ExecutorEnd,
there are dangling pointers in the es_tupleTable list. Those pointers
turn out to originate from ExecInitSubPlan's creation of TupleTableSlots
for the ProjectionInfo objects it creates when doing hashing. And the
reason they're dangling is that the subplan is inside a VALUES list,
and nodeValuesscan.c does this remarkably bletcherous thing:

* Build the expression eval state in the econtext's per-tuple memory.
* This is a tad unusual, but we want to delete the eval state again
* when we move to the next row, to avoid growth of memory
* requirements over a long values list.

It turns out that just below that, there's already some messy hacking
to deal with subplans in the VALUES list. But I guess we'd not hit
the case of a subplan using hashing within VALUES.

The attached draft patch fixes this by not letting ExecInitSubPlan hook
the slots it's making into the outer es_tupleTable list. Ordinarily
that would be bad because it exposes us to leaking resources, if the
slots aren't cleared before ending execution. But nodeSubplan.c is
already being (mostly) careful to clear those slots promptly, so it
doesn't cost us anything much to lose this backstop.

What that change fails to do is guarantee that there are no such
bugs elsewhere. In the attached I made nodeValuesscan.c assert that
nothing has added slots to the es_tupleTable list, but of course
that only catches cases where there's a live bug. Given how long
this case escaped detection, I don't feel like that's quite enough.
(Unsurprisingly, the existing regression tests don't trigger this
assert, even without the nodeSubplan.c fix.)

Another angle I've not run to ground is that the comments for the
existing subplan-related hacking in nodeValuesscan.c claim that
a problematic subplan could only occur in conjunction with LATERAL.
But there's no LATERAL in this example --- are those comments wrong
or obsolete, or just talking about a different case?

I didn't work on making a minimal test case for the regression tests,
either.

Anyway, thanks for the report!

regards, tom lane

Attachment Content-Type Size
fix-for-hashed-subplan-in-VALUES-1.patch text/x-diff 5.4 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-01-17 04:23:56 BUG #16214: Settings is not copy
Previous Message Dilip Kumar 2020-01-17 03:13:28 Re: Reorderbuffer crash during recovery

Browse pgsql-hackers by date

  From Date Subject
Next Message Asim R P 2020-01-17 04:04:05 Unnecessary delay in streaming replication due to replay lag
Previous Message Dilip Kumar 2020-01-17 03:13:28 Re: Reorderbuffer crash during recovery