Re: aggregate crash

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: aggregate crash
Date: 2020-01-16 03:16:30
Message-ID: 20200116031630.7ip7lid23tpyhna2@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-01-15 12:47:47 -0800, Andres Freund wrote:
> FWIW, I'm working on narrowing it down to something small. I can
> reliably trigger the bug, and I understand the mechanics, I
> think. Interestingly enough the reproducer currently only triggers on
> v12, not on v11 and before.

That's just happenstance due to allocation changes in plpgsql,
though. The attached small reproducer, for me, reliably triggers crashes
on 10 - master.

It's hard to hit intentionally, because plpgsql does a datumCopy() to
its non-null return value, which means that to hit the bug, one needs
different numbers of allocations between setting up the transition value
with transvalueisnull = true, transvalue = 0xsomepointer (because
plpgsql doesn't copy NULLs), and the transition output with
transvalueisnull = false, transvalue = 0xsomepointer. Which is necessary
to trigger the bug, as it's then not reparented into a long lived enough
context. To be then freed/accessed for the next group input value.

I think this is too finnicky to actually keep as a regression test.

The bug, in a way, exists all the way back, but it's a bit harder to
create NULL values where the datum component isn't 0.

To fix I suggest we, in all branches, do the equivalent of adding
something like:
diff --git i/src/backend/executor/execExprInterp.c w/src/backend/executor/execExprInterp.c
index 790380051be..3260a63ac6b 100644
--- i/src/backend/executor/execExprInterp.c
+++ w/src/backend/executor/execExprInterp.c
@@ -4199,6 +4199,12 @@ ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
pertrans->transtypeByVal,
pertrans->transtypeLen);
}
+ else
+ {
+ /* ensure datum component is 0 for NULL transition values */
+ newValue = (Datum) 0;
+ }
+
if (!oldValueIsNull)
{
if (DatumIsReadWriteExpandedObject(oldValue,

and a comment explaining why it's (now) safe to rely on datum
comparisons for
if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))

I don't think it makes sense to add something like it to the byval case
- there's plenty other ways a function returning != 0 with
fcinfo->isnull == true can cause such values to exist. And that's
longstanding.

A separate question is whether it's worth adding code to
e.g. EEO_CASE(EEOP_FUNCEXPR_STRICT) also resetting *op->resvalue to
(Datum) 0. I don't personally don't think ensuring the datum is always
0 when isnull true is all that helpful, if we can't guarantee it
everywhere. So I'm a bit loathe to add cycles to places that don't need
it, and are hot.

Regards,

Andres

Attachment Content-Type Size
xrepro.sql application/x-sql 1.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-01-16 03:22:33 Re: Setting min/max TLS protocol in clientside libpq
Previous Message Thomas Munro 2020-01-16 02:59:55 Re: pgindent && weirdness