Re: Failure on sittella

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Failure on sittella
Date: 2017-01-23 13:59:35
Message-ID: 17555.1485179975@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> There's a buildfarm failure (just on sittella so far) that I can't quite
> interpret from here. Timing wise it looks like it might be correlated
> with ea15e18677fc2eff3135023e27f69ed8821554ed - "Remove obsoleted code
> relating to targetlist SRF evaluation." but I'm not quite seeing the
> connection.
> Any chance you could get a backtrace for the failure?

Andrew's been investigating this for me off-list. The backtrace looks
like

#0 0x00007f7646ab7425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f7646abab8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000000000824337 in ExceptionalCondition
(conditionName=conditionName(at)entry=0x9a3f08 "!(((list) == ((List *)
((void *)0)) || ((((const Node*)((list)))->type) == T_OidList)))",
errorType=errorType(at)entry=0x86bb8b "FailedAssertion",
fileName=fileName(at)entry=0x9caa1c "list.c",
lineNumber=lineNumber(at)entry=509) at assert.c:54
#3 0x000000000063d06d in list_member_oid (list=<optimized out>,
datum=<optimized out>) at list.c:509
#4 list_member_oid (list=0x2486758, datum=datum(at)entry=23) at list.c:505
#5 0x000000000081ff0c in get_transform_fromsql (typid=23, langid=0,
trftypes=<optimized out>) at lsyscache.c:1844
#6 0x00007f7645f61fda in plperl_hash_from_tuple (tuple=0x2333a50,
tupdesc=0x2333938) at plperl.c:2993
#7 0x00007f7645f64415 in plperl_spi_execute_fetch_result
(tuptable=0x23334c8, processed=<optimized out>, status=<optimized out>)
at plperl.c:3120
#8 0x00007f7645f69c33 in plperl_spi_exec_prepared
(query=query(at)entry=0x234a0e0 "0x23a9468", attr=attr(at)entry=0x0,
argc=argc(at)entry=0, argv=argv(at)entry=0x234a0c0) at plperl.c:3633
#9 0x00007f7645f6b998 in XS__spi_exec_prepared (my_perl=<optimized
out>, cv=<optimized out>) at SPI.xs:141
#10 0x00007f7645c944ec in Perl_pp_entersub () from /usr/lib/libperl.so.5.14
#11 0x00007f7645c8bad6 in Perl_runops_standard () from
/usr/lib/libperl.so.5.14
#12 0x00007f7645c2780b in Perl_call_sv () from /usr/lib/libperl.so.5.14
#13 0x00007f7645f62980 in plperl_call_perl_func
(desc=desc(at)entry=0x2324f38, fcinfo=fcinfo(at)entry=0x2318b90) at plperl.c:2161
#14 0x00007f7645f67404 in plperl_func_handler (fcinfo=0x2318b90) at
plperl.c:2364
#15 plperl_call_handler (fcinfo=0x2318b90) at plperl.c:1785
#16 0x00000000005ff82e in ExecMakeFunctionResultNoSets
(fcache=0x2318b20, econtext=0x23188f8, isNull=0x23194d0 "`~1\002") at
execQual.c:1865

Evidently, this bit in plperl_hash_from_tuple:

else if ((funcid = get_transform_fromsql(tupdesc->attrs[i]->atttypid, current_call_data->prodesc->lang_oid, current_call_data->prodesc->trftypes)))

is passing garbage to get_transform_fromsql --- not only is the passed
trftypes not NIL, which it surely should be for a function with no
transform clauses, but the passed langid is zero which it surely should
not be.

It seems unlikely that the plperl function's prodesc has been clobbered,
since those are basically read-only once created. I guessed that the true
explanation is that current_call_data is pointing at garbage. The only
way that could possibly happen in this specific command (which lacks any
inline plperl DO block) is if the inner invocation of "just_die" fails to
correctly restore the value belonging to the outer invocation of
"indirect_die_caller" while on its way out through plperl_call_handler's
PG_CATCH block.

Guessing further, I asked Andrew to plaster "volatile" on
plperl_call_handler's save_call_data variable, and oldinterp for
good measure --- and he reports that that fixed it.

I believe that what we are looking at here is a gcc bug, probably the
same as or closely related to the issue in their PR56982:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56982

This also feels a heck of a lot like the TRY/CATCH issues I remember
seeing at Salesforce. What seems to happen is that the compiler spills
the local variable (ie, save_call_data) into a stack location before
reaching setjmp(), but then within the TRY block it sucks the local
variable back into a register and then re-uses the stack location for
something else. The eventual longjmp restores all the registers to
the way they were at setjmp, but it knows nothing of stack temporaries
so those don't get restored. Adding "volatile" fixes it by forcing
the variable to be given a dedicated stack location.

The discussion in PR56982 indicates that this bug affects all 4.7.x and
4.8.x releases of gcc and was only properly fixed in 4.9.0.

I don't have a clear explanation of why ea15e1867 seems to have triggered
it; that shouldn't really have changed any code in plperl_call_handler,
AFAICS. But compiler bugs are strange beasts.

The whole thing is scary as heck, but there's not much we can do about it
--- I don't really want to stick "volatile" on every variable that's
used in a PG_CATCH. We should, however, be prepared to do that at the
first sign of trouble.

I'm going to go volatile-ize those two variables in plperl_call_handler,
and maybe also the ones in plperl_inline_handler for luck.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-01-23 14:03:14 Re: Failure in commit_ts tap tests
Previous Message Amit Kapila 2017-01-23 13:38:24 Re: Checksums by default?