Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, mark(at)2ndQuadrant(dot)com
Subject: Re: Why is infinite_recurse test suddenly failing?
Date: 2019-05-10 15:38:57
Message-ID: 27079.1557502737@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We just got another one of these, on yet another member of Mark's
ppc64 armada:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2019-05-10%2014%3A04%3A34

Now we have a stack trace (thanks Mark!), but it is pretty unsurprising:

Core was generated by `postgres: debian regression [local] SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
2748 malloc.c: No such file or directory.
#0 sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
#1 0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
#2 0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
#3 0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
#4 0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
#5 simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
#6 0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
#7 0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
#8 0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babe9c0, context=0x3fffe3da15d0) at clauses.c:3606
#9 0x00000000103510c8 in expression_tree_mutator (node=<optimized out>, mutator=<optimized out>, context=<optimized out>) at nodeFuncs.c:2942
#10 0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babea40, context=0x3fffe3da15d0) at clauses.c:3606
#11 0x00000000103d2ae8 in eval_const_expressions (root=<optimized out>, node=<optimized out>) at clauses.c:2266
#12 0x00000000103b6264 in preprocess_expression (root=0x1000babee18, expr=<optimized out>, kind=1) at planner.c:1087
#13 0x00000000103b496c in subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>) at planner.c:769
#14 0x00000000103b3c58 in standard_planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:406
#15 0x00000000103b3a68 in planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:275
#16 0x00000000104cc2cc in pg_plan_query (querytree=0x1000babe7f8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x00000000102ef850 in init_execution_state (lazyEvalOK=<optimized out>, queryTree_list=<optimized out>, fcache=<optimized out>) at functions.c:507
#18 init_sql_fcache (finfo=<optimized out>, collation=<optimized out>, lazyEvalOK=<optimized out>) at functions.c:770
#19 fmgr_sql (fcinfo=<optimized out>) at functions.c:1053
#20 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#21 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at execExprInterp.c:1769
#22 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#23 ExecProject (projInfo=0x1000bab41e0) at ../../../src/include/executor/executor.h:341
#24 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#25 0x00000000102e319c in ExecProcNodeFirst (node=0x1000bab3dc0) at execProcnode.c:445
#26 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#27 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#28 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#29 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#30 0x00000000102efe84 in postquel_getnext (es=<optimized out>, fcache=<optimized out>) at functions.c:867
#31 fmgr_sql (fcinfo=<optimized out>) at functions.c:1153
#32 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#33 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at execExprInterp.c:1769
#34 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#35 ExecProject (projInfo=0x1000baa8150) at ../../../src/include/executor/executor.h:341
#36 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#37 0x00000000102e319c in ExecProcNodeFirst (node=0x1000baa7d30) at execProcnode.c:445

... lots and lots of repetitions ...

#11809 0x00000000102e319c in ExecProcNodeFirst (node=0x10008c01e90) at execProcnode.c:445
#11810 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#11811 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#11812 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#11813 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#11814 0x00000000104d39ec in PortalRunSelect (portal=0x10008be9de8, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#11815 0x00000000104d34c0 in PortalRun (portal=0x10008be9de8, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=<optimized out>, altdest=<optimized out>, completionTag=0x3fffe3ecd6c0 "") at pquery.c:770
#11816 0x00000000104d1bc4 in exec_simple_query (query_string=<optimized out>) at postgres.c:1215
#11817 0x00000000104ced50 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4249
#11818 0x00000000104110fc in BackendRun (port=<optimized out>) at postmaster.c:4430
#11819 BackendStartup (port=<optimized out>) at postmaster.c:4121
#11820 ServerLoop () at postmaster.c:1704
#11821 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1377
#11822 0x000000001034def4 in main (argc=8, argv=0x10008b7efb0) at main.c:228

So that lets out any theory that somehow we're getting into a weird
control path that misses calling check_stack_depth;
expression_tree_mutator does so for one, and it was called just nine
stack frames down from the crash.

Casting about for possible explanations, I notice that the failure
seems to have occurred at a nesting depth of 982 SQL-function calls
((11809 - 25)/12). I'd previously scraped the buildfarm database
to find out what nesting depths we normally get to before erroring
out, by counting the number of "infinite_recurse" context messages
in the postmaster log. Here's the entry for shoveler:

shoveler | 2019-05-03 14:19:19 | 1674

So this failed at substantially less stack depth than it's successfully
consumed in other runs, and *very* substantially less stack than ought
to be there, considering we pay attention to getrlimit in setting
max_stack_depth and add quite a generous amount of slop too.

I am wondering if, somehow, the stack depth limit seen by the postmaster
sometimes doesn't apply to its children. That would be pretty wacko
kernel behavior, especially if it's only intermittently true.
But we're running out of other explanations.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2019-05-10 15:58:03 Re: Bug in reindexdb's error reporting
Previous Message Alvaro Herrera 2019-05-10 15:33:23 Re: Bug in reindexdb's error reporting