[parallel query] random server crash while running tpc-h query on power2

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: [parallel query] random server crash while running tpc-h query on power2
Date: 2016-08-13 05:40:32
Message-ID: CAGPqQf3wioMD+xP9QFqZYWEGjYMHSJbec4Rwv8ML0NOzSWOVeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

Recently while running tpc-h queries on postgresql master branch, I am
noticed
random server crash. Most of the time server crash coming while turn tpch
query
number 3 - (but its very random).

Call Stack of server crash:

(gdb) bt
#0 0x00000000102aa9ac in ExplainNode (planstate=0x1001a7471d8,
ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0,
es=0x1001a65ad48)
at explain.c:1516
#1 0x00000000102aaeb8 in ExplainNode (planstate=0x1001a746b50,
ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0,
es=0x1001a65ad48)
at explain.c:1599
#2 0x00000000102aaeb8 in ExplainNode (planstate=0x1001a7468b8,
ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0,
es=0x1001a65ad48)
at explain.c:1599
#3 0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745e48,
ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0,
es=0x1001a65ad48)
at explain.c:1599
#4 0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745bb0,
ancestors=0x1001a745348, relationship=0x10913d10 "Outer", plan_name=0x0,
es=0x1001a65ad48)
at explain.c:1599
#5 0x00000000102aaeb8 in ExplainNode (planstate=0x1001a745870,
ancestors=0x1001a745348, relationship=0x0, plan_name=0x0, es=0x1001a65ad48)
at explain.c:1599
#6 0x00000000102a81d8 in ExplainPrintPlan (es=0x1001a65ad48,
queryDesc=0x1001a7442a0) at explain.c:599
#7 0x00000000102a7e20 in ExplainOnePlan (plannedstmt=0x1001a744208,
into=0x0, es=0x1001a65ad48,
queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose)
select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as
revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment
= 'BUILD"..., params=0x0, planduration=0x3fffe17c6488) at explain.c:515
#8 0x00000000102a7968 in ExplainOneQuery (query=0x1001a65ab98, into=0x0,
es=0x1001a65ad48,
queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose)
select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as
revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment
= 'BUILD"..., params=0x0) at explain.c:357
#9 0x00000000102a74b8 in ExplainQuery (stmt=0x1001a6fe468,
queryString=0x1001a6a8c08 "explain (analyze,buffers,verbose)
select\n\tl_orderkey,\n\tsum(l_extendedprice * (1 - l_discount)) as
revenue,\n\to_orderdate,\n\to_shippriority\nfrom\n\tcustomer,\n\torders,\n\tlineitem\nwhere\n\tc_mktsegment
= 'BUILD"..., params=0x0, dest=0x1001a65acb0) at explain.c:245

(gdb) p *w
$2 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p planstate
$3 = (PlanState *) 0x1001a7471d8
(gdb) p *planstate
$4 = {type = T_HashJoinState, plan = 0x1001a740730, state = 0x1001a745758,
instrument = 0x1001a7514a8, worker_instrument = 0x1001af8a8c8,
targetlist = 0x1001a747448, qual = 0x0, lefttree = 0x1001a74a0e8,
righttree = 0x1001a74f2f8, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0,
ps_ResultTupleSlot = 0x1001a750c10, ps_ExprContext = 0x1001a7472f0,
ps_ProjInfo = 0x1001a751220, ps_TupFromTlist = 0 '\000'}
(gdb) p *planstate->worker_instrument
$5 = {num_workers = 2139062143, instrument = 0x1001af8a8d0}
(gdb) p n
$6 = 13055
(gdb) p n
$7 = 13055

Here its clear that work_instrument is either corrupted or Un-inililized
that is the
reason its ending up with server crash.

With bit more debugging and looked at git history I found that issue
started coming
with commit af33039317ddc4a0e38a02e2255c2bf453115fd2. gather_readnext()
calls
ExecShutdownGatherWorkers() when nreaders == 0. ExecShutdownGatherWorkers()
calls ExecParallelFinish() which collects the instrumentation before
marking
ParallelExecutorInfo to finish. ExecParallelRetrieveInstrumentation() do
the allocation
of planstate->worker_instrument.

With commit af33039317 now we calling the gather_readnext() with per-tuple
context,
but with nreader == 0 with ExecShutdownGatherWorkers() we end up with
allocation
of planstate->worker_instrument into per-tuple context - which is wrong.

Now fix can be:

1) Avoid calling ExecShutdownGatherWorkers() from the gather_readnext() and
let
ExecEndGather() do that things. But with this change, gather_readread() and
gather_getnext() depend on planstate->reader structure to continue reading
tuple.
Now either we can change those condition to be depend on
planstate->nreaders or
just pfree(planstate->reader) into gather_readnext() instead of calling
ExecShutdownGatherWorkers().

2) Teach ExecParallelRetrieveInstrumentation() to do allocation of
planstate->worker_instrument into proper memory context.

Attaching patch, which fix the issue with approach 1).

Regards.

Rushabh Lathia
www.EnterpriseDB.com

Attachment Content-Type Size
explain_server_crash_parallel.patch text/x-patch 520 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-08-13 08:36:05 Re: [parallel query] random server crash while running tpc-h query on power2
Previous Message Thomas Munro 2016-08-13 01:22:03 Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().