Cancelling parallel query leads to segfault

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Cancelling parallel query leads to segfault
Date: 2018-01-28 03:45:31
Message-ID: 20180128034531.h6o4w3727ifof3jy@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter (and others who mucked around with related code),

While testing another patch I found that cancelling a parallel query on
master segfaults the leader in an interesting manner:

#0 0x00005648721cb361 in tas (lock=0x7fbd8e025360 <error: Cannot access memory at address 0x7fbd8e025360>) at /home/andres/src/postgresql/src/include/storage/s_lock.h:228
#1 0x00005648721cc2d6 in shm_mq_detach_internal (mq=0x7fbd8e025360) at /home/andres/src/postgresql/src/backend/storage/ipc/shm_mq.c:812
#2 0x00005648721cc25c in shm_mq_detach (mqh=0x564874956bb0) at /home/andres/src/postgresql/src/backend/storage/ipc/shm_mq.c:778
#3 0x000056487201e8fc in ExecParallelFinish (pei=0x5648749514f8) at /home/andres/src/postgresql/src/backend/executor/execParallel.c:1011
#4 0x000056487203706c in ExecShutdownGatherWorkers (node=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:383
#5 0x00005648720370b9 in ExecShutdownGather (node=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:400
#6 0x0000564872036c84 in ExecEndGather (node=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:241
#7 0x0000564872020eff in ExecEndNode (node=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:609
#8 0x00005648720522f5 in ExecEndSort (node=0x564874906eb0) at /home/andres/src/postgresql/src/backend/executor/nodeSort.c:260
#9 0x0000564872021053 in ExecEndNode (node=0x564874906eb0) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:695
#10 0x000056487203193e in ExecEndAgg (node=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:3329
#11 0x0000564872021075 in ExecEndNode (node=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:703
#12 0x000056487201a102 in ExecEndPlan (planstate=0x564874906788, estate=0x564874906538) at /home/andres/src/postgresql/src/backend/executor/execMain.c:1604
#13 0x000056487201823b in standard_ExecutorEnd (queryDesc=0x56487485e808) at /home/andres/src/postgresql/src/backend/executor/execMain.c:493
#14 0x0000564872018167 in ExecutorEnd (queryDesc=0x56487485e808) at /home/andres/src/postgresql/src/backend/executor/execMain.c:464
#15 0x0000564871fb136e in PortalCleanup (portal=0x5648748a7b48) at /home/andres/src/postgresql/src/backend/commands/portalcmds.c:301
#16 0x00005648723a2148 in AtAbort_Portals () at /home/andres/src/postgresql/src/backend/utils/mmgr/portalmem.c:781
#17 0x0000564871e84168 in AbortTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:2540
#18 0x0000564871e865e2 in AbortOutOfAnyTransaction () at /home/andres/src/postgresql/src/backend/access/transam/xact.c:4387
#19 0x0000564872378c40 in ShutdownPostgres (code=1, arg=0) at /home/andres/src/postgresql/src/backend/utils/init/postinit.c:1156
#20 0x00005648721c2a63 in shmem_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:228
#21 0x00005648721c293a in proc_exit_prepare (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:185
#22 0x00005648721c28a2 in proc_exit (code=1) at /home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:102
#23 0x00005648723638cd in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:543
#24 0x00005648721f5f54 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2921
#25 0x0000564872022630 in ExecScanFetch (node=0x564874907c18, accessMtd=0x564872050b32 <SeqNext>, recheckMtd=0x564872050bfd <SeqRecheck>) at /home/andres/src/postgresql/src/backend/executor/execScan.c:41
#26 0x0000564872022840 in ExecScan (node=0x564874907c18, accessMtd=0x564872050b32 <SeqNext>, recheckMtd=0x564872050bfd <SeqRecheck>) at /home/andres/src/postgresql/src/backend/executor/execScan.c:162
#27 0x0000564872050c4b in ExecSeqScan (pstate=0x564874907c18) at /home/andres/src/postgresql/src/backend/executor/nodeSeqscan.c:130
#28 0x000056487202c115 in ExecProcNode (node=0x564874907c18) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#29 0x000056487202c5af in fetch_input_tuple (aggstate=0x5648749074a0) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:406
#30 0x000056487202eb03 in agg_fill_hash_table (aggstate=0x5648749074a0) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:1915
#31 0x000056487202e38c in ExecAgg (pstate=0x5648749074a0) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:1534
#32 0x0000564872020c76 in ExecProcNodeFirst (node=0x5648749074a0) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#33 0x00005648720366fc in ExecProcNode (node=0x5648749074a0) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#34 0x0000564872036d83 in gather_getnext (gatherstate=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:285
#35 0x0000564872036c0d in ExecGather (pstate=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/nodeGather.c:216
#36 0x0000564872020c76 in ExecProcNodeFirst (node=0x564874907148) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#37 0x0000564872051ebf in ExecProcNode (node=0x564874907148) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#38 0x0000564872051ffd in ExecSort (pstate=0x564874906eb0) at /home/andres/src/postgresql/src/backend/executor/nodeSort.c:107
#39 0x0000564872020c76 in ExecProcNodeFirst (node=0x564874906eb0) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#40 0x000056487202c115 in ExecProcNode (node=0x564874906eb0) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#41 0x000056487202c5af in fetch_input_tuple (aggstate=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:406
#42 0x000056487202e7c9 in agg_retrieve_direct (aggstate=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:1729
#43 0x000056487202e3aa in ExecAgg (pstate=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/nodeAgg.c:1541
#44 0x0000564872020c76 in ExecProcNodeFirst (node=0x564874906788) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:446
#45 0x000056487201793f in ExecProcNode (node=0x564874906788) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#46 0x000056487201a2dd in ExecutePlan (estate=0x564874906538, planstate=0x564874906788, use_parallel_mode=1 '\001', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x56487491f0d8,
execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:1718
#47 0x0000564872017f2b in standard_ExecutorRun (queryDesc=0x56487485e808, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:361
#48 0x0000564872017d45 in ExecutorRun (queryDesc=0x56487485e808, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at /home/andres/src/postgresql/src/backend/executor/execMain.c:304
#49 0x00005648721f98d2 in PortalRunSelect (portal=0x5648748a7b48, forward=1 '\001', count=0, dest=0x56487491f0d8) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:932
#50 0x00005648721f9565 in PortalRun (portal=0x5648748a7b48, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x56487491f0d8, altdest=0x56487491f0d8, completionTag=0x7ffd4d0a73b0 "")
at /home/andres/src/postgresql/src/backend/tcop/pquery.c:773
#51 0x00005648721f3213 in exec_simple_query (
query_string=0x56487483e058 "SELECT\n\tl_returnflag,\n\tl_linestatus,\n\tsum(l_quantity) AS sum_qty,\n\tsum(l_extendedprice) AS sum_base_price,\n\tsum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,\n\tsum(l_extendedprice * (1 - l_dis"...) at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1120
#52 0x00005648721f7795 in PostgresMain (argc=1, argv=0x56487486e1c8, dbname=0x56487486e030 "tpch_5", username=0x56487483a808 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4143
#53 0x00005648721528f1 in BackendRun (port=0x564874862e50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4412
#54 0x0000564872151ff0 in BackendStartup (port=0x564874862e50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4084
#55 0x000056487214e2c5 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1757
#56 0x000056487214d84e in PostmasterMain (argc=41, argv=0x5648748380d0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1365
#57 0x00005648720811c5 in main (argc=41, argv=0x5648748380d0) at /home/andres/src/postgresql/src/backend/main/main.c

It's clearly not OK to recurse back into the executing query after a
failure, even less so if resources like the underlying DSM segment have
already been freed.

I suspect this is largely the fault of

commit 8561e4840c81f7e345be2df170839846814fa004
Author: Peter Eisentraut <peter_e(at)gmx(dot)net>
Date: 2018-01-22 08:30:16 -0500

Transaction control in PL procedures

which has the following hunk:

@@ -760,17 +757,6 @@ AtAbort_Portals(void)
{
Portal portal = hentry->portal;

- /*
- * See similar code in AtSubAbort_Portals(). This would fire if code
- * orchestrating multiple top-level transactions within a portal, such
- * as VACUUM, caught errors and continued under the same portal with a
- * fresh transaction. No part of core PostgreSQL functions that way.
- * XXX Such code would wish the portal to remain ACTIVE, as in
- * PreCommit_Portals().
- */
- if (portal->status == PORTAL_ACTIVE)
- MarkPortalFailed(portal);
-

which'll then, because we're not setting the portal to FAILED anymore,
trigger a full blown ExecutorFinish()/End() which is a complete no-go:

if (portal->status != PORTAL_FAILED)
{
ResourceOwner saveResourceOwner;

/* We must make the portal's resource owner current */
saveResourceOwner = CurrentResourceOwner;
if (portal->resowner)
CurrentResourceOwner = portal->resowner;

ExecutorFinish(queryDesc);
ExecutorEnd(queryDesc);
FreeQueryDesc(queryDesc);

CurrentResourceOwner = saveResourceOwner;
}

I think the comment was bad, but the functionality pretty crucial. So I
don't think
In AtAbort_Portals(), remove the code that marks an active portal as
failed. As the comment there already predicted, this doesn't work if
the running command wants to keep running after transaction abort. And
it's actually not necessary, because pquery.c is careful to run all
portal code in a PG_TRY block and explicitly runs MarkPortalFailed() if
there is an exception. So the code in AtAbort_Portals() is never used
anyway.
holds true, because FATAL errors do not follow the sigsetjmp chain.

I'm uncomfortable with the idea, but without further study, it does seem
like you might be able to largely rescue the idea by checking
proc_exit_in_progress or similar.

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2018-01-28 04:06:05 Re: Reorder C includes in partition.c
Previous Message Amit Kapila 2018-01-28 01:58:52 Re: [HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index