Re: asynchronous execution

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, robertmhaas(at)gmail(dot)com
Cc: amitdkhan(dot)pg(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: asynchronous execution
Date: 2017-02-07 04:28:42
Message-ID: 9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0 0x0000003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003969633e05 in *__GI_abort () at abort.c:92
#2 0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",
lineNumber=345) at assert.c:54
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4 0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5 0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6 0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7 0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8 0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9 0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem
to be pfreed:

(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
nevents = 2139062143,
nevents_space = 2139062143,
resowner = 0x7f7f7f7f7f7f7f7f,
events = 0x7f7f7f7f7f7f7f7f,
latch = 0x7f7f7f7f7f7f7f7f,
latch_pos = 2139062143,
epoll_fd = 2139062143,
epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2017-02-07 04:43:35 Re: Parallel tuplesort (for parallel B-Tree index creation)
Previous Message Josh Soref 2017-02-07 03:53:43 Re: Possible spelling fixes