Re: asynchronous execution

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp
Cc: robertmhaas(at)gmail(dot)com, amitdkhan(dot)pg(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: asynchronous execution
Date: 2017-02-16 12:06:00
Message-ID: 20170216.210600.214980879.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df580(at)lab(dot)ntt(dot)co(dot)jp>
> Horiguchi-san,
>
> On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
>
> With the latest set of patches, I observe a crash due to an Assert failure:
>
> #3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> timeout=-1) at execAsync.c:345

This means no pending fdw scan didn't let itself go to waiting
stage. It leads to a stuck of the whole things. This is caused if
no one acutually is waiting for result. I suppose that all of the
foreign scans ran on the same connection. Anyway it should be a
mistake in state transition. I'll look into it.

> I was running a query whose plan looked like:
>
> explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> group by 1,2 order by 1;
> QUERY PLAN
> ------------------------------------------------------
> Sort
> Sort Key: ((ptab.tableoid)::regclass)
> -> HashAggregate
> Group Key: (ptab.tableoid)::regclass, ptab.a
> -> Result
> -> Append
> -> Foreign Scan on ptab_00001
> -> Foreign Scan on ptab_00002
> -> Foreign Scan on ptab_00003
> -> Foreign Scan on ptab_00004
> -> Foreign Scan on ptab_00005
> -> Foreign Scan on ptab_00006
> -> Foreign Scan on ptab_00007
> -> Foreign Scan on ptab_00008
> -> Foreign Scan on ptab_00009
> -> Foreign Scan on ptab_00010
> <snip>
>
> The snipped part contains Foreign Scans on 90 more foreign partitions (in
> fact, I could see the crash even with 10 foreign table partitions for the
> same query).

Yeah, it seems to me unrelated to how many they are.

> There is a crash in one more case, which seems related to how WaitEventSet
> objects are manipulated during resource-owner-mediated cleanup of a failed
> query, such as after the FDW returned an error like below:
>
> ERROR: relation "public.ptab_00010" does not exist
> CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010
>
> The backtrace in this looks like below:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> 301 lastidx = resarr->lastidx;
> (gdb)
> (gdb) bt
> #0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> #1 0x00000000009c6578 in ResourceOwnerForgetWES
> (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
> #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> #3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
> at resowner.c:566
> #4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
> '\001') at resowner.c:485
> #5 0x0000000000524172 in AbortTransaction () at xact.c:2588
> #6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
> #7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
> dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
> #8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
> #9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
> #10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
> #11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
> postmaster.c:1330
> #12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
>
> There is a segfault when accessing the events variable, whose members seem
> to be pfreed:
>
> (gdb) f 2
> #2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> 600 ResourceOwnerForgetWES(set->resowner, set);
> (gdb) p *set
> $5 = {
> nevents = 2139062143,
> nevents_space = 2139062143,
> resowner = 0x7f7f7f7f7f7f7f7f,
> events = 0x7f7f7f7f7f7f7f7f,
> latch = 0x7f7f7f7f7f7f7f7f,
> latch_pos = 2139062143,
> epoll_fd = 2139062143,
> epoll_ret_events = 0x7f7f7f7f7f7f7f7f
> }

Mmm, I reproduces it quite easily. A silly bug.

Something bad is happening between freeing ExecutorState memory
context and resource owner. Perhaps the ExecutorState is freed by
resowner (as a part of its anscestors) before the memory for the
WaitEventSet is freed. It was careless of me. I'll reconsider it.

Great thanks for the report.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rafia Sabih 2017-02-16 12:17:57 Re: Passing query string to workers
Previous Message Simon Riggs 2017-02-16 11:45:37 Re: Measuring replay lag