Re: parallel append vs. simple UNION ALL

From: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: parallel append vs. simple UNION ALL
Date: 2018-03-07 10:36:02
Message-ID: CAKcux6knXga2PHgL2Apr3jQ1ooe=4Thvv3fLW0rgi1cWg60Q0g@mail.gmail.com
Lists: pgsql-hackers

Hi,

With 0001 applied on PG head, I got relcache/snapshot reference leak
warnings and, on a second run, a server crash. The crash is reproducible
with enable_parallel_append=off as well. Below is a test case to
reproduce it.

SET enable_parallel_append=off;
SET parallel_setup_cost=0;
SET parallel_tuple_cost=0;
SET min_parallel_table_scan_size=0;

CREATE TABLE foo (a numeric PRIMARY KEY);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
ANALYZE foo;

CREATE TABLE bar (a numeric PRIMARY KEY);
INSERT INTO bar VALUES (6);
INSERT INTO bar VALUES (7);
ANALYZE bar;

SELECT a FROM foo WHERE a = (SELECT * FROM foo WHERE a = 1)
UNION ALL
SELECT a FROM bar WHERE a = (SELECT * FROM bar WHERE a = 1);

/*
postgres=# Select a from foo where a=(select * from foo where a=1)
UNION All
SELECT a FROM bar where a=(select * from bar where a=1);
WARNING: relcache reference leak: relation "foo" not closed
WARNING: relcache reference leak: relation "bar" not closed
WARNING: Snapshot reference leak: Snapshot 0x22f9168 still referenced
WARNING: Snapshot reference leak: Snapshot 0x22f9298 still referenced
a
---
1
(1 row)

postgres=# Select a from foo where a=(select * from foo where a=1)
UNION All
SELECT a FROM bar where a=(select * from bar where a=1);
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
*/

Stack-trace for the crash is given below :

/*
Core was generated by `postgres: parallel worker for PID
7574 '.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000712283 in ExecProcNode (node=0x0) at
../../../src/include/executor/executor.h:236
236 if (node->chgParam != NULL) /* something changed? */
(gdb) bt
#0 0x0000000000712283 in ExecProcNode (node=0x0) at
../../../src/include/executor/executor.h:236
#1 0x0000000000714304 in ExecSetParamPlan (node=0x2311dd0,
econtext=0x2312660) at nodeSubplan.c:1056
#2 0x00000000006ce19f in ExecEvalParamExec (state=0x231ab10, op=0x231ac28,
econtext=0x2312660) at execExprInterp.c:2225
#3 0x00000000006cc106 in ExecInterpExpr (state=0x231ab10,
econtext=0x2312660, isnull=0x7ffddfed6e47 "") at execExprInterp.c:1024
#4 0x00000000006cd828 in ExecInterpExprStillValid (state=0x231ab10,
econtext=0x2312660, isNull=0x7ffddfed6e47 "") at execExprInterp.c:1819
#5 0x00000000006e02a1 in ExecEvalExprSwitchContext (state=0x231ab10,
econtext=0x2312660, isNull=0x7ffddfed6e47 "") at
../../../src/include/executor/executor.h:305
#6 0x00000000006e038e in ExecQual (state=0x231ab10, econtext=0x2312660) at
../../../src/include/executor/executor.h:374
#7 0x00000000006e066b in ExecScan (node=0x2311cb8, accessMtd=0x70e4dc
<SeqNext>, recheckMtd=0x70e5b3 <SeqRecheck>) at execScan.c:190
#8 0x000000000070e600 in ExecSeqScan (pstate=0x2311cb8) at
nodeSeqscan.c:129
#9 0x00000000006deac2 in ExecProcNodeFirst (node=0x2311cb8) at
execProcnode.c:446
#10 0x00000000006e9219 in ExecProcNode (node=0x2311cb8) at
../../../src/include/executor/executor.h:239
#11 0x00000000006e94a1 in ExecAppend (pstate=0x23117a8) at nodeAppend.c:203
#12 0x00000000006deac2 in ExecProcNodeFirst (node=0x23117a8) at
execProcnode.c:446
#13 0x00000000006d5469 in ExecProcNode (node=0x23117a8) at
../../../src/include/executor/executor.h:239
#14 0x00000000006d7dc2 in ExecutePlan (estate=0x2310e08,
planstate=0x23117a8, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001', numberTuples=0,
direction=ForwardScanDirection, dest=0x22ea108, execute_once=1 '\001')
at execMain.c:1721
#15 0x00000000006d5a3b in standard_ExecutorRun (queryDesc=0x22ff1d8,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:361
#16 0x00000000006d5857 in ExecutorRun (queryDesc=0x22ff1d8,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:304
#17 0x00000000006dcb39 in ParallelQueryMain (seg=0x22561d0,
toc=0x7f49097c0000) at execParallel.c:1313
#18 0x0000000000535cdb in ParallelWorkerMain (main_arg=737000409) at
parallel.c:1397
#19 0x00000000007fcb8d in StartBackgroundWorker () at bgworker.c:841
#20 0x000000000080ffb1 in do_start_bgworker (rw=0x227cea0) at
postmaster.c:5741
#21 0x000000000081031d in maybe_start_bgworkers () at postmaster.c:5954
#22 0x000000000080f382 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5134
#23 <signal handler called>
#24 0x0000003dd26e1603 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#25 0x000000000080ab76 in ServerLoop () at postmaster.c:1721
#26 0x000000000080a365 in PostmasterMain (argc=3, argv=0x2254180) at
postmaster.c:1365
#27 0x000000000073f0b0 in main (argc=3, argv=0x2254180) at main.c:228
*/

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Tue, Mar 6, 2018 at 12:27 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Feb 27, 2018 at 6:21 AM, Rajkumar Raghuwanshi
> <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
> > I have applied 0001 on pg-head, and while playing with regression tests,
> it
> > crashed with below test case.
> >
> > postgres=# SET min_parallel_table_scan_size=0;
> > SET
> > postgres=# SELECT * FROM information_schema.foreign_data_wrapper_options
> > ORDER BY 1, 2, 3;
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
>
> Hmm, nice catch. I think part of the problem here is that commit
> 69f4b9c85f168ae006929eec44fc44d569e846b9, wherein Andres introduced
> the ProjectSet node, didn't really do the right thing in terms of
> testing parallel-safety. Before that commit, is_parallel_safe(root,
> (Node *) scanjoin_target->exprs)) was really testing whether the tlist
> produced during the scan/join phase was parallel-safe. However, after
> that commit, scanjoin_target->exprs wasn't really the final target for
> the scan/join phase any more; instead, it was the first of possibly
> several targets computed by split_pathtarget_at_srfs(). Really, the
> right thing to do is to test the *last* element in that list for
> parallel-safety, but as the code stands we end up testing the *first*
> element. So, if there's a parallel-restricted item in the target list
> (this query ends up putting a CoerceToDomain in the target list, which
> we currently consider parallel-restricted), it looks like we can
> nevertheless end up trying to project it in what is supposed to be a
> partial path.
>
> There are a large number of cases where this doesn't end up mattering
> because the next upper_rel created will not get marked
> consider_parallel because its target list will also contain the same
> parallel-restricted construct, and therefore the partial paths
> generated for the scan/join rel will never get used -- except to
> generate Gather/Gather Merge paths, which has already happened; but
> that step didn't know about the rest of the scan/join targets either,
> so it won't have used them. However, both create_distinct_paths() and
> the code in grouping_planner that populates final_rel assume that they
> don't need to retest the target for parallel-safety because no
> projection is done at those levels; they just inherit the
> parallel-safety marking of the input rel, so in those cases if the
> input rel's marking is wrong, the error is propagated upward.
>
> There's another way final_rel->consider_parallel can be wrong, too: if
> the FROM-list is empty, then we create a join rel and set its
> consider_parallel flag without regard to the parallel-safety of the
> target list. There are comments in query_planner() saying that this
> will be dealt with "later", but this seems not to be true. (Before
> Tom's commit da1c91631e3577ea5818f855ebb5bd206d559006, the comments
> simply ignored the question of whether a check was needed here, but
> Tom seems to have inserted an incorrect justification for the
> already-wrong code.)
>
> I'm not sure to what degree, if at all, any of these problems are
> visible given that we don't use final_rel->consider_parallel for much
> of anything. Certainly, it gets much easier to trigger a problem with
> 0001 applied, as the test case shows. But I'm not entirely convinced
> that there's no problem even without that. It seems like every upper
> rel that is setting its consider_parallel flag based on the first
> element of some list of targets rather than the last is potentially
> vulnerable to ending up with the wrong answer, and I'm afraid that
> might have some adverse consequence that I haven't quite pinned down
> yet.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
