Re: [DESIGN] ParallelAppend

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [DESIGN] ParallelAppend
Date: 2015-07-28 02:29:44
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80111E008@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Kouhei Kaigai
> Sent: Monday, July 27, 2015 11:07 PM
> To: Amit Kapila
> Cc: pgsql-hackers(at)postgresql(dot)org; Robert Haas; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
>
> > On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> > >
> > > Hello,
> > >
> > > I'm recently working/investigating on ParallelAppend feature
> > > towards the next commit fest. Below is my design proposal.
> > >
> > > 1. Concept
> > > ----------
> > > The concept is simple, and likely one that anybody has considered
> > > at least once: a ParallelAppend node kicks off background worker
> > > processes to execute its child nodes in parallel / asynchronously.
> > > It aims to improve the throughput of scans over large partitioned
> > > tables; the latency of the first few hundred rows is out of scope
> > > for this project.
> > > From the standpoint of technology trends, it primarily tries to
> > > utilize the multi-core capability within a single system, but it
> > > also allows scaling out to a distributed database environment via
> > > the foreign-table inheritance feature.
> > > Its behavior is very similar to the Funnel node except for a few
> > > points, so we can reuse the infrastructure we discussed at length
> > > through the v9.5 development cycle.
> > >
> > > 2. Problems to be solved
> > > -------------------------
> > > Typical OLAP workloads involve many joins and scans over large
> > > tables, which are often partitioned; their KPI is query response
> > > time, while only a small number of sessions are active
> > > simultaneously. So we need to run a single query as quickly as
> > > possible, even if it consumes more computing resources than a
> > > typical OLTP workload would.
> > >
> > > The current heap-scan implementation is painful when we look at
> > > its behavior from this standpoint - how many rows can we read
> > > within a certain time - because of its synchronous manner.
> > > In the worst case, when a SeqScan node tries to fetch the next
> > > tuple, heap_getnext() looks up a block in shared buffers, then
> > > ReadBuffer() asks the storage manager to read the target block
> > > from the filesystem if it is not already buffered, and the
> > > operating system puts the calling process to sleep until the
> > > required I/O completes.
> > > Most cases are resolved at an earlier stage than this worst case;
> > > still, the best scenario we can expect is that the next tuple
> > > already sits on top of the message queue (with visibility checks
> > > already done) with no descent into the buffer manager or deeper.
> > > If we can run multiple scans in parallel / asynchronously, the
> > > operating system can assign CPU cores to other processes, which
> > > eventually improves the I/O density and enables higher processing
> > > throughput.
> > > An Append node is an ideal point to parallelize because:
> > > - child nodes can sit in physically different locations via
> > > tablespaces, so further tuning is possible according to the
> > > system landscape.
> > > - it can control, per subplan, whether a subplan actually runs on
> > > a background worker. If the plan mixes large and small tables,
> > > ParallelAppend may kick off background workers to scan the large
> > > tables only, and scan the small tables by itself.
> > > - as with the Funnel node, we do not need to enhance individual
> > > node types. SeqScan, IndexScan, ForeignScan and others perform
> > > as usual, but in parallel.
> > >
> > >
> > > 3. Implementation
> > > ------------------
> > > * Plan & Cost
> > >
> > > ParallelAppend shall appear wherever Append can appear, except
> > > when Append is used as a dummy. So I'll enhance
> > > set_append_rel_pathlist() to add both an AppendPath and a
> > > ParallelAppendPath, each with its own cost.
> > >
> >
> > Is there a real need for a new node like ParallelAppendPath?
> > Can't we have a Funnel node beneath the Append node, with each
> > worker responsible for a SeqScan on one inherited child relation?
> > Something like:
> >
> > Append
> >  --> Funnel
> >       --> SeqScan on rel1
> >       --> SeqScan on rel2
> >
> If Funnel can handle both horizontal and vertical parallelism,
> that is a great simplification; I don't insist on a new node.
>
> Once Funnel gains the capability to have multiple child nodes, the
> Append node above it can probably go away. I expect
> set_append_rel_pathlist() to add two paths, one based on Append and
> one on Funnel, and the planner to choose the cheaper one according
> to its cost.
>
In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan    scan;           /* its basic Scan node fields */
    int     num_workers;    /* number of background workers */
} Funnel;

If we try to add Append capability here, I expect the structure will
be adjusted as follows, for example:

typedef struct Funnel
{
    Scan    scan;                   /* its basic Scan node fields */
    List   *funnel_plans;           /* underlying Plan nodes */
    List   *funnel_num_workers;     /* workers expected per child plan */
} Funnel;

As the names suggest, funnel_plans stores the underlying Plan nodes
instead of the lefttree, and funnel_num_workers stores the number of
workers expected to be assigned to each individual child plan.

Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this setting if two or more
PartialSeqScan nodes are underlying.
It is a separate setting from max_worker_processes, so this is not a
problem as long as that other restriction is in place.
However, how do we control the cap on the number of worker processes
per "appendable" Funnel node? For example, suppose a parent table has
200 child tables but max_worker_processes is set to 50.
It is obviously impossible to launch all the background workers
simultaneously. One idea I have is to postpone launching some plans
until earlier ones have completed.
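That "postpone until a slot frees up" idea could be sketched as below.
This is an illustrative model, not PostgreSQL code: the scheduler
struct and function names are made up for the example, and a real
implementation would go through the dynamic background worker API.

```c
#include <assert.h>

/* Hypothetical scheduler: launch child plans under a global worker
 * cap; children that cannot get a slot are deferred until a running
 * child completes. */
typedef struct
{
    int nchildren;      /* total number of child plans */
    int next_child;     /* index of next child not yet launched */
    int running;        /* children currently running */
    int max_workers;    /* global cap, e.g. max_worker_processes */
} FunnelSched;

/* Launch as many pending children as the cap allows;
 * returns the number launched in this pass. */
static int
sched_launch(FunnelSched *fs)
{
    int launched = 0;

    while (fs->next_child < fs->nchildren &&
           fs->running < fs->max_workers)
    {
        fs->next_child++;
        fs->running++;
        launched++;
    }
    return launched;
}

/* A running child finished; free its slot so a deferred child
 * can be launched on the next sched_launch() call. */
static void
sched_finish(FunnelSched *fs)
{
    fs->running--;
}
```

With 200 children and a cap of 50, the first pass launches 50 plans;
each completion then admits one more deferred plan.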

> We will need to pay attention to other issues that arise when Funnel
> kicks off background workers for asymmetric relations.
>
> If the numbers of rows in the individual child nodes vary widely, we
> may want to assign 10 background workers to scan rel1 with
> PartialSeqScan, while rel2 may have so few rows that its total_cost
> is smaller than the cost of launching a worker. In this case, Funnel
> has some child nodes to execute asynchronously and others
> synchronously.
>
> If the cheapest path for a child relation is a pair of Funnel and
> PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
> Funnel node that performs like Append needs to pull up the underlying
> Funnel and assign an equivalent number of workers, as follows:
>
> Append
>  --> Funnel
>       --> PartialSeqScan on rel1 (num_workers = 4)
>  --> Funnel
>       --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3
>
> shall be rewritten to
>
> Funnel
>  --> PartialSeqScan on rel1 (num_workers = 4)
>  --> PartialSeqScan on rel2 (num_workers = 8)
>  --> SeqScan on rel3 (num_workers = 1)
>
> We also need to consider whether Funnel should gain a capability
> equivalent to MergeAppend, even though parallel sorting is a
> fantastic challenge.
>
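The pull-up rewrite quoted above can be modeled with a small recursive
flattening pass. This is an illustrative sketch only: the node kinds
and struct layout are invented for the example and do not match the
real PostgreSQL Plan structures. Intermediate Funnels are skipped so no
Funnel ends up stacked under another, and a plain SeqScan (run by the
leader) is treated as num_workers = 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plan-node model for demonstrating the Funnel pull-up. */
typedef enum { NODE_FUNNEL, NODE_PARTIALSEQSCAN, NODE_SEQSCAN } NodeKind;

typedef struct PlanNode
{
    NodeKind kind;
    int num_workers;                /* workers requested for this scan */
    struct PlanNode *children[8];   /* child plans, NULL-terminated */
} PlanNode;

/* Collect the scan nodes beneath `node` into `out`, descending through
 * any nested Funnel so that the result is one flat list of scans.
 * Returns the number of entries in `out`. */
static int
flatten_scans(PlanNode *node, PlanNode **out, int n)
{
    if (node->kind == NODE_FUNNEL)
    {
        for (int i = 0; node->children[i] != NULL; i++)
            n = flatten_scans(node->children[i], out, n);
    }
    else
    {
        if (node->kind == NODE_SEQSCAN)
            node->num_workers = 1;  /* executed by the leader itself */
        out[n++] = node;
    }
    return n;
}
```

Applied to the tree quoted above, the two nested Funnels disappear and
the top-level Funnel is left with the three scans and their original
worker counts.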
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
