[DESIGN] ParallelAppend

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: [DESIGN] ParallelAppend
Date: 2015-07-26 03:13:30
Message-ID: 9A28C8860F777E439AA12E8AEA7694F80111CF22@BPXM15GP.gisp.nec.co.jp

Hello,

I've recently been working on / investigating the ParallelAppend
feature towards the next commitfest. Below is my design proposal.

1. Concept
----------
The concept is quite simple; anybody may have considered it more
than once. A ParallelAppend node kicks background worker processes
to execute its child nodes in parallel / asynchronously.
It intends to improve the performance of scans over large
partitioned tables from the standpoint of overall throughput;
latency of the first few hundred rows is out of scope for this
project.
From the standpoint of technology trends, it primarily tries to
utilize the multi-core capability within a system, but it also
enables scaling out to a distributed database environment using
the foreign-table inheritance feature.
Its behavior is very similar to the Funnel node except for several
points, so we can reuse the infrastructure that went through
long-standing discussion during the v9.5 development cycle.

2. Problems to be solved
-------------------------
Typical OLAP workloads involve many-table joins and scans on large
tables that are often partitioned; their KPI is query response time,
while only a very small number of sessions are active simultaneously.
So we need to run a single query as fast as possible, even if it
consumes more computing resources than typical OLTP workloads.

The current implementation of heap scans is painful when we look at
its behavior from this standpoint - how many rows we can read within
a certain time - because of its synchronous manner.
In the worst case, when a SeqScan node tries to fetch the next
tuple, heap_getnext() looks up a block in the shared buffers, then
ReadBuffer() asks the storage manager to read the target block from
the filesystem if it is not already buffered. The operating system
then puts the calling process to sleep until the required i/o
completes.
Most cases are resolved at an earlier stage than this worst case;
the best scenario we can expect is that the next tuple already sits
at the head of the message queue (with visibility checks already
done, of course), with no descent into the buffer manager or deeper.
If we can run multiple scans in parallel / asynchronously, the
operating system can assign a CPU core to another process while one
is blocked, which eventually improves the i/o density and enables
higher processing throughput, as the sketch below illustrates.
The Append node is an ideal point to parallelize because:
- child nodes can be in physically different locations by tablespace,
so further tuning is possible according to the system landscape.
- it can control, per subplan, whether a subplan is actually executed
on a background worker. If the subplans cover both large and small
tables, ParallelAppend may kick background workers to scan the large
tables only, and scan the small tables by itself.
- like the Funnel node, we don't need to care about enhancing
individual node types. SeqScan, IndexScan, ForeignScan and others
can perform as usual, just in parallel.

3. Implementation
------------------
* Plan & Cost

ParallelAppend shall appear wherever Append can appear, except for
its usage as a dummy path. So I'll enhance set_append_rel_pathlist()
to add both an AppendPath and a ParallelAppendPath, each with its
own cost. The cost estimation logic needs further discussion, but
I expect the logic below to estimate the cost of ParallelAppend.
1. Sum startup_cost and run_cost for each child pathnode, but keep
synchronous and asynchronous children separate. A child pathnode
whose own total cost is less than:
  (parallel_setup_cost + its total cost / parallel_append_degree
   + number of rows * cpu_tuple_comm_cost)
is nonsense to run on a background worker, so it is treated as
synchronous.
2. parallel_setup_cost * (# of asynchronous nodes) is added to the
sum of the startup_cost of the asynchronous nodes.
3. The sum of the run_cost of the asynchronous nodes is divided by
parallel_append_degree, then cpu_tuple_comm_cost * (total # of rows
from asynchronous nodes) is added.
4. The synchronous and asynchronous costs are then added up; this
becomes the cost of ParallelAppend.
Obviously, this stands on the viewpoint that cost reflects the
response time of the underlying plan. So the cost of ParallelAppend
can be smaller than the sum of its underlying child nodes.

* Execution

Like the Funnel node, it kicks background workers in the
ExecProcNode handler, so its startup may happen later than in
Fujita-san's approach if the first call of ParallelAppend comes
late - for example, when ParallelAppend is located under a HashJoin
whose inner Hash loads billions of rows.
I expect ExecParallelAppend to do, at least, simple round-robin
scheduling like funnel_getnext(), though we may prefer the
synchronous nodes over the asynchronous ones just after background
worker startup, as sketched below.
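
Here is a standalone sketch of that scheduling idea (plain C; the
Subplan struct, next_tuple callback and workers_ready flag are made
up for illustration, not executor APIs):

#include <stdbool.h>
#include <stddef.h>

typedef struct Subplan
{
    bool    asynchronous;   /* runs on a background worker? */
    bool    exhausted;      /* no more tuples from this subplan */
    void *(*next_tuple)(struct Subplan *sp);  /* NULL when drained */
} Subplan;

/*
 * Round-robin over the subplans, like funnel_getnext(); while the
 * background workers are still starting up, skip the asynchronous
 * subplans and make progress on the synchronous ones instead.
 */
void *
parallel_append_next(Subplan *subplans, int nplans,
                     int *current, bool workers_ready)
{
    for (int tried = 0; tried < nplans; tried++)
    {
        Subplan *sp = &subplans[*current];

        *current = (*current + 1) % nplans;

        if (sp->exhausted)
            continue;
        if (sp->asynchronous && !workers_ready)
            continue;       /* don't wait on a worker queue yet */

        void *tuple = sp->next_tuple(sp);

        if (tuple != NULL)
            return tuple;
        sp->exhausted = true;
    }
    return NULL;    /* nothing available right now, or all drained */
}
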

4. Further challenges
----------------------
* Serialization of CustomScan via outfuncs.c/readfuncs.c
Because the methods field is, basically, a set of per-process
pointers, we need an infrastructure that reproduces the same method
table on the background worker process, identified by name.
(I'll also try to design it.)
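
One possible shape of that infrastructure, as a standalone sketch
(plain C; register_scan_methods / lookup_scan_methods are
hypothetical names): the extension registers its method table under
a string name in every process - e.g. from its _PG_init() -
outfuncs.c writes out the name, and readfuncs.c on the background
worker looks up the pointer set again:

#include <stddef.h>
#include <string.h>

/* stand-in for the real CustomScanMethods pointer set */
typedef struct ScanMethods
{
    const char *name;       /* the key that outfuncs.c writes out */
    void      (*begin)(void);
    void      (*exec)(void);
    void      (*end)(void);
} ScanMethods;

#define MAX_METHODS 32
static const ScanMethods *registry[MAX_METHODS];
static int  nregistered = 0;

/* each process that loads the extension registers the same table */
void
register_scan_methods(const ScanMethods *methods)
{
    if (nregistered < MAX_METHODS)
        registry[nregistered++] = methods;
}

/* readfuncs.c side: turn the serialized name back into pointers */
const ScanMethods *
lookup_scan_methods(const char *name)
{
    for (int i = 0; i < nregistered; i++)
    {
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    }
    return NULL;        /* extension not loaded in this process */
}
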

* Duplication of parallelism
If Funnel+PartialSeqScan is located under ParallelAppend, directly
or indirectly, it eventually leads a background worker process to
launch further background workers. Is that an expected usage of the
current background worker machinery?

* Join pushdown
Distributing nested-loop and hash joins may benefit from parallel
processing, and from a reduction of the hash table size when the
CHECK() constraints of the individual partitioned tables tell us
which rows obviously cannot join.
Also see the thread:
[idea] table partition + hash join: http://bit.ly/1S2xpHT
My colleague has already started to investigate / develop this
feature based on the existing Append, to reduce num_batches.

As an aside, my GpuJoin feature works most effectively if the entire
inner relations can be loaded into a hash table in GPU RAM, so these
features are very welcome.

* Sort break-down
If a merge join has a ParallelAppend node on its left or right
input, we may be able to compare its cost with MergeParallelAppend
+ Sort on each partial relation.

* Aggregate Push Down
It is exactly what I want to do.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
