Redesigning the executor (async, JIT, memory efficiency)

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Redesigning the executor (async, JIT, memory efficiency)
Date: 2018-05-25 03:35:39
Message-ID: 20180525033538.6ypfwcqcxce6zkjj@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

The current executor structure limits us in a number of ways that are
becoming problematic. I'll outline in this email how I think we should
rework it. Including a prototype for the first, most drastic, steps.

The primary problems with the current design that I want to address are:

1) Executor structure makes it hard to do asynchronous execution. That
comes into play in several cases. For example we want to be able to
continue execution if no tuples are ready in one FDW below an append
node, instead switch to another one. Similarly that could be relevant
for parallel queries, with multiple gather nodes.

A bit further out that'd also be very useful for continuing execution
in a different part of the tree once blocked on IO. That'd e.g. be
useful to build several hashtables for hashjoins at once. There's
some limitations of how we do IO for that atm however, it'd be easier
to efficiently implement this if we had AIO support.

With the current, tree-walking based architecture, it's hard to
implement something like that, because we basically have to re-enter
into the subtrees where we'd stopped. That'd requires adding new
state-machine logic in every tree, including additional branches.

What we basically need here is the ability to stop execution inside a
specific executor node and have a multiplexer node (which can use
WaitEventSets or such) decide which part of the execution tree to
continue executing instead.

2) The tree-walking based approach makes it very hard to JIT compile the
main portion of query execution without manually re-emitting all of
executor/, which is obviously entirely unattractive.

For the expression evaluation JIT the biggest benefit comes from
removing the indirect branches between the expression steps and then,
a close second, from having more efficient implementations of
hot-path expression steps.

That approach currently isn't feasible for the whole executor. The
main sources for indirect branches / calls are ExecProcNode() and
ExecEvalExpr() calls inside the the Exec*() nodes. To be able to make
those constant we'd basically have to manually emit code for all of
these (or attempt to rely on the inliner, but that doesn't work
across deep, potentially self invoking, trees). The expression code,
which also used to be tree walking, avoids that now by having the
indirect function calls nearly exclusively at the top-level.

Secondary issues that I also want to address, or at least get ready to
address, are:

3) Reduce the overhead of executor startup. In a number of workloads
involving simple queries we currently spend most of the time within
ExecInitNode(). There's two major problems here: For one a lot of
work is done at executor initialization that should be part of
planning (especially creating tupledescs and nodeAgg
initialization). The other problem is that initialization does a lot
of tiny allocations.

4) Make the executor faster, leaving JIT aside.

Moving to a "push based" executor can yield noticable speedups: Right
now returning a single tuple will often pass through a lot of
indirect function calls (via ExecProcNode()). By moving to a push
based executor (where the first executed piece is reachable without
any recursion), this can be significantly cheaper. We also exercise
the stack *a lot* right now, constantly pushing/poping stuff onto it
that's not going to be used anytime soon. Profiling shows that stack
usage is quite a bottleneck right now.

5) Memory overhead of query execution.

Right now executing a prepared statment has quite a massive memory
overhead. We copy around a lot of memory, initialize a lot of nodes
from scratch, etc. It'd be far better if we could share a lot of
state between successive executions of the same prepared statement.

6) Plan ahead for vectorized / batched execution.

Dealing with tuples one-by-one is inefficient from a cache locality,
code locality and CPU pipelining perspective. Instead fully
processing a single tuple from a page (evaluating qual, then joining
it to other things etc), it's a lot more efficient to evaluate the
qual for all the tuples on a page, and then pass all the matching
tuples to the next execution step.

After playing with / pondering about various approaches, I think the
right way to approach this is to convert the tree-walk based execution
with a 'program' / bytecode interpreter based approach. Quite similar to
what we've done to expression evaluation in v10 (which then subsequently
was used to implement JIT compilation of expression evaluation in v11).

To address the memory density / startup overhead issue I think we need
to work towards two things: First, most of the memory needed for a query
should come from a single, accurately sized, allocation. Secondly, the
writable memory for query execution should be referenced using memory
offsets, rather than pointers. That way we can clone the writable memory
for query execution.

That's obviously quite a bit of a project. Or rather projects.

I've started prototyping a solution for this. The code for the
prototype is at
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
(click on the repo to see the git url)
and I plan to continue working on that project in that git branch.

In my first iteration I'd tried to basically address 1-4 in a single
protytype, using a single pass over the plan tree, without using the
current executor initialization. Not impossible, but very very unwieldy.

In the second version of the prototype I'm instead only tackling 1-2,
and reuse the current executor node initialization functions. This
prototype adds support for generating and executing an opcode based
execution plan for a subset of query nodes (seqscan, index [only] scan,
hash / sort based aggregation without grouping sets, limit, inner
nestloop / hashjoin without batches) when the use_linearized_plan is set
and all nodes in the query are supported.

For example, this converts this converts TPCH's Q01:

tpch_1[26142][1]=# SET client_min_messages ='log';
tpch_1[26142][1]=# SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= date '1998-12-01' - interval '74 days'
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
LOG: 00000: pp:

into:

0: init_sort
1: seqscan_first
2: seqscan [j empty 5] > s0
3: qual [j fail 2] < scan s0
4: hashagg_tuple [j 2] < s0
5: drain_hashagg [j empty 7] > s1
6: sort_tuple [j 5] < s1
7: sort
8: drain_sort [j empty 10] > s2
9: return < s2 [next 8]
10: done

where s are numbered slots, j are, potentially conditional, jumps. This
works for a number of plans, but there's several where we currently bail
out.

The basic idea is to convert the current, largely unmodified, Plan tree
into a "linear" representation of steps that can be executed by an
interpreter similar to ExecInterpExpr(). Individual parts of a query
tree form "instructions" which together form a small program to execute
the query. Obviously such a query will contain (conditional) backward
and forward jumps (which differentiates it from the expression case,
which knows only forward jumps).

This means that some of the current executor nodes get split into
multiple parts, because it's not permitted to recurse from within an
executor step.

A tuple can be returned to the user / caller by a special 'return tuple'
step. This just sets the "virtual" instruction pointer to the point
where execution should resume, allowing execution to continue at the
right spot when the next tuple is supposed to be returned (e.g. for
cursors). Instead of, as currently in the prototype, returning control
flow back to ExecutePlan() we could - and quite possibly should -
instead just process the tuple by having a 'send_slot' step or such.
But right now it's a good example how control flow can be interrupted at
any boundary.

For the async FDW (and gather, and IO, ...) cases, you'd have a
'scan_fdw' step that'd normally continue execution, but if there's no
tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
to watch which of the to-be-waited-for FDWs is ready.

Despite being entirely unoptimized this already yields a measurable
speedup over the current executor for the TPCH queries it supports.

I think the big argument against this kind of project is that it's going
to be quite invasive. Not everything inside the executor will have to be
touched, but it's going to be a significant portion. But I don't think
there's really a way around that for the medium to long term future, and
I don't see how we'd gain anything by waiting further.

My (although I certainly hope/plan to not work on this alone) basic plan
to attack this is:

1) Expand prototype to cover all SELECT queries. That's mostly "just
work". This'll reuse the current executor initialization with minimal
modifications.

2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the
best way to do so is - potentially it's just generating slightly
different programs, where the steps contain additional code to
perform the instrumentation.

3) Reimplement EvalPlanQual() infrastructure by generating a separate
execution plan for EPQ, which replaces the relevant scan nodes with
an execution step that return tuples from estate->es_epqTuple. The
current indirection / branches due to ExecScan() are fairly
expensive.

4) Remove old executor *Exec() functions, and subsidiary code.

5) Clean up executor nodes, they should now require far less
information, because a lot of the state will be in executor steps.

6) Introduce a new pass, run before ExecInitNode(), that accounts for
all the necessary memory for, at least, nodes, slots, expression
contexts. Then convert ExecInitNode() to allocate from that. I'd
assume that we'd leave expression evaluation out of that, because
such work would be a largely independent project.

7) Split state in the executor nodes / executor program steps into
modifiable and non-modifiable parts, and store as much as reasonably
possible using relative addresses rather than absolute pointers.
That can then be used to make executor programs reusable and
reentrant. That's going to be useful for at least prepared
statement, plpgsql and better JIT code generation.

8) Write a JIT implementation.

Thoughts?

Regards,

Andres

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2018-05-25 04:19:58 Re: Fix some error handling for read() and errno
Previous Message Tom Lane 2018-05-25 03:19:02 Re: Keeping temporary tables in shared buffers