Re: Redesigning the executor (async, JIT, memory efficiency)

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andres(at)anarazel(dot)de
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning the executor (async, JIT, memory efficiency)
Date: 2018-05-28 04:26:23
Message-ID: 20180528.132623.179362206.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thanks! I've been waiting for this.

At Thu, 24 May 2018 20:35:39 -0700, Andres Freund <andres(at)anarazel(dot)de> wrote in <20180525033538(dot)6ypfwcqcxce6zkjj(at)alap3(dot)anarazel(dot)de>
> The current executor structure limits us in a number of ways that are
> becoming problematic. I'll outline in this email how I think we should
> rework it. Including a prototype for the first, most drastic, steps.
>
> The primary problems with the current design that I want to address are:
>
> 1) Executor structure makes it hard to do asynchronous execution. That
> comes into play in several cases. For example we want to be able to
> continue execution if no tuples are ready in one FDW below an append
> node, and instead switch to another one. Similarly that could be relevant
> for parallel queries, with multiple gather nodes.
>
> A bit further out that'd also be very useful for continuing execution
> in a different part of the tree once blocked on IO. That'd e.g. be
> useful to build several hashtables for hashjoins at once. There are
> some limitations in how we do IO for that atm, however; it'd be easier
> to implement this efficiently if we had AIO support.
>
> With the current, tree-walking based architecture, it's hard to
> implement something like that, because we basically have to re-enter
> into the subtrees where we'd stopped. That'd require adding new
> state-machine logic in every tree, including additional branches.
>
> What we basically need here is the ability to stop execution inside a
> specific executor node and have a multiplexer node (which can use
> WaitEventSets or such) decide which part of the execution tree to
> continue executing instead.

Yeah. Async execution ultimately requires that capability.

> 2) The tree-walking based approach makes it very hard to JIT compile the
> main portion of query execution without manually re-emitting all of
> executor/, which is obviously entirely unattractive.
>
> For the expression evaluation JIT the biggest benefit comes from
> removing the indirect branches between the expression steps and then,
> a close second, from having more efficient implementations of
> hot-path expression steps.
>
> That approach currently isn't feasible for the whole executor. The
> main sources for indirect branches / calls are ExecProcNode() and
> ExecEvalExpr() calls inside the Exec*() nodes. To be able to make
> those constant we'd basically have to manually emit code for all of
> these (or attempt to rely on the inliner, but that doesn't work
> across deep, potentially self invoking, trees). The expression code,
> which also used to be tree walking, avoids that now by having the
> indirect function calls nearly exclusively at the top-level.

I looked into this a bit. One of the hardest parts of flattening the
executor (in my previous attempt) *was* turning the Exec* functions
inside-out, especially Agg/WindowAgg. The core of the difficulty was
that the result could be utterly unreadable if *I* were to reconstruct,
say, WindowAgg or non-hash Agg into functions that are called with
tuples from the underlying nodes. In the repo, nodeAgg is flattened
only for the hash-agg case, so it looks rather straightforward, but I'm
afraid such complex nodes would become collections of mystic fragments
that are expected to work when called in a sequence written as a magic
spell cast by the compiler part. (Sorry in advance if this is hard to
read.)

A related concern is that in the end those "mystic fragments" are no
longer node-private but global functions called from ExecExecProgram.
JITing is essentially a kind of that, and in the current ExecExpr case
it seems to me tolerably small. However, I'm not sure it stays
maintainable (for ordinary developers) once all such fragments are
connected via ExecExecProgram, following a magic spell.

> Secondary issues that I also want to address, or at least get ready to
> address, are:
>
>
> 3) Reduce the overhead of executor startup. In a number of workloads
> involving simple queries we currently spend most of the time within
> ExecInitNode(). There's two major problems here: For one a lot of
> work is done at executor initialization that should be part of
> planning (especially creating tupledescs and nodeAgg
> initialization). The other problem is that initialization does a lot
> of tiny allocations.
>
>
> 4) Make the executor faster, leaving JIT aside.
>
> Moving to a "push based" executor can yield noticeable speedups: Right
> now returning a single tuple will often pass through a lot of
> indirect function calls (via ExecProcNode()). By moving to a push
> based executor (where the first executed piece is reachable without
> any recursion), this can be significantly cheaper. We also exercise
> the stack *a lot* right now, constantly pushing/popping stuff onto it
> that's not going to be used anytime soon. Profiling shows that stack
> usage is quite a bottleneck right now.
>
>
> 5) Memory overhead of query execution.
>
> Right now executing a prepared statement has quite a massive memory
> overhead. We copy around a lot of memory, initialize a lot of nodes
> from scratch, etc. It'd be far better if we could share a lot of
> state between successive executions of the same prepared statement.
>
>
> 6) Plan ahead for vectorized / batched execution.
>
> Dealing with tuples one-by-one is inefficient from a cache locality,
> code locality and CPU pipelining perspective. Instead of fully
> processing a single tuple from a page (evaluating qual, then joining
> it to other things etc), it's a lot more efficient to evaluate the
> qual for all the tuples on a page, and then pass all the matching
> tuples to the next execution step.
>
>
> After playing with / pondering about various approaches, I think the
> right way to approach this is to replace the tree-walk based execution
> with a 'program' / bytecode interpreter based approach. Quite similar to
> what we've done to expression evaluation in v10 (which then subsequently
> was used to implement JIT compilation of expression evaluation in v11).

I agree with the direction overall, but I have the maintainability
concern described above.

> To address the memory density / startup overhead issue I think we need
> to work towards two things: First, most of the memory needed for a query
> should come from a single, accurately sized, allocation. Secondly, the
> writable memory for query execution should be referenced using memory
> offsets, rather than pointers. That way we can clone the writable memory
> for query execution.
>
>
> That's obviously quite a bit of a project. Or rather projects.
>
>
> I've started prototyping a solution for this. The code for the
> prototype is at
> https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=tree;h=refs/heads/executor-rewrite;hb=refs/heads/executor-rewrite
> (click on the repo to see the git url)
> and I plan to continue working on that project in that git branch.
>
>
> In my first iteration I'd tried to basically address 1-4 in a single
> prototype, using a single pass over the plan tree, without using the
> current executor initialization. Not impossible, but very very unwieldy.
>
> In the second version of the prototype I'm instead only tackling 1-2,
> and reuse the current executor node initialization functions. This
> prototype adds support for generating and executing an opcode based
> execution plan for a subset of query nodes (seqscan, index [only] scan,
> hash / sort based aggregation without grouping sets, limit, inner
> nestloop / hashjoin without batches) when the use_linearized_plan is set
> and all nodes in the query are supported.
>
> For example, this converts TPCH's Q01:
>
> tpch_1[26142][1]=# SET client_min_messages ='log';
> tpch_1[26142][1]=# SELECT
> l_returnflag,
> l_linestatus,
> sum(l_quantity) AS sum_qty,
> sum(l_extendedprice) AS sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
> avg(l_quantity) AS avg_qty,
> avg(l_extendedprice) AS avg_price,
> avg(l_discount) AS avg_disc,
> count(*) AS count_order
> FROM
> lineitem
> WHERE
> l_shipdate <= date '1998-12-01' - interval '74 days'
> GROUP BY
> l_returnflag,
> l_linestatus
> ORDER BY
> l_returnflag,
> l_linestatus;
> LOG: 00000: pp:
>
> into:
>
> 0: init_sort
> 1: seqscan_first
> 2: seqscan [j empty 5] > s0
> 3: qual [j fail 2] < scan s0
> 4: hashagg_tuple [j 2] < s0
> 5: drain_hashagg [j empty 7] > s1
> 6: sort_tuple [j 5] < s1
> 7: sort
> 8: drain_sort [j empty 10] > s2
> 9: return < s2 [next 8]
> 10: done
>
> where the sN are numbered slots and j are (potentially conditional)
> jumps. This works for a number of plans, but there are several where we
> currently bail out.
>
> The basic idea is to convert the current, largely unmodified, Plan tree
> into a "linear" representation of steps that can be executed by an
> interpreter similar to ExecInterpExpr(). Individual parts of a query
> tree form "instructions" which together form a small program to execute
> the query. Obviously such a program will contain (conditional) backward
> and forward jumps (which differentiates it from the expression case,
> which knows only forward jumps).
>
> This means that some of the current executor nodes get split into
> multiple parts, because it's not permitted to recurse from within an
> executor step.
>
> A tuple can be returned to the user / caller by a special 'return tuple'
> step. This just sets the "virtual" instruction pointer to the point
> where execution should resume, allowing execution to continue at the
> right spot when the next tuple is supposed to be returned (e.g. for
> cursors). Instead of, as currently in the prototype, returning control
> flow back to ExecutePlan() we could - and quite possibly should -
> instead just process the tuple by having a 'send_slot' step or such.
> But right now it's a good example how control flow can be interrupted at
> any boundary.
>
> For the async FDW (and gather, and IO, ...) cases, you'd have a
> 'scan_fdw' step that'd normally continue execution, but if there's no
> tuple, it'd instead jump to a multiplexer step that'd use a WaitEventSet
> to watch which of the to-be-waited-for FDWs is ready.

So in the end the executor becomes push-based, with all "source" nodes
driving execution: not only FDW nodes but every source node, including
local scans, I suppose.

> Despite being entirely unoptimized this already yields a measurable
> speedup over the current executor for the TPCH queries it supports.
>
>
> I think the big argument against this kind of project is that it's going
> to be quite invasive. Not everything inside the executor will have to be
> touched, but it's going to be a significant portion. But I don't think
> there's really a way around that for the medium to long term future, and
> I don't see how we'd gain anything by waiting further.
>
>
> My (although I certainly hope/plan to not work on this alone) basic plan
> to attack this is:
>
> 1) Expand prototype to cover all SELECT queries. That's mostly "just
> work". This'll reuse the current executor initialization with minimal
> modifications.
>
> 2) Reimplement EXPLAIN ANALYZE support. I'm not yet 100% sure what the
> best way to do so is - potentially it's just generating slightly
> different programs, where the steps contain additional code to
> perform the instrumentation.
>
> 3) Reimplement EvalPlanQual() infrastructure by generating a separate
> execution plan for EPQ, which replaces the relevant scan nodes with
> an execution step that returns tuples from estate->es_epqTuple. The
> current indirection / branches due to ExecScan() are fairly
> expensive.
>
> 4) Remove old executor *Exec() functions, and subsidiary code.
>
> 5) Clean up executor nodes, they should now require far less
> information, because a lot of the state will be in executor steps.
>
> 6) Introduce a new pass, run before ExecInitNode(), that accounts for
> all the necessary memory for, at least, nodes, slots, expression
> contexts. Then convert ExecInitNode() to allocate from that. I'd
> assume that we'd leave expression evaluation out of that, because
> such work would be a largely independent project.
>
> 7) Split state in the executor nodes / executor program steps into
> modifiable and non-modifiable parts, and store as much as reasonably
> possible using relative addresses rather than absolute pointers.
> That can then be used to make executor programs reusable and
> reentrant. That's going to be useful for at least prepared
> statements, plpgsql and better JIT code generation.
>
> 8) Write a JIT implementation.
>
>
> Thoughts?
>
>
> Regards,
>
> Andres

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
