Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Doug Doole <ddoole(at)salesforce(dot)com>
Subject: Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)
Date: 2016-12-06 23:22:31
Message-ID: 20161206232231.ajn6r5bww63v4ntu@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I tried to address 2) by changing the C implementation. That brings some
> > measurable speedups, but it's not huge. A bigger speedup is making
> > slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> > but it's still not huge. Finally I turned to just-in-time (JIT)
> > compiling the code for tuple deforming. That doesn't save the cost of
> > 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The
> > first part is done in 0008, the JITing in 0012.
>
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as
I've not implemented that for enough expression types... And you quickly
run into other bottlenecks.
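
(For context, the "trivial wrappers" bit from 0008 quoted above boils
down to something like the sketch below - not the actual patch, and it
glosses over system attributes, but it shows the shape: slot_getattr
just defers to slot_getsomeattrs() when the attribute hasn't been
deformed yet.)

static inline Datum
slot_getattr(TupleTableSlot *slot, int attnum, bool *isnull)
{
    Assert(attnum > 0);

    /* deform up to and including this attribute, if not done already */
    if (attnum > slot->tts_nvalid)
        slot_getsomeattrs(slot, attnum);

    *isnull = slot->tts_isnull[attnum - 1];
    return slot->tts_values[attnum - 1];
}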

But here we go for TPCH (scale 10) Q01:
master:
Time: 33885.381 ms

profile:
16.29% postgres postgres [.] slot_getattr
12.85% postgres postgres [.] ExecMakeFunctionResultNoSets
10.85% postgres postgres [.] advance_aggregates
6.91% postgres postgres [.] slot_deform_tuple
6.70% postgres postgres [.] advance_transition_function
4.59% postgres postgres [.] ExecProject
4.25% postgres postgres [.] float8_accum
3.69% postgres postgres [.] tuplehash_insert
2.39% postgres postgres [.] float8pl
2.20% postgres postgres [.] bpchareq
2.03% postgres postgres [.] check_stack_depth

(note that the expression evaluation work is distributed among many
functions)

dev (no jiting):
Time: 30343.532 ms

profile:
16.57% postgres postgres [.] slot_deform_tuple
13.39% postgres postgres [.] ExecEvalExpr
8.64% postgres postgres [.] advance_aggregates
8.58% postgres postgres [.] advance_transition_function
5.83% postgres postgres [.] float8_accum
5.14% postgres postgres [.] tuplehash_insert
3.89% postgres postgres [.] float8pl
3.60% postgres postgres [.] slot_getattr
2.66% postgres postgres [.] bpchareq
2.56% postgres postgres [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
11.11% postgres postgres [.] slot_deform_tuple
10.87% postgres postgres [.] advance_aggregates
9.74% postgres postgres [.] advance_transition_function
6.53% postgres postgres [.] float8_accum
5.25% postgres postgres [.] tuplehash_insert
4.31% postgres perf-10698.map [.] deform0
3.68% postgres perf-10698.map [.] evalexpr6
3.53% postgres postgres [.] slot_getattr
3.41% postgres postgres [.] float8pl
2.84% postgres postgres [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_tuple cost here is primarily cache misses. If you do the
"memory order" iteration, it drops significantly.

The JIT generated code still leaves a lot on the table, i.e. this is
definitely not the best we can do. We also deform half the tuple twice,
because I've not yet added support for starting to deform in the middle
of a tuple.
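
(To make "starting in the middle" concrete: the interpreted path resumes
at slot->tts_nvalid, conceptually like the simplified, hypothetical
helper below, whereas the generated deform functions currently always
start at attribute 0, so anything already deformed gets deformed again.
This assumes the usual tuptable.h/htup_details.h definitions; the real
slot_deform_tuple() additionally walks the tuple once and caches the
byte offset in tts_off rather than calling heap_getattr() per column.)

static void
slot_deform_from(TupleTableSlot *slot, int natts)
{
    HeapTuple   tuple = slot->tts_tuple;
    TupleDesc   tupleDesc = slot->tts_tupleDescriptor;
    int         attnum;

    /* fetch only the attributes that haven't been deformed yet */
    for (attnum = slot->tts_nvalid; attnum < natts; attnum++)
        slot->tts_values[attnum] = heap_getattr(tuple, attnum + 1, tupleDesc,
                                                &slot->tts_isnull[attnum]);

    /* remember how far we got, so the next call can resume here */
    slot->tts_nvalid = natts;
}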

Independent of the new expression evaluation and/or JITing, if you make
advance_aggregates and advance_transition_function inline functions (or
do profiling that accounts for children), you'll notice that ExecAgg()
+ advance_aggregates + advance_transition_function by themselves take up
about 20% of the CPU time. That's *not* including the hashtable
management, the actual transition functions, and so on.

If you have queries where tuple deforming is a bigger proportion of the
load, or where expression evaluation (including projection) is a larger
part (e.g. anything with NULLs), you can get much bigger wins, even
without actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation state
type instead of a postgres array...
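
(Concretely, something along the lines of the sketch below, registered
as an "internal" transtype the way e.g. the numeric avg aggregates do
it, instead of the float8[3] array of N, sum(X) and sum(X*X) that
float8_accum currently builds. Names are made up, this is just to
illustrate the shape:)

typedef struct Float8AccumState
{
    double      N;          /* number of non-NULL inputs seen */
    double      sumX;       /* sum of inputs */
    double      sumX2;      /* sum of squares of inputs */
} Float8AccumState;

/* what the per-row transition work would boil down to */
static inline void
float8_accum_update(Float8AccumState *state, double newval)
{
    state->N += 1.0;
    state->sumX += newval;
    state->sumX2 += newval * newval;
}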

Andres
