JIT performance bug/regression & JIT EXPLAIN

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: JIT performance bug/regression & JIT EXPLAIN
Date: 2019-09-27 07:20:53
Message-ID: 20190927072053.njf6prdl3vb7y7qb@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Unfortunately I found a performance regression for JITed query
compilation introduced in 12, compared to 11. Fixed in one of the
attached patches (v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch
- which needs a better commit message).

The first question is when to push that fix. I'm inclined to just do so
now - as we still do JITed tuple deforming in most cases, as well as
doing so in 11 in the places this patch fixes, the risk of that seems
low. But I can also see an arguments for waiting after 12.0.

For me the bigger question is about how to make sure we can write tests
determining which parts of the querytree are JIT compiled and which are
not. There's the above bug, and I'm also hunting a regression introduced
somewhere during 11's lifetime, which suggests to me that we need better
coverage. I also want to add new JIT logic, making this even more
important.

The reason that 11 didn't have tests verifying that certain parts of the
plan tree are JIT compiled is that EXPLAIN doesn't currently show the
relevant information, and it's not that trivial to do so.

What I'd like to do is to add additional, presumably optional, output to
EXPLAIN showing additional information about expressions.

There's two major parts to doing so:

1) Find a way to represent the additional information attached to
expressions, and provide show_expression et al with the ExprState to
be able to do so. The additional information I think is necessary is
a) is expression jit compiled
b-d) is scan/outer/inner tuple deforming necessary, and if so, JIT
compiled.

We can't unconditionally JIT compile for tuple deforming, because
there's a number of cases where the source slot doesn't have
precisely the same tuple desc, and/or doesn't have the same type.

2) Expand EXPLAIN output to show expressions that currently aren't
shown. Performance-wise the ones most critical that aren't currently
visible, and that I know about, are:
- Agg's combined transition function, we also currently don't display
in any understandable way how many passes over the input we do (for
grouping sets), nor how much memory is needed.
- Agg's hash comparator (separate regression referenced above)
- Hash/HashJoin's hashkeys/hjclauses

For 1) think we need to change show_expression()/show_qual() etc to also
pass down the corresponding ExprState if available (not available in
plenty of cases, most of which are not particularly important). That's
fairly mechanical.

Then we need to add information about JIT to individual expressions. In
the attached WIP patchset I've made that dependent on the new
"jit_details" EXPLAIN option. When specified, new per-expression
information is shown:
- JIT-Expr: whether the expression was JIT compiled (might e.g. not be
the case because no parent was provided)
- JIT-Deform-{Scan,Outer,Inner}: wether necessary, and whether JIT accelerated.

I don't like these names much, but ...

For the deform cases I chose to display
a) the function name if JIT compiled
b) "false" if the expression is JIT compiled, deforming is
necessary, but deforming is not JIT compiled (e.g. because the slot type
wasn't fixed)
c) "null" if not necessary, with that being omitted in text mode.

So e.g in json format this looks like:

"Filter": {
"Expr": "(lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone)",
"JIT-Expr": "evalexpr_0_2",
"JIT-Deform-Scan": "deform_0_3",
"JIT-Deform-Outer": null,
"JIT-Deform-Inner": null
}
and in text mode:

Filter: (lineitem.l_shipdate <= '1998-09-18 00:00:00'::timestamp without time zone); JIT-Expr: evalexpr_0_2, JIT-Deform-Scan: deform_0_3

For now I chose to make Filter a group when both, not in text mode and
jit_details on - otherwise it's unclear what the JIT fields would apply
to. But that's pretty crappy, because it means that the 'shape' of the
output depends on the jit_details option. I think if we were starting
from scratch it'd make sense to alway have the Expression as it's own
sub-node, so interpreting code doesn't have to know all the places an
expression can be referenced from. But it's probably not too attractive
to change that today?

Somewhat independently the series also contains a patch that renames
verbose mode's "Output" to project if the node projects. I find it
pretty hard to interpret whether a node projects otherwise, and it's
confusing when jit_details shows details only for some node's Output,
but not for others. But the compat break due to that change is not small
- perhaps we could instead mark that in another way?

For 2) I've only started to improve the situation, but it's a pretty
number of pretty crucial pieces.

I first focussed adding information for Agg nodes, as a) those are
typically performance sensitive in cases where JIT is beneficial b) the
current instrumentation is really insufficient, especially in cases
where multiple grouping sets are computed at the same time - I think
it's effectilvey not interpretable.

In verbose mode explain now shows per-phase output about the transition
computation. E.g. for a grouping set query that can't be computed in one
pass, it now displays something like

MixedAggregate (cost=6083420.07..14022888.98 rows=10011685 width=64)
Project: avg((l_linenumber)::bigint), count((l_partkey)::bigint), sum(l_quantity), l_linenumber, l_partkey, l_quantity
Filter: (sum(lineitem.l_quantity) IS NOT NULL)
Phase 2 using strategy "Sort":
Sort Key: lineitem.l_partkey, lineitem.l_quantity
Transition Function: 2 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 2 * int8inc_any(TRANS, (l_partkey)::bigint), 2 * float8pl(TRANS, l_quantity)
Sorted Group: lineitem.l_partkey, lineitem.l_quantity
Sorted Group: lineitem.l_partkey
Phase 1 using strategy "Sorted Input & All & Hash":
Transition Function: 6 * int8_avg_accum(TRANS, (l_linenumber)::bigint), 6 * int8inc_any(TRANS, (l_partkey)::bigint), 6 * float8pl(TRANS, l_quantity)
Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey, lineitem.l_quantity
Sorted Input Group: lineitem.l_linenumber, lineitem.l_partkey
Sorted Input Group: lineitem.l_linenumber
All Group
Hash Group: lineitem.l_quantity
Hash Group: lineitem.l_quantity, lineitem.l_linenumber
-> Sort (cost=6083420.07..6158418.50 rows=29999372 width=16)
...

The N * indicates how many of the same transition functions are computed
during that phase.

I'm not sure that 'TRANS' is the best placeholder for the transition
value here. Maybe $TRANS would be clearer?

For a parallel aggregate the upper level looks like:

Finalize HashAggregate (cost=610681.93..610682.02 rows=9 width=16)
Project: l_tax, sum(l_quantity)
Phase 0 using strategy "Hash":
Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity)))
Hash Group: lineitem.l_tax
-> Gather (cost=610677.11..610681.70 rows=45 width=16)
Output: l_tax, (PARTIAL sum(l_quantity))
Workers Planned: 5
-> Partial HashAggregate (cost=609677.11..609677.20 rows=9 width=16)
Project: l_tax, PARTIAL sum(l_quantity)

I've not done that yet, but I think it's way past time that we also add
memory usage information to Aggregate nodes (both for the hashtable(s),
and for internal sorts if those are performed for grouping sets). Which
would also be very hard in the "current" format, as there's no
representation of passes.

With jit_details enabled, we then can show information about the
aggregation function, and grouping functions:
Phase 0 using strategy "Hash":
Transition Function: float8pl(TRANS, (PARTIAL sum(l_quantity))); JIT-Expr: evalexpr_0_11, JIT-Deform-Outer: false
Hash Group: lineitem.l_tax; JIT-Expr: evalexpr_0_8, JIT-Deform-Outer: deform_0_10, JIT-Deform-Inner: deform_0_9

Currently the "new" format is used when either grouping sets are in use
(as the previous explain output was not particularly useful, and
information about the passes is important), or if VERBOSE or JIT_DETAILS
are specified.

For HashJoin/Hash I've added 'Outer Hash Key' and 'Hash Key' for each
key, but only in verbose mode. That's somewhat important because for
HashJoins those currently are often the performance critical bit,
because they'll commonly be the expressions that deform the slots from
below. That display is somewhat redundant with HashJoins "Hash Cond",
but they're evaluated separately. Under verbose that seems OK to me.

With jit_details enabled, this e.g. looks like this:

Hash Join (cost=271409.60..2326739.51 rows=30000584 width=250)
Project: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax,
Inner Unique: true
Hash Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey)); JIT-Expr: evalexpr_0_7, JIT-Deform-Outer: deform_0_9, JIT-Deform-Inner:
Outer Hash Key: lineitem.l_partkey; JIT-Expr: evalexpr_0_10, JIT-Deform-Outer: deform_0_11
Outer Hash Key: lineitem.l_suppkey; JIT-Expr: evalexpr_0_12, JIT-Deform-Outer: deform_0_13
-> Seq Scan on public.lineitem (cost=0.00..819684.84 rows=30000584 width=106)
Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l
-> Hash (cost=129384.24..129384.24 rows=3999824 width=144)
Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment
Hash Key: partsupp.ps_partkey; JIT-Expr: evalexpr_0_0, JIT-Deform-Outer: deform_0_1
Hash Key: partsupp.ps_suppkey; JIT-Expr: evalexpr_0_2, JIT-Deform-Outer: deform_0_3
-> Seq Scan on public.partsupp (cost=0.00..129384.24 rows=3999824 width=144)
Output: partsupp.ps_partkey, partsupp.ps_suppkey, partsupp.ps_availqty, partsupp.ps_supplycost, partsupp.ps_comment
JIT:
Functions: 14 (6 for expression evaluation, 8 for tuple deforming)
Options: Inlining true, Optimization true, Expressions true, Deforming true

this also highlights the sad fact that we currently use a separate
ExprState to compute each of the hash keys, and then "manually" invoke
the hash function itself. That's bad both for interpreted execution, as
we repeatedly pay executor startup overhead and don't even hit the
fastpath, as well as for JITed execution, because we have more code to
optimize (some of it pretty redundant, in particular the deforming). In
both cases we suffer from the problem that we deform the tuple
incrementally.

A later patch in the series then uses the new explain output to add some
tests for JIT, and then fixes two bugs, showing that the test output
changes.

Additionally I've also included a small improvement to the expression
evaluation logic, which also changes output in the JIT test, as it
should.

Comments?

Greetings,

Andres Freund

Attachment Content-Type Size
v1-0001-jit-Instrument-function-purpose-separately-add-tr.patch text/x-diff 5.4 KB
v1-0002-Refactor-explain.c-to-pass-ExprState-down-to-show.patch text/x-diff 14.3 KB
v1-0003-Explain-Differentiate-between-a-node-projecting-o.patch text/x-diff 101.9 KB
v1-0004-Add-EXPLAIN-option-jit_details-showing-per-expres.patch text/x-diff 10.9 KB
v1-0005-jit-explain-remove-backend-lifetime-module-count-.patch text/x-diff 6.9 KB
v1-0006-WIP-explain-Show-per-phase-information-about-aggr.patch text/x-diff 74.0 KB
v1-0007-WIP-explain-Output-hash-keys-in-verbose-mode.patch text/x-diff 22.0 KB
v1-0008-jit-Add-tests.patch text/x-diff 27.9 KB
v1-0009-Fix-determination-when-tuple-deforming-can-be-JIT.patch text/x-diff 6.5 KB
v1-0010-jit-Fix-pessimization-of-execGrouping.c-compariso.patch text/x-diff 4.0 KB
v1-0011-Reduce-code-duplication-for-ExecJust-Var-operatio.patch text/x-diff 6.9 KB
v1-0012-Don-t-generate-EEOP_-_FETCHSOME-operations-for-sl.patch text/x-diff 12.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2019-09-27 07:20:55 Re: Unstable select_parallel regression output in 12rc1
Previous Message Michael Paquier 2019-09-27 07:14:01 Re: Refactoring of connection with password prompt loop for frontends