Improving executor performance

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Improving executor performance
Date: 2016-06-24 23:29:53
Message-ID: 20160624232953.beub22r6yqux4gcp@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

at pgcon, in [1], and in various other threads and conferences we have
talked about our executor performance needing improvements to perform
well in non-OLTP workloads, and about how we can get there.

I've played with various techniques, on and off, over the last few
weeks. These include trying to make slot_deform_tuple() et al. faster,
batched execution for a few node types (SeqScan, hashed Agg), avoiding
linked lists in executor data structures, and smaller micro-optimizations.

As motivation, here's a somewhat juicy example of the benefits that
can be gained (parallelism disabled, since results vary too much otherwise):
SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
non-optimized: 9075.835 ms
optimized: 5194.024 ms

and that's far from doing all the possible work for that query; the
batching in particular is far from optimal.

So far I've mostly looked at performance for TPC-H, not because it's a
particularly good workload, but because it's readily available. If
somebody has good suggestions for an analytics-heavy workload...

My observations about the performance bottlenecks I found, and partially
addressed, are in rough order of importance (there's interdependence
between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make
a huge difference. Without addressing that, many other bottlenecks
don't matter all that much. I see *significant* performance
improvements for large seqscans (when the results are used) simply
by memcpy()'ing the current target block.

This is partially an intrinsic problem of analyzing a lot of data,
and partially a consequence of our access patterns being bad.

2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
queries I looked at. There are various places we trigger deforming,
most ending up in slot_deform_tuple(), heap_getattr(), or
heap_deform_tuple().

This can be optimized significantly, but even after that it's still a
major issue.

Part of the problem is that we sometimes don't know how many elements
to access (especially in index- and hashing-related code); another
part is that we deform a lot of columns that we'll never need,
just because we need a subsequent one.

A further part is that our tuple format requires a number of
relatively expensive operations to access data (e.g. alignment
computations, checking the null bitmap).

3) Our 1-by-1 tuple flow in the executor has two major issues:

The biggest is that we perform a *lot* of repetitive work for each
processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(),
for each single tuple we execute ExecProcNode(), ExecSeqScan(),
heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and
many more. Each of these has a noticeable per-call state setup.
When executing on batches of tuples, a lot of that setup work can be
done once per batch, instead of once per tuple. Part of that the
compiler can do automatically, part of that has to be done by hand.

Also very significantly, due to the amount of code executed, there's
barely any working branch prediction, leading to massive amounts of
pipeline stalls and instruction cache misses.

There's also the fact that some future optimizations around using
SIMD are easier when looking at batches of tuples, but I'm not
planning to work on that.

4) Various missing micro-optimizations have to be performed before the
more architectural issues become visible. E.g. [2] causes such bad
slowdowns in hash-agg workloads that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar
issues itself. Using a per-block heap_getmany() that fills a whole
batch at once is noticeably faster (and more convenient in a batched
world anyway).

6) The use of linked lists adds noticeable instruction-count overhead
and branch misses. Just replacing {FuncExprState,BoolExprState}->args
with an array gives a ~10%ish boost for queries with a bunch of
quals. Another example that appears to hurt noticeably, but which
I've not experimentally fixed, is AggState->hash_needed.

To a large degree these seem fairly easily fixable; it's kinda boring
work, but not really hard.

I'm planning to start individual subthreads for some of these, with
in-detail discussions and/or patches. But I wanted to present this on a
higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant
improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries
you'll need issue n) addressed before n+2) shows a benefit; for others
it's the other way around. Given the size of the overall task, the
only way I can see this being reasonably addressable is trying to get
smaller steps in individually, even if they're not a huge win on their
own, and then focusing on the larger architectural changes (primarily
3)).

Comments so far? Other suggestions on how to get improvements in this
area merged?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de
