Re: Inlining comparators as a performance optimisation

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)mit(dot)edu>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2011-09-26 02:12:55
Message-ID: CAEYLb_WAmQsZU-g2auy0oWWCj2GzoxqUz4EmiUQRdS8ewh9LTA@mail.gmail.com
Lists: pgsql-hackers

I've produced something much neater than my first patch, attached,
although I still consider this to be at the POC stage, not least
because I'm not exactly sure how I should be selecting the right
specialisation in tuplesort_performsort() (right now a hack does a
direct pg_proc OID comparison), and because I haven't yet implemented
anything other than qsort for heap tuples (a minor detail, I suppose).
I'm pleased to say that a much clearer picture of what's really going
on here has emerged.
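
For the avoidance of doubt, the dispatch hack currently looks roughly
like this. This is a simplified sketch rather than an excerpt from the
attached patch: the scanKeys/fn_oid access assumes the heap-tuple
case, and qsort_arg_int4 is just an illustrative name for an int4
specialisation.

/*
 * Sketch only (this would live in tuplesort.c): if the first sort key's
 * comparison proc is the well-known int4 btree comparator, use a
 * specialised copy of qsort_arg whose comparator the compiler can
 * inline; otherwise fall back to the generic qsort_arg, which calls the
 * comparator through a function pointer.
 */
static void
tuplesort_quicksort_dispatch(Tuplesortstate *state)
{
    Oid         cmpoid = state->scanKeys[0].sk_func.fn_oid;

    if (cmpoid == F_BTINT4CMP)
        qsort_arg_int4(state->memtuples, state->memtupcount,   /* hypothetical */
                       (void *) state);
    else
        qsort_arg((void *) state->memtuples, state->memtupcount,
                  sizeof(SortTuple),
                  (qsort_arg_comparator) state->comparetup,
                  (void *) state);
}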

Statistics: total runtime according to EXPLAIN ANALYZE for the query
"select * from orderlines order by prod_id" (dellstore2 sample db), at
GCC 4.5's -O2 optimisation level, after warming the cache, on my
desktop:

Without the patch:

~82ms

With the patch, but with the "inline" keyword commented out for all
new functions/meta-functions:

~60ms

With the patch, unmodified:

~52ms

Recent experience suggests that using a macro rather than an inline
function would perhaps have been a mistake, as it would prevent us
from benefiting from the compiler's smarts in surmising just where we
should inline. As I've pointed out, individual call sites are inlined,
not individual functions.
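
To illustrate the distinction (toy code, not from the patch, and
assuming the usual postgres.h typedefs):

/* A macro comparator is expanded textually at every use, whether or
 * not that's a win: */
#define INT4_CMP(a, b)  (((a) < (b)) ? -1 : (((a) > (b)) ? 1 : 0))

/* An inline function merely permits inlining; GCC still weighs each
 * call site on its own merits, and we keep type checking and the
 * option of taking the function's address: */
static inline int
int4_cmp(int32 a, int32 b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}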

Tom wanted to know if I could prove the benefit in inlining, as
against just resolving the impedance mismatch without bothering with
inlining (and, indeed, further optimisations that the compiler can
avail of as a result of knowing the comparator at compile-time). These
figures support my contention that these optimisations have an
important role to play. However, there's no question that resolving
the impedance mismatch is where the biggest benefit is seen,
particularly relative to maintainability. As Tom anticipated, the
question of whether or not we do the inlining is less than entirely
straightforward, as it's less of a win, and it has a higher price in
ugliness. I myself am very much of the opinion that it's still worth
it. As I've said, sorting integers and floats is very common and very
important, and this approach will likely yield benefits beyond
increasing the speed of executor sort nodes, as described below.
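
To spell out what I mean by the impedance mismatch, this is
approximately what every comparison pays today once it reaches the
comparator, simplified to omit NULL and DESC handling; the FmgrInfo is
the sort key's comparison proc:

/* Status quo, roughly: each comparison goes through the fmgr
 * trampoline, an indirect call that the compiler cannot see through: */
static int
compare_via_fmgr(Datum a, Datum b, FmgrInfo *cmp_finfo)
{
    return DatumGetInt32(FunctionCall2(cmp_finfo, a, b));
}

Replacing that per-comparison indirect call with something the
compiler can actually see, like int4_cmp above, is the
impedance-mismatch fix; asking for inlining on top of that is the
second, smaller win.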

I accept that a more sophisticated benchmark is required here, but
I've been a little busy actually writing the patch. Any ideas, anyone?
Independent benchmarks are welcome. If someone could suggest a worst
case, that would be particularly useful, as I cannot think of one - I
believe that inlining a given call site is more-or-less either a net
benefit or it isn't, regardless of the number of comparisons
performed, which is why the improvement is so predictable across
different sizes of integer sets sorted by qsort in isolation (and
which is the big reason why that ~8ms decrease due to inlining turns
out to be not too shabby - it's a saving per comparison, and those add
up pretty quickly). So, for example, when I build the patch on my
laptop (GCC 4.6) against a 48MB table (the same orderlines query, but
with the data doubled up a few times beforehand), we see an
undiminished decrease in runtime, proportional (I think) to the prior,
isolated cost of qsorting:

Without patch:

~615ms

With patch:

~415ms

One key insight from this work is that resolving the impedance
mismatch - making comparisons as inexpensive as possible (including
through inlining) - is the best available strategy for improving sort
performance today, vastly more efficacious than, for example, tweaking
the sorting algorithm. If anyone can find more ways of shaving cycles
there, it's one place where it really matters.

The changes are isolated to the extent that if, one day, you decide
you don't like this, you can simply remove the calls to the qsort_arg
specialisations and switch back to what the patch turns into a
fallback, qsort_arg itself. I know we'd never really get into that
situation, but the fact that we could serves to illustrate how
self-contained these changes are.

Incidentally, if you find writing code that is heavily dependent on
macros to be a chore, I can highly recommend Clang 2.9.

But what of the maintenance burden of mostly duplicating qsort_arg.c,
to produce a "template" version? Well, take a look at the *entire*
history for that file:

[peter(at)localhost port]$ git log qsort_arg.c
commit 9f2e211386931f7aee48ffbc2fcaef1632d8329f
Author: Magnus Hagander <magnus(at)hagander(dot)net>
Date: Mon Sep 20 22:08:53 2010 +0200

Remove cvs keywords from all files.

commit b9954fbb4ef25fb1ea173d26017d4d128dd15be5
Author: Neil Conway <neilc(at)samurai(dot)com>
Date: Sun Mar 18 05:36:50 2007 +0000

Code cleanup for function prototypes: change two K&R-style prototypes
to ANSI-style, and change "()" -> "(void)". Patch from Stefan Huehner.

commit b38900c7677657a815e75781b776fb1e41054df3
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Thu Oct 12 15:04:55 2006 +0000

Use Min() instead of min() in qsort, for consistency and to avoid
redefined-macro warnings on some platforms. Per gripe from Hiroshi Saito.

commit f99a569a2ee3763b4ae174e81250c95ca0fdcbb6
Author: Bruce Momjian <bruce(at)momjian(dot)us>
Date: Wed Oct 4 00:30:14 2006 +0000

pgindent run for 8.2.

commit 6edd2b4a91bda90b7f0290203bf5c88a8a8504db
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Tue Oct 3 22:18:23 2006 +0000

Switch over to using our own qsort() all the time, as has been proposed
repeatedly. ***SNIP MESSAGE***

I think that it is fair to say that the maintenance burden imposed by
this change is well worth it. Only one of these changes after the
initial commit is not mechanical, and that is still pretty trivial.
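
To be concrete about what "template" means here, the general shape is
something like the following. This is a hedged sketch only: a trivial
insertion sort stands in for the real qsort_arg body, and the names
are invented for illustration.

#define DECLARE_SPECIALISED_SORT(SUFFIX, CMP) \
static void \
sort_tuples_##SUFFIX(SortTuple *a, size_t n, void *arg) \
{ \
    size_t      i, \
                j; \
    SortTuple   tmp; \
\
    /* insertion sort stands in for the real qsort_arg body */ \
    for (i = 1; i < n; i++) \
    { \
        tmp = a[i]; \
        for (j = i; j > 0 && CMP(&a[j - 1], &tmp, arg) > 0; j--) \
            a[j] = a[j - 1]; \
        a[j] = tmp; \
    } \
}

/* One line per specialisation; CMP is a comparator the compiler can
 * see, and therefore inline, at every call site in the expanded body: */
DECLARE_SPECIALISED_SORT(int4, compare_int4_sorttuple)

With the comparator supplied as a macro argument like this, each
specialisation's body is generated by the preprocessor; the
hand-maintained part is the single "template" copy, whose upstream,
qsort_arg.c, has barely changed since 2006.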

I've attached gprof flat profile output from unmodified PostgreSQL (at
master) when we create an index on a pgbench table:

pgbench -i -s 1500 index_test

That puts the number of rows in the pgbench_accounts table at 150
million, while the table is 19 GB in size. That’s a reasonable size,
and should usefully demonstrate the likely improvement we’ll see in
the performance of creating an index.

I increased maintenance_work_mem to 756MB, and otherwise used a fairly
straightforward postgresql.conf with generic values for the other
GUCs. The query is:

create INDEX on pgbench_accounts(abalance); -- abalance is of type integer

Here is the top of the flat profile, by far the most important part:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds       calls  s/call  s/call  name
 43.56     79.29    79.29  6972025063    0.00    0.00  comparetup_index_btree
 22.39    120.04    40.75   150000000    0.00    0.00  tuplesort_heap_siftup
  4.36    127.97     7.93  2677057771    0.00    0.00  btint4cmp
  2.88    133.21     5.24   314766350    0.00    0.00  LWLockRelease
  1.81    136.50     3.29   314612660    0.00    0.00  LWLockAcquire
  1.65    139.50     3.00   450905247    0.00    0.00  AllocSetAlloc
  1.43    142.10     2.60   300000001    0.00    0.00  LogicalTapeWrite
  1.17    144.23     2.13                              comparetup_cluster
*** SNIP ***

This looks pretty top-heavy to me, and more time is spent in
comparetup_index_btree than any other function by some margin,
suggesting that it will be worthwhile to pursue similar optimisations
to make index creation faster in a later revision of this patch, or as
another patch. I didn't take care to ensure that I'd be caching the
entire table in memory, which probably would have been more useful
here.

Thoughts?

On the subject of highly ambitious optimisations to sorting, one
possibility I consider much more practicable than GPU-accelerated
sorting is simple threading; quicksort parallelises very effectively,
due to its divide-and-conquer nature. If we could agree on a threading
abstraction more sophisticated than forking, it's something I'd be
interested in looking at. Doing so would obviously entail lots of
discussion about how it relates to whatever way we eventually decide
to implement parallel query, and that's a difficult discussion in its
own right.
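
To make the divide-and-conquer point concrete, here is a toy
illustration of the idea using raw pthreads and plain ints. It is
emphatically not a proposal for backend code, which would need
whatever worker abstraction eventually falls out of the parallel query
discussion:

#include <pthread.h>
#include <stddef.h>

typedef struct
{
    int        *a;
    size_t      n;
    int         depth;
} SortTask;

static void parallel_qsort(int *a, size_t n, int depth);

static void *
sort_worker(void *arg)
{
    SortTask   *t = (SortTask *) arg;

    parallel_qsort(t->a, t->n, t->depth);
    return NULL;
}

static void
parallel_qsort(int *a, size_t n, int depth)
{
    int         pivot,
                tmp;
    size_t      i = 0,
                j;

    if (n < 2)
        return;

    /* simple Lomuto partition around the last element */
    pivot = a[n - 1];
    for (j = 0; j + 1 < n; j++)
    {
        if (a[j] < pivot)
        {
            tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
            i++;
        }
    }
    tmp = a[i];
    a[i] = a[n - 1];
    a[n - 1] = tmp;

    if (depth > 0)
    {
        /* hand the left partition to a worker; sort the right one here */
        pthread_t   worker;
        SortTask    task = {a, i, depth - 1};

        pthread_create(&worker, NULL, sort_worker, &task);
        parallel_qsort(a + i + 1, n - i - 1, depth - 1);
        pthread_join(worker, NULL);
    }
    else
    {
        /* leaves of the recursion sort serially */
        parallel_qsort(a, i, 0);
        parallel_qsort(a + i + 1, n - i - 1, 0);
    }
}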

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment: createindex.out.gz (application/x-gzip, 216.4 KB)
Attachment: 0001-Added-various-optimizations-to-tuplesort-quicksort.patch (text/x-patch, 10.5 KB)
