Re: Inlining comparators as a performance optimisation

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2011-11-18 21:38:18
Message-ID: CAEYLb_WdiO+k0DAg9QTMXp_37ibYZLHyzOY0xA+Z73BN4X_spQ@mail.gmail.com
Lists: pgsql-hackers

On 18 November 2011 05:20, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think that we should really consider doing with this patch what Tom
> suggested upthread; namely, looking for a mechanism to allow
> individual datatypes to offer up a comparator function that doesn't
> require bouncing through FunctionCall2Coll().  It seems to me that
> duplicating the entire qsort() algorithm is a little iffy.

I understand that we highly value extensibility and genericity (yes,
that's a real word). We may not always be well served by that
tendency.

> Sure, in this case it works out to a win.  But it's only a small win -
> three-quarters of it is in the uncontroversial activity of reducing
> the impedance mismatch - and how do we know it will always be a win?

Firstly, a quarter of a quite large gain is still a pretty good gain.
Secondly, I probably didn't actually isolate the effects of inlining,
nor the overall benefit of the compiler knowing the comparator at
compile time - I just removed the inline keyword, and those are two
different things. The inline keyword is only a request to the
compiler to inline; the compiler can and often will ignore that
request. Most people know that. What isn't so widely known is that
modern compilers may equally well inline even when they haven't been
asked to (but only when they can). When you also consider, as I've
already pointed out several times, that inlining decisions are made
per call site, it becomes apparent that a certain amount of inlining
and/or other optimisations like procedural integration may still have
occurred at some call sites even without the encouragement of the
inline keyword - optimisations that would not have been performed
without the benefit of compile-time comparator knowledge. Adding the
inline keyword may just, in this particular case, have the compiler
inline even more call sites. Posting the ~8ms difference was
motivated by a desire to prove that inlining had *some* role to play,
without actually going to the trouble of implementing Tom's idea as a
basis of comparison, because Tom was very sceptical of inlining.

The long and the short of it is that I'm going to have to get my hands
dirty with a disassembler before we really know exactly what's
happening. That, or I could use an optimisation fence of some kind.

> Adding more copies of the same code can be an anti-optimization if it
> means that a smaller percentage of it fits in the instruction cache,
> and sometimes small changes in runtime are caused by random shifts in
> the layout of memory that align things more or less favorably across
> cache lines rather than by real effects.  Now it may well be that this
> is a real effect, but will it still look as good when we do this for
> 10 data types?  For 100 data types?

I'd favour limiting it to just the common integer and float types.

> In contrast, it seems to me that reducing the impedance mismatch is
> something that we could go and do across our entire code base, and
> every single data type would benefit from it.  It would also be
> potentially usable by other sorting algorithms, not just quick sort.

Suppose that we went ahead and added that infrastructure. What you
must acknowledge is that one reason this speed-up is so dramatic is
that the essential expense of a comparison is already so low - a
single instruction - so the overall per-comparison cost goes way
down, particularly if the qsort inner loop can fit its code in fewer
cache lines. For that reason, any absolute improvement you'll see
with complex datatypes will be smaller, maybe much smaller, because
each comparison will execute many more instructions that are
essential to the comparison itself. In my estimation, all of this
work does not point to there being an undue overhead in the function
calling convention, as you suggested. Still, I'm not opposed to
investigating generalising this in some way, reservations
notwithstanding, unless we have to block-wait on it. I don't want to
chase diminishing returns too far.

> Well, that's kind of my point.  I think this needs more work before we
> decide what the best approach is.

Agreed.

> So far, the ONLY test result we
> have that compares inlining to not inlining shows a speedup from 60 ms
> to 52 ms.  I think that an 8 ms speedup on one test with one datatype
> on one platform/compiler combination isn't sufficient evidence to
> conclude that this is the best possible approach.

Fair enough, but it's not the only test I did - I posted other numbers
for the same query when the table was 48MB, and we saw a proportional
improvement, consistent with a per-comparison win. I'm supposed to be
on leave for a few days at the moment, so I won't be very active this
weekend, but I'm rather curious as to where you or others would like
to see me go with benchmarks.

I should point out that we currently don't have much idea how big a
win applying these principles could be for index creation times - it
could be very significant. My profiling of index creation makes this
look promising.

On 18 November 2011 13:39, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It seems entirely possible that inlining any non-trivial comparator function could work
> out to a loss,

Compilers weigh the size of the function to be inlined heavily when
deciding, on a call-site-by-call-site basis, whether or not to
inline; much effort has been expended on making those heuristics work
well. Besides, you're the one advocating pursuing this for types with
non-trivial comparators. As already noted, this will certainly be
less effective - perhaps quite a lot less effective - for such types.

It's worth acknowledging that sorting of integers and floats is in
general very important, probably more important than any other sorting
case.

> or that the exact choice of compiler flags could affect
> whether inlining works out to a plus or a minus (what happens with -O3
> vs. -O2?  what about -O1?  what about -O2 -fno-omit-frame-pointer?
> what if they're using HP's aCC or Intel's icc rather than gcc?).

If you'd like to see numbers for another platform, I can get those.
How about Windows? I only have access to x86_64 boxes. I don't see
that the exact compiler flags used will influence whether or not
inlining is a win so much as whether or not it occurs at all. As for
-fno-omit-frame-pointer or something like it nullifying the benefits,
well, that's a bit fantastical. We're already at the mercy of the
compiler to do the right thing at this level of granularity; I really
just want to help/enable it.

> There's no point in installing an optimization that could easily be a
> pessimization on some other workload, and that hasn't been tested, or
> at least no results have been posted here.

Hmm. Well, I accept the burden of proof lies with me here. I have a
hard time believing that inlining/compile-time optimisation benefits
(benefits which, to repeat for emphasis, have not been
isolated/measured yet) could be entirely nullified by CPU caching
effects on any plausible workload. What does a benchmark that
alleviates your concerns here look like?

The way I understand inlining to interact with the CPU cache is that
it will increase the number of cache misses if it causes a cache line
to be crossed, but will decrease the number of cache misses when
locality of reference is improved such that fewer cache lines are
crossed within an inner loop. The overall cost of a sort is dominated
by the number of comparisons performed, so locality of reference is
particularly important here - in general, the comparator is called a
great many times.

What I'm having a hard time with is general scepticism about the
value of inlining, when the self-contained test case I posted showed
a qsort() with an inlined comparator radically out-performing my
system's qsort(), without any impedance mismatch to muddy the waters.
Sure, you can hold me to account and have me prove my claim, but
don't you think it's curious that that effect was observed there?

Incidentally, I'd like to call this fast path sorting, if it ends up
being committed in a form that is not significantly different to what
we have now.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
