Re: Inlining comparators as a performance optimisation

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2011-11-20 03:29:07
Message-ID: CAEYLb_W64XrbzOnQVeQ4BO8CjGjrRB2rToORTUgLDiAeDWdC1w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19 November 2011 02:55, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Maybe we should look at trying to isolate that a bit better.

Indeed. Fortunately, GCC has options to disable each optimisation.
Here's potentially relevant flags that we're already implicitly using
at -02:

-finline-small-functions <-- this is the one that inlined functions
without an "inline" keyword
-findirect-inlining
-finline-functions-called-once
-fearly-inlining
-fipa-sra (Perform interprocedural scalar replacement of aggregates,
removal of unused parameters and replacement of parameters passed by
reference by parameters passed by value).

In an effort to better isolate the effects of inlining, I tried this:

./configure CFLAGS="-fno-inline -fno-inline-small-functions" (I could
have disabled more -02 optimisations, but this proved sufficient to
make my point)

Unsurprisingly, this makes things slower than regular -02.

With the patch, the same query (once again, using explain analyze)
with the same fast path quicksort stabilises around 92ms +/- 5ms
(recall that the original figure was ~82ms). Our gains evaporate, and
then some. Take away the additional CLAGS, and we're predictably back
to big gains, with the query taking ~52ms just as before.

What happens to this query when we build an unmodified postgres with
these same CFLAGS? Well, we see the query take ~116ms after a few
runs. It seems that the impedance mismatch matters, but inlining and
other optimisations look to be at least as important. This isn't
surprising to me, given what I was able to do with the isolated test.
Maybe I should have tried it with additional disabling of
optimisations named above, but that would have perhaps made things
less clear.

I'd probably have been better off directly measuring qsort speed and
only passing those flags when compiling tuplesort.c (maybe impedance
mismatch issues would have proven to have been even less relevant),
but I wanted to do something that could be easily recreated, plus it's
late.

> It strikes me that we could probably create an API that would support
> doing either of these things depending on the wishes of the underlying
> datatype.  For example, imagine that we're sorting with <(int4, int4).
>  We associate a PGPROC-callable function with that operator that
> returns "internal", really a pointer to a struct.  The first element
> of the struct is a pointer to a comparison function that qsort() (or a
> tape sort) can invoke without a trampoline; the second is a wholesale
> replacement for qsort(); either or both can be NULL.  Given that, it
> seems to me that we could experiment with this pretty easily, and if
> it turns out that only one of them is worth doing, it's easy to drop
> one element out of the structure.
>
> Or do you have another plan for how to do this?

I haven't given it much thought. Let me get back to you on that next week.

> Have you done any benchmarks where this saves seconds or minutes,
> rather than milliseconds?  That would certainly make it more exciting,
> at least to me.

Right. Well, I thought I'd use pgbench to generate a large table in a
re-creatable way. That is:

pgbench -i -s 60

This puts pgbench_accounts at 769MB. Then, having increased work_mem
to 1GB (enough to qsort) and maintenance_work_mem to 756mb, I decided
to test this query with the patch:

explain analyze select * from pgbench_accounts order BY abalance;

This stabilised at ~3450ms, through repeatedly being executed.

How does this compare to unpatched postgres? Well, it stabilised at
about ~3780ms for the same query.

This patch is obviously less of a win as the number of tuples to sort
goes up. That's probably partly explained by the cost of everything
else going up at a greater rate than the number of comparisons. I
suspect that if we measure qsort in isolation, we'll see better
results, so we may still see a good win on index creation time as a
result of this work.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2011-11-20 04:13:36 Re: Inlining comparators as a performance optimisation
Previous Message Florian Pflug 2011-11-20 00:59:08 Re: Singleton range constructors versus functional coercion notation