Re: Inlining comparators as a performance optimisation

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2011-11-25 01:05:15
Message-ID: CAEYLb_WAtXZ67BaPryQ6n_tRVf+2=HTU1dRkYK2DMdBo7n=1Fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23 November 2011 19:24, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Well, right now the decision as to which mechanism should be used here
> gets made in tuplesort_performsort(), which has no good way of
> communicating with EXPLAIN anyway.

You could pretty easily add something to Tuplesortstate to accomplish
this. That isn't an endorsement of doing so, but I'm not sure that it
isn't appropriate.

> Actually, I think that's a
> modularity violation; using the address of comparetup_heap as a flag
> value seems quite ugly. How about moving that logic up to
> tuplesort_begin_heap()

I'll post a patch soon that does just that in the next day or two.
Tuplesortstate has a pointer to a sort specialisation.

> and having it set some state inside the
> Tuplesort, maybe based on a flag in the opclass (or would it have to
> attach to the individual operator)?

I'm not sure that there's much point in such a flag.

> At least on my machine, your latest patch reliably crashes the
> regression tests in multiple places.

> TRAP: FailedAssertion("!(state->nKeys == 1)", File: "tuplesort.c", Line: 1261);

Yes, sorry about that. Should have been discriminating against nKeys > 1 cases.

As of this evening, for sorts with multiple scankeys, I'm using
optimisations for the first scankey but not subsequent scankeys, which
is frequently almost as good as having optimisations for all scanKeys.

> The formatting of src/include/utils/template_qsort_arg.h is hard to
> read.  At ts=8, the backslashes line up, but the code doesn't fit in
> 80 columns.  If you set ts=4, then it fits in 80 columns, but the
> backslashes don't line up any more, and the variable declarations
> don't either.  I believe ts=4 is project standard.

Fair enough. My working copy and .vimrc have been updated.

> I still think it would be a good idea to provide a mechanism to
> override heap_comparetup() with a type-specific function.  I don't
> think that would take much extra code, and then any data type could
> get at least that much benefit out of this.

> It seems like it could be a good idea to do some
> per-assembler-instruction profiling of this code, and perhaps also of
> the original code.  I'm curious where the time is being spent.

How would you go about doing that? The instrumentation that profilers
use actually caused a big drop in performance here when I attempted it
a few weeks ago. There's a kind of Heisenberg effect.

This optimisation *more than doubles* raw sort performance for the
cases. There is nothing contrived or cherry picked about the query
that I selected to represent this optimisation - it was literally the
first one that I selected.

Sometimes, I see even a markedly better gain than a doubling of raw
sort performance - I think my earlier experiments that indicated a
much smaller improvement past a certain point may have been
methodologically flawed. Sorry about that.

If I double-up the data in the orderlines table a few times, until it
reaches 385 MB (duplicate ever tuple with an insert into ...select ),
then warm the cache, I get very interesting results. Here, we see a
few runs of the same old query unoptimised (note that I've excluded
some cold-cache runs before these runs):

Before optimisation
==============
Total runtime: 7785.473 ms - 3.517310 secs just sorting
Total runtime: 8203.533 ms - 3.577193 secs just sorting
Total runtime: 8559.743 ms - 3.892719 secs just sorting

Total runtime: 9032.564 ms - 3.844746 secs just sorting

Total runtime: 9637.179 ms - 4.434431 secs just sorting
Total runtime: 9647.215 ms - 4.440560 secs just sorting
Total runtime: 9669.701 ms - 4.448572 secs just sorting

After optimisation
==============
Total runtime: 5462.419 ms - 1.169963 secs just sorting
Total runtime: 5510.660 ms - 1.234393 secs just sorting
Total runtime: 5511.703 ms - 1.208377 secs just sorting

Total runtime: 5588.604 ms - 1.175536 secs just sorting

Total runtime: 5899.496 ms - 1.250403 secs just sorting
Total runtime: 6023.132 ms - 1.338760 secs just sorting
Total runtime: 6717.177 ms - 1.486602 secs just sorting

This is a 800109kB sort.

So, taking the median value as representative here, that looks to be
just shy of a 40% improvement, or 3.4 seconds. My /proc/cpuinfo is
attached on the off chance that someone is interested in that. More
work is needed here, but this seems promising.

It will be interesting to see how far all of this can be taken with
comparetup_index_btree. Certainly, I'm sure there's some gain to be
had there by applying lessons learned from comparetup_heap.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
cpuinfo.txt text/plain 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-11-25 01:44:19 Re: proposal : backend startup hook / after logon trigger
Previous Message Oliver Jowett 2011-11-25 00:20:02 Re: [JDBC] Optimize postgres protocol for fixed size arrays