Re: Inlining comparators as a performance optimisation

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Inlining comparators as a performance optimisation
Date: 2011-11-26 18:58:00
Message-ID: CAEYLb_V=qkEEm+szFMMV6T69C7NPw_RF_Jq9EjiT=esjcghOzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

3 items are attached:

1. A spreadsheet, results.ods, which has results for automated
multiple runs of the "doubled up" dellstore orderlines queries (where
orderlines is 385 MB). Results are sorted, with the median (or the
lower middle value, didn't get the mean of the two middle runs) result
for each run highlighted as representative. There are 3 builds of
Postgres (HEAD, not inlined, optimized), each on its own sheet in the
spreadsheet. The cache was warmed for each query, and subsequently the
query was run 20 times.

2. A python script I hacked together that should run on anything
around Python 2.5+, with a single dependency, psycopg2. Look at --help
to see how it works. This is the script that actually generated the
figures that went into the spreadsheet, by being run once for each
type of build. You can fairly easily play along at home with this, for
these and other queries. It will spit out CSV files. It is designed to
warm the cache (by default 3 times before each query) to eliminate
caching effects. You can add your own query to the python list to have
it run by the script, to generate the same set of figures for that
query. I'm rather curious as to how much of an impact this
optimisation will have on queries with unique nodes, joins and
grouping when they rely on a sort node for their input, in the real
world. Certainly, a query need not have an ORDER BY clause to see
benefits, perhaps substantial benefits. The logic to parse sort node
details is rather unsophisticated right now though, due to my focus on
the obvious ORDER BY case.

3. The revision of the patch that was actually tested, now with
inlining specialisations for the single scanKey case, and a
non-inlined specialisation for multiple scanKeys where we can still
benefit from eliding the SQL function call machinery for the first
scanKey, which is often almost as useful as being able to do so for
all scanKeys. It also selects a sorting specialisation when it can in
tuplesort_begin_heap, per Robert's suggestion.

Reviewers will want to comment out line 731 of tuplesort.c, "#define
optimize", to quickly get unoptimized behaviour for comparative
purposes. Furthermore, if you'd like to see what this would look like
without inlining, you can simply comment out assignments of inline
variants of specialisations (i.e. int4inlqsort_arg and
int8inlqsort_arg) in tuplesort_begin_heap.

It has been suggested that I'm chasing diminishing returns by
inlining, as I go further for a smaller benefit. Of course, that's
true. However, I believe that I'm not chasing them past the point
where that ceases to make sense, and these figures support that
contention - chasing diminishing returns in the nature of this kind of
work. Here, the additional benefit of inlining accounts for over an
entire second shaved off a query that was originally 7056ms, so that's
not to be sniffed at.

I'll reiterate that qsort_arg has only been modified 4 times after its
initial commit in 2006, and these were all trivial changes. While the
way that I ape generic programming with the preprocessor is on the
ugly side, what I've done can be much more easily understood with
reference to qsort_arg itself. Robert said that duplicating the sort
function was "iffy". However, that already happened long ago, as we're
already maintaining both qsort_arg.c and qsort.c, and comments already
urge maintainers to keep the two consistent. That's not been a problem
though, because, as I've said, there has never been any call to make
substantive changes to either in all those years.

We could probably further reduce the code footprint of all of this by
having template_qsort_arg.h generate comparators directly. That might
be inflexible in a way that turns out to matter though, like if we
wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to
add NaN crud.

My next step is to see how this goes with hundreds of sorts in the
hundreds of megabytes on a high-end server. I don't have immediate
access to one, but I'm working that out.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
0001-Initial-commit-of-optimization.patch text/x-patch 19.6 KB
results.ods application/vnd.oasis.opendocument.spreadsheet 11.7 KB
tuplesort_test.py text/x-python 3.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2011-11-26 19:20:58 Re: Command Triggers
Previous Message Tom Lane 2011-11-26 18:56:27 Re: Avoiding repeated snapshot computation