Re: Progress on fast path sorting, btree index creation time

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Progress on fast path sorting, btree index creation time
Date: 2012-01-19 18:47:10
Message-ID: CAEYLb_UpgkSDs+tNqp_TOmZR04TxpEpqGGupt5E3333D6m53Ag@mail.gmail.com
Lists: pgsql-hackers

On 6 January 2012 21:47, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Admittedly, I don't have any numbers quantifying just how useful that
>> might be, but on the other hand you've not presented any evidence
>> justifying removing the behavior either.  If we believe your position
>> that indexes don't generally have lots of duplicates, then the code in
>> question will seldom be reached and therefore there would be no
>> performance benefit in removing it.

I have decided on a tactical retreat in relation to this patch. This
has been dragging on since September, and I cannot risk the patch not
being accepted for 9.2 because I tried to handle both heap and btree
tuple sorting at once - the additional, relatively modest improvements
that it would bring to btree index creation time in common cases do
not warrant digging my heels in to cover that case in one larger
commit. For that reason, I attach for your consideration a revision of
the patch without any support for btree index tuples whatsoever
(though I have adjusted btree tuplesort comments, and one tiny piece
of code, in a non-controversial way). I'll revisit btree sorting
towards the end of the commitfest, circumstances permitting.

As to the question of binary bloat, I have devised a pgbench-tools
workload that I think will go some way towards reassuring those who
are concerned about its distributed costs. The attached spreadsheet
has all the relevant details.

A custom script has been specified (the details of which are evident
from the benchmark results themselves). If there is any noticeable
marginalisation of usefully cached instructions, that's obviously
where it will show up.

Note that I have taken the preparatory step of updating some tables
with random data rather than the sequential data that they have by
default; sequential data, unusually for such large tables, allows our
pre-sorted input check optimisation to kick in, spoiling things. I ran
vacuumdb immediately afterwards, immediately before the pgbench run in
each case.

I've kept the test at 8 clients/8 threads, on a machine with 4
physical cores + hyperthreading for 8 logical ("Intel Core(TM) i7 CPU
870 @ 2.93GHz", with 4 x 32 KB instruction caches, among several
other CPU caches) on a newly-initialised, throw-away database, single
run at scale 50 for 15 minutes in all cases. This is the same server
that I have used throughout.

My analysis is that although there may be a very slight regression in
non-affected queries, it is more than compensated for by the benefit
of the specialisations. It's a tough call to make: how queries happen
to coincide undoubtedly muddies the waters, as does the fact that we
cut a lot of time from periods in which backends hold a lot of locally
allocated sort memory. It might actually be a win for those other
queries sometimes but not others, and the margin either way appears to
be very small. The CPU cache is doing its job as expected, and we
clearly see a net benefit from its preference for caching more of the
specialised instructions - those sort operations are far more
expensive (if they weren't, it wouldn't cache them so heavily). I also
include results for running the same query again and again with a
single client, to put that effect in context (though note that
previous benchmarks avoided paying a high client overhead by using
explain analyze, and indeed originally compared against pre-SortSupport
Postgres, so those numbers aren't directly comparable either). It's
unusual for a database workload to be so heavily CPU bound, so I'd
suggest that the latter benchmark is more representative than the
former. Either way, we win by some margin. If the queries didn't have
such a high client overhead, as for example with a sort node that
feeds a merge join, we'd do better still.

If a sorting specialisation is never used, its overhead is, for
practical purposes, zero. I believe that sorting specialisations are
just too useful not to be a net win in almost all reasonable cases.
Why haven't I used all specialisations at once, rather than only two
(one for a single int4 key, the other for multiple keys)? Well, I
might have used all of them, but there were no floats available in the
pgbench tables, and besides, the chances of all of them being
simultaneously in play during any period short enough for
marginalisation of CPU cache contents to be of particular concern are,
in general, not all that great.

C++, a major systems programming language, produces multiple
specialisations of its standard library's sort function as it is used,
each of which has a separate entry in the procedure linkage table
(though various implementation-defined techniques are frequently used
to reduce the number of copies across translation units at link time).
If the programmer specifies either a different datatype or a different
comparator (through the use of an alternative to the default
std::less<T> functor/predicate), a whole new specialisation is
generated by the compiler. This does raise concerns in relation to
binary bloat, but they're reasonably well understood, and this is the
default way of doing things for general-purpose application
development. Now, I know that Postgres isn't written in C++, and you
might not find "C++ does it" to be a particularly compelling argument,
but it does go to show that these ideas are not by any stretch of the
imagination radical.

I remind everyone that the improvement seen in raw qsort_arg runtime
is fairly large, as it would have to be in order to bring such large
(though proportionally smaller) improvements to each affected query as
a whole - there are of course many other sources of overhead involved
in parsing, planning and executing affected queries.

Here is a complete detailing of the specialisations that this latest
revision produces:

int4, single key (supports date too)
int8, multiple keys (supports timestamps too, with and without TZ,
where we HAVE_INT64_TIMESTAMP)
float4, single key
float8, multiple keys
Type-generic single key specialisation. Expected to deliver some of
the same additional benefits for types that do not merit their own
specialisations but would still benefit, like name, int2, etc.; it is
used whenever applicable, even for types like text for which no
appreciable benefit is expected.

Thoughts?
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
benchmark_fastpath_sort.ods application/vnd.oasis.opendocument.spreadsheet 11.0 KB
fastpath_sort_2012_01_19.patch text/x-patch 40.5 KB
