From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Progress on fast path sorting, btree index creation time
Date: 2012-01-27 04:36:19
Message-ID: CAEYLb_WYFqt-j+rJV5kW28GRNKLKN6NX=F2=JmsZfrDyZjZ_GA@mail.gmail.com
Lists: pgsql-hackers

On 27 January 2012 03:32, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> But if we want to put it on a diet, the first thing I'd probably be
> inclined to lose is the float4 specialization.  Some members of the
> audience will recall that I take a dim view of floating point arithmetic
> generally, but I'll concede that there are valid reasons for using
> float8.  I have a harder time coming up with a good reason to use
> float4 - ever, for anything you care about.  So I would be inclined to
> think that if we want to trim this back a bit, maybe that's the one to
> let go.  If we want to be even more aggressive, the next thing I'd
> probably lose is the optimization of multiple sortkey cases, on the
> theory that single sort keys are probably by far the most common
> practical case.

Obviously I don't think that we should let anything go: the
improvement in performance is so large that we're bound to come out
ahead. We only really pay for what we use, and when we do use a
specialisation the difference is quite big, particularly when you
look at sorting in isolation. A specialisation that is never used is
more or less never paid for, so there's no point in worrying about
that case. That said, float4 is obviously the weakest link. I'm
inclined to think that float8 is the second weakest though, mostly
because we get both dates and timestamps "for free" with the integer
specialisations.
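
To spell out what I mean by "for free": with integer datetimes, a
timestamp is at bottom just a 64-bit integer (microseconds since the
Postgres epoch), and a date is a 32-bit day count, so the integer
comparators cover both with no extra code. Roughly (the typedefs here
are illustrative stand-ins, not the actual Postgres declarations):

/*
 * Illustrative only -- these typedefs stand in for the real Postgres
 * declarations. With integer datetimes, Timestamp is just an int64
 * and DateADT an int32, so the integer fast paths sort them as-is.
 */
#include <stdint.h>

typedef int64_t Timestamp;  /* microseconds since the Postgres epoch */
typedef int32_t DateADT;    /* days since the Postgres epoch */

static inline int
compare_int64(int64_t a, int64_t b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}

/* Timestamps need no comparator of their own: */
static inline int
compare_timestamp(Timestamp a, Timestamp b)
{
    return compare_int64(a, b);
}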

> I'm not surprised that you weren't able to measure a performance
> regression from the binary bloat.  Any such regression is bound to be
> very small and probably quite difficult to notice most of the time;
> it's really the cumulative effect of many binary-size-increasing
> changes we're worried about, not each individual one.  Certainly,
> trying to shrink the binary is micro-optimization at its finest, but
> then so is inlining comparators.  I don't think it can realistically
> be argued that increasing the size of the binary arbitrarily will
> never get us in trouble, much like (for a typical American family)
> spending $30 to have dinner at a cheap restaurant
> will never break the budget.  But if you do it every day, it gets
> expensive (and fattening).

Sure. At the risk of stating the obvious, and of repeating myself, I
will point out that the true cost of increasing the size of the
binary is not necessarily linear - it's a complex equation. I hope
this doesn't sound flippant, but if some naive person were to look at
just the binary size of Postgres and its performance in each
successive release, they might well conclude that the two were
positively correlated (because we haven't been adding flab to the
binary, but muscle that pulls its own weight and then some).

At the continued risk of stating the obvious, CPUs don't just cache
instructions - they cache data too. If sorting the data takes less
than half the time it used to, which is the level of improvement I
was able to demonstrate against pre-SortSupport Postgres, that will
surely very often have the aggregate effect of reducing cache
contention between cores.
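
To make the "inlining comparators" point concrete: the generic qsort
path calls the comparator through a function pointer on every
comparison, which the compiler cannot inline, whereas a specialisation
bakes the comparison into the sort loop itself. In miniature (insertion
sort for brevity - this is just the shape of the optimisation, not the
tuplesort code):

#include <stddef.h>

/* Generic path: one indirect call per comparison; no inlining possible. */
static void
sort_generic(int *a, size_t n, int (*cmp)(int, int))
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && cmp(a[j - 1], a[j]) > 0; j--)
        {
            int tmp = a[j];
            a[j] = a[j - 1];
            a[j - 1] = tmp;
        }
}

/* Specialised path: the comparison is inlined into the loop itself. */
static void
sort_int_specialised(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = i; j > 0 && a[j - 1] > a[j]; j--)
        {
            int tmp = a[j];
            a[j] = a[j - 1];
            a[j - 1] = tmp;
        }
}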

>> just a few instructions, as with float-based timestamps (I don't care
>> enough about them to provide one in core, though). It would also
>> essentially allow for user-defined sort functions, provided they
>> fulfilled a basic interface. They may not even have to be
>> comparison-based. I know that I expressed scepticism about the weird
>> and wonderful ideas that some people have put forward in that area,
>> but that's mostly because I don't think that GPU based sorting in a
>> database is going to be practical.
>
> A question for another day.

Fair enough.
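
For what it's worth, coming back to the "basic interface" I mentioned:
what I have in mind is little more than a struct with a comparator
hook. Something along these lines (a sketch only - the names are
hypothetical, not a committed API):

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;  /* stand-in for the real Datum typedef */

/*
 * Hypothetical sketch of a sort-support struct. A type opts in to the
 * fast path by filling in the comparator (and any per-sort state);
 * non-comparison-based strategies would hang further hooks off the
 * same struct.
 */
typedef struct SortSupportData
{
    void   *ssup_extra;    /* per-sort scratch space for the type */
    bool    ssup_reverse;  /* descending order? */
    int   (*comparator)(Datum x, Datum y, struct SortSupportData *ssup);
} SortSupportData;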

>> I certainly don't care about this capability enough to defend it
>> against any objections that anyone may have, especially at this late
>> stage in the cycle. I just think that we might as well have it.
>
> I don't see any reason not to, assuming it's not a lot of code.

Good.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
