Re: Re: Abbreviated keys for Datum tuplesort

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Abbreviated keys for Datum tuplesort
Date: 2015-04-02 21:34:39
Message-ID: CAM3SWZTY9fmrn9q8uMsqn9bnedtjk68FKitoMhHSXY2kOyv9xA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 2, 2015 at 8:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Feb 20, 2015 at 3:57 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I've spent a fair amount of testing this today, and when using the
>> simple percentile_disc example mentioned above, I see this pattern:
>>
>> master patched speedup
>> ---------------------------------------------------------
>> generate_series(1,1000000) 4.2 0.7 6
>> generate_series(1,2000000) 9.2 9.8 0.93
>> generate_series(1,3000000) 14.5 15.3 0.95
>>
>>
>> so for a small dataset the speedup is very nice, but for larger sets
>> there's ~5% slowdown. Is this expected?

I think it's explained by the pre-check for sorted input making the
number of comparisons exactly n -1. As I pointed out to Tomas, if you
put a single, solitary unsorted element at the end, the abbreviated
version is then 8x faster (maybe that was in relation to a slightly
different case, but the pattern is the same). So this case isn't an
argument against datum abbreviation, or even abbreviation in general,
but rather an argument against using strxfrm() in general (which for
example the GCC docs strongly recommend for sorting lists of strings).
It's a bad argument, IMV. This sort is already extremely fast.

BTW, Corey Huinker (who I believe you also spoke to during pgConf.US)
reported that his problematic CREATE INDEX case was 12x faster with
9.5. That used tapesort. That was also perfectly pre-sorted, but since
it was using tapesort and abbreviated there was a huge improvement.

> I had a look at this patch today with a view to committing it, but it
> seems that nobody's commented on this point, which seems like an
> important one. Any thoughts?
>
> For what it's worth, and without wishing to provoke another flamewar,
> I am inclined to use Andrew's version of this patch rather than
> Peter's. I have not undertaken an exhaustive comparison, nor do I
> intend to. It is the reviewer's responsibility to justify the changes
> they think the author needs to make, and that wasn't done here. On
> the points of difference Andrew highlighted, I think his version is
> fine.

The justification is that Andrew's version had arbitrary differences
with the existing code, in particular in the datum comparator, which
was different to all other cases in the way that Andrew pointed out.
Overall, there were only tiny differences between the two versions. I
see no reason to not match the existing style. The changes that Andrew
took issue with are utterly insignificant.

Also, the changes that Andrew didn't mention are clearly appropriate.
In particular, the comments on the SortKeys variable being used by
every case except the hash case and datum case should still be updated
to reflect that that's only true for the hash case now.
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2015-04-02 21:36:29 Re: TABLESAMPLE patch
Previous Message Heikki Linnakangas 2015-04-02 21:33:10 Re: What exactly is our CRC algorithm?