Re: Re: Abbreviated keys for Datum tuplesort

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>
Subject: Re: Re: Abbreviated keys for Datum tuplesort
Date: 2015-01-25 11:15:12
Message-ID: 87ppa3gx3p.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>>>> "Robert" == Robert Haas <robertmhaas(at)gmail(dot)com> writes:

>> Here's the cleaned-up version of the patch to allow abbreviated keys
>> when sorting a single Datum. This also removes comments that suggest
>> that the caller of tuplesort_begin_datum should ever have to care
>> about the abbreviated key optimization.
>>
>> I'll add this to the CF.

Robert> I think this is a good idea. Do you have a test case that
Robert> shows the benefit?

The best test case for datum sort performance is to use percentile_disc,
since that has almost no overhead beyond performing the actual sort.
(Unlike, say, count(distinct) or mode(), both of which have to do an
additional comparison pass over the sorted data; but count(distinct) is
probably the most common use of the datum sort in the wild, so it's
useful to try that too.)

So given some suitable test data, such as

create table stuff as select random()::text as randtext
from generate_series(1,1000000); -- or however many rows

you can do

select percentile_disc(0) within group (order by randtext) from stuff;

or

select count(distinct randtext) from stuff;

The performance improvements I saw were pretty much exactly as expected
from the improvement in the ORDER BY and CREATE INDEX cases.

The best test case for checking the correct order of results is to use
array_agg(x order by x), for example as follows:

select u, u <= lag(u) over ()
from (select unnest(a) as u
from (select array_agg(randtext order by randtext)
from stuff) s1) s2;

(note that array_agg(x order by x) uses the datum sort, but
array_agg(x order by y) uses the ordinary heap tuple sort)

--
Andrew (irc:RhodiumToad)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Gierth 2015-01-25 11:22:32 Re: New CF app deployment
Previous Message Pavel Stehule 2015-01-25 10:23:20 Re: proposal: row_to_array function