Re: Abbreviated keys for text cost model fix

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Abbreviated keys for text cost model fix
Date: 2015-02-23 16:40:44
Message-ID: 54EB580C.2000904@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 22.2.2015 22:30, Peter Geoghegan wrote:
> On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> In short, this fixes all the cases except for the ASC sorted data. I
>> haven't done any code review, but I think we want this.
>>
>> I'll use data from the i5-2500k, but it applies to the Xeon too, except
>> that the Xeon results are more noisy and the speedups are not that
>> significant.
>>
>> For the 'text' data type, and 'random' dataset, the results are these:
>>
>> scale datum cost-model
>> -------------------------------
>> 100000 328% 323%
>> 1000000 392% 391%
>> 2000000 96% 565%
>> 3000000 97% 572%
>> 4000000 97% 571%
>> 5000000 98% 570%
>>
>> The numbers are speedup vs. master, so 100% means exactly the same
>> speed, 200% means twice as fast.
>>
>> So while with 'datum' patch this actually caused very nice speedup for
>> small datasets - about 3-4x speedup up to 1M rows, for larger datasets
>> we've seen small regression (~3% slower). With the cost model fix, we
>> actually see a significant speedup (about 5.7x) for these cases.
>
> Cool.
>
>> I haven't verified whether this produces the same results, but if it
>> does this is very nice.
>>
>> For 'DESC' dataset (i.e. data sorted in reverse order), we do get even
>> better numbers, with up to 6.5x speedup on large datasets.
>>
>> But for 'ASC' dataset (i.e. already sorted data), we do get this:
>>
>> scale datum cost-model
>> -------------------------------
>> 100000 85% 84%
>> 1000000 87% 87%
>> 2000000 76% 96%
>> 3000000 82% 90%
>> 4000000 91% 83%
>> 5000000 93% 81%
>>
>> Ummm, not that great, I guess :-(
>
> You should try it with the data fully sorted like this, but with one
> tiny difference: The very last tuple is out of order. How does that
> look?

So here are the results for ASC-ordered dataset, with one 'unsorted' row
added to the end of the dataset. As before the complete scripts are
attached, and the raw results are available in a spreadsheet:

http://bit.ly/18g1nTU

The durations are much higher than without the single unsorted row added
at the end. Queries often take 20x longer to finish (on the same code),
depending on the scale.

The speedup results (compared to master) look like this:

scale query# datum numeric cost model
100000 1 859% 861% 856%
100000 2 811% 814% 805%
100000 3 100% 100% 97%
1000000 1 805% 804% 807%
1000000 2 769% 773% 770%
1000000 3 100% 100% 98%
2000000 1 97% 97% 673%
2000000 2 96% 97% 646%
2000000 3 99% 101% 678%
3000000 1 98% 98% 578%
3000000 2 96% 97% 557%
3000000 3 99% 101% 579%
4000000 1 99% 99% 513%
4000000 2 97% 98% 497%
4000000 3 99% 101% 510%
5000000 1 99% 99% 469%
5000000 2 97% 98% 456%
5000000 3 99% 101% 466%

What's interesting here is that some queries are much faster, but query
#3 is slow until we hit 2M rows:

select * from (select * from stuff_int_desc order by randint
offset 100000000000) foo

Looking at the previous tests, I see this is exactly what's happening to
this query with 'random' dataset - it's slightly slower than master up
until 2M rows, when it suddenly jumps to the same speedup as the other
queries. Can we do something about that?

Anyway, I'm wondering what conclusion we can do from this? I believe
vast majority of datasets in production won't be perfectly sorted,
because when the table is CLUSTERed by index we tend to use index scan
to do the sort (so no problem), or the data are not actually perfectly
sorted (and here we get significant speedup).

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
scripts.tgz application/x-tgz 18.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-02-23 16:53:59 Re: Primary not sending to synchronous standby
Previous Message Amit Kapila 2015-02-23 16:31:40 Re: Allow "snapshot too old" error, to prevent bloat