Re: Abbreviated keys for text cost model fix

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Abbreviated keys for text cost model fix
Date: 2015-02-23 18:22:32
Message-ID: CAM3SWZTVach_k6jMzncj=uj8vnsKSGdsZFCnMG2z8stz6t7Z8Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 23, 2015 at 8:40 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> The durations are much higher than without the single unsorted row added
> at the end. Queries often take 20x longer to finish (on the same code),
> depending on the scale.

I knew this would happen. :-)

> What's interesting here is that some queries are much faster, but query
> #3 is slow until we hit 2M rows:
>
> select * from (select * from stuff_int_desc order by randint
> offset 100000000000) foo

Not sure why that would be. Can you check what a build with "#define
DEBUG_ABBREV_KEYS" (in varlena.c) shows happening with the cost model?
Enable DEBUG1 output for that.

> Looking at the previous tests, I see this is exactly what's happening to
> this query with the 'random' dataset - it's slightly slower than master
> up until 2M rows, when it suddenly jumps to the same speedup as the
> other queries. Can we do something about that?

Possibly.

> Anyway, I'm wondering what conclusion we can draw from this? I believe
> the vast majority of datasets in production won't be perfectly sorted,
> because when the table is CLUSTERed by an index we tend to use an index
> scan to do the sort (so no problem), or the data are not actually
> perfectly sorted (and then we get a significant speedup).

One conclusion might be that commit a3f0b3d6 should not have added the
pre-check for perfectly sorted input (sketched below), which is not in
the Bentley & McIlroy original - exploiting the fact that the input is
already partially sorted is a fine thing, but we're not doing the
necessary bookkeeping for that. That's not really why I suggested that
test case, though. Rather, I wanted to put the possible regression on
perfectly sorted input in perspective. Maybe that regression isn't
worth worrying about if a single tuple out of order at the end of the
set dramatically shifts the outcome to strongly favor abbreviation.
Another way of looking at it would be that the pre-sorted case is an
argument against strxfrm() sorting in general more than against
abbreviated keys in particular, an argument that seems new and strange
to me. The analysis of algorithms focuses on the worst case and the
average case for a reason, and (say) the glibc documentation never
considers this when discussing strxfrm().
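
For reference, the pre-check in question amounts to one extra
comparator pass over the entire input before any partitioning happens,
along these lines (an illustrative sketch, not the literal
src/port/qsort.c code):

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Sketch of the kind of pre-sorted check added by a3f0b3d6: scan the
     * whole array once with the comparator; if no inversion is found,
     * the sort can return immediately.  A single out-of-order element at
     * the very end means the entire pass was wasted work.
     */
    static bool
    input_is_presorted(const char *a, size_t n, size_t es,
                       int (*cmp) (const void *, const void *))
    {
        const char *pm;

        for (pm = a + es; pm < a + n * es; pm += es)
        {
            if (cmp(pm - es, pm) > 0)
                return false;   /* inversion found; do the real sort */
        }
        return true;            /* already sorted; skip sorting entirely */
    }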

--
Peter Geoghegan
