Re: Progress on fast path sorting, btree index creation time

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Progress on fast path sorting, btree index creation time
Date: 2012-01-06 17:29:04
Message-ID: CA+TgmoZ_8wSrwu6xLSVi300SW0102=1L2KJ-hnP_R_mTstyLHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> As you know, all queries tested have lots and lots of duplicates (a
> ~1.5GB table that contains the same number of distinct elements as a
> ~10MB table once did), and removing the "duplicate protection" for
> btrees, on top of everything else, sees the qsort do an awful lot
> better than HEAD does with the psuedo-protection.

Does that also win vs. unpatched master? If so we may as well pull
that part out and commit it separately.

>> I can't find any description of how this actually works... obviously,
>> we'd like to find a close-to-median element, but how do we actually do
>> that cheaply?
>
> Uh, it works whatever way you want it to work. The implementation has
> to figure out how to get the median ahead of time.

Oh. I'd be inclined to think that would be pretty inefficient, unless
we only did it for very large partitions or when we observed that some
less costly strategy was tanking.

>> I gather from a quick read that this is supposed to be especially good
>> when the data is already somewhat sorted.  Would there be any merit in
>> trying to guess when that's likely to be the case (e.g. based on the
>> correlation statistic)?   That seems like a stretch, but OTOH a GUC
>> doesn't feel great either: what is the poor user to do with a query
>> that does 2 sorts, one of which is faster with QS and the other of
>> which is faster with Timsort?  Ugh.
>
> I'd imagined that it might work a bit like default_statistics_target.
> Of course, I was just thinking out loud. Also, we do a very good job
> on *perfectly* pre-sorted input because we check for pre-sorted input.

We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b). This tends to suck, because we fail to realize that we can
batch the sort. We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted. It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties. It would be nice to improve this, but it
may be more of a planner/executor problem that something we can fix
directly inside the sort implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-01-06 17:30:22 Re: Poorly thought out code in vacuum
Previous Message Tom Lane 2012-01-06 17:24:55 Re: Poorly thought out code in vacuum