Re: about 7.0 LIMIT optimization

From: "Roberto Cornacchia" <rob(dot)c(at)virgilio(dot)it>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: about 7.0 LIMIT optimization
Date: 2000-02-24 00:10:36
Message-ID: 2DF9D9DEB3AE3D11783200807CFB3229@rob.c.virgilio.it
Lists: pgsql-hackers

>> - You cannot select the top N rows according to criterion A ordering
>> the results with a different criterion B.

> True, but I don't see how to do that with one indexscan (for that
> matter, I don't even see how to express it in the SQL subset that
> we support...)

...That's why we proposed this syntax extension:

SELECT
.
.
STOP AFTER <N> (we changed the name, but this is the LIMIT)
RANK BY <A>
ORDER BY <B>

Here you can select the best <N> rows according to <A> and then order the results by <B>.
Note that, without such an extension, the same thing could be expressed only with a subselect (with an ORDER BY clause in the inner select, which is non-standard as well).
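The intended semantics can be sketched as follows (a toy Python illustration with hypothetical rows and keys, not PostgreSQL code): keep the best N rows by the ranking criterion, then present them ordered by a different criterion.

```python
# Sketch of STOP AFTER <N> ... RANK BY <A> ... ORDER BY <B> semantics.
# Rows and column names here are made up for illustration.

def stop_after_rank_by(rows, n, rank_key, order_key):
    """Select the best n rows by rank_key, then order the result by order_key."""
    best = sorted(rows, key=rank_key)[:n]   # RANK BY <A>, STOP AFTER <n>
    return sorted(best, key=order_key)      # ORDER BY <B>

rows = [{"price": p, "name": c} for p, c in
        [(30, "c"), (10, "a"), (20, "b"), (40, "d")]]

# The two cheapest rows, presented alphabetically by name:
result = stop_after_rank_by(rows, 2,
                            rank_key=lambda r: r["price"],
                            order_key=lambda r: r["name"])
```

A subselect could express the same thing, but only by relying on an ORDER BY inside the inner select.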

>> - If you ask for the best 10 rows, from a relation including
>> 100000 rows, you have to do a traditional sort on 100000 rows and
>> then retain only the first 10, doing more comparisons than requested.

> Not if there's an index that implements the ordering --- and if there
> is not, I don't see how to avoid the sort anyway.

Of course, if you have an index there is no problem.
It is also true that if you don't have an index there is no way to avoid the sort, but in that case we use a specialized sort, which performs far fewer comparisons.
For example, if you want the 10 best rows from 100000, these are the average numbers of comparisons:

QuickSort: 1.6E+14
SortStop: 1.5E+11

>> - You can choose a "fast-start" plan (i.e., basically,
>> a pipelined plan), but you cannot perform an "early-stop" of
>> the stream when you have a "slow-start" plan (e.g. involving sorts
>> or hash tables).

> Why not? The executor *will* stop when it has as many output rows as
> the LIMIT demands.

Yes, but consider this very simple case:

LIMIT 10
  [something else]
    MergeJoin (100000 rows)
      Sort (100000 rows)
        SeqScan on Table1 (100000 rows)
      IndexScan on Table2 (100 rows)

Assuming that referential constraints allow us to do it, we would do the following:

[something else]
  MergeJoin (10 rows)
    SortStop 10 (10 rows)
      SeqScan on Table1 (100000 rows)
    IndexScan on Table2 (100 rows)

Here, we get only 10 rows from the outer relation. *In general*, this is NOT correct, but referential constraints make it safe in many cases. You can see that in the second approach, the "[something else]" will operate on an input stream of cardinality 10, instead of 100000 as in the first approach. This is what we call the "push-down" of the Stop operator.
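The push-down above can be sketched with a toy pipeline (plain Python, not the PostgreSQL executor; table contents, the one-to-one join assumption, and the merge-join code are all made up for illustration). The Stop operator sits *below* the join, so only 10 outer rows ever reach it:

```python
import heapq

def sort_stop(rows, n, key):
    """SortStop: yield only the n smallest rows, in order (Stop pushed below the join)."""
    yield from heapq.nsmallest(n, rows, key=key)

def merge_join(outer, inner, key):
    """Toy merge join of two key-sorted streams (at most one inner match per key)."""
    inner = iter(inner)
    try:
        r = next(inner)
        for l in outer:
            while key(r) < key(l):
                r = next(inner)
            if key(r) == key(l):
                yield (l, r)
    except StopIteration:
        return

outer = [(i, "t1-%d" % i) for i in range(100000)]  # Table1: 100000 rows
inner = [(i, "t2-%d" % i) for i in range(100)]     # Table2: 100 rows
key = lambda row: row[0]

# Every pushed-down outer row has a match here (the "referential constraint"
# that makes the push-down safe in this sketch).
result = list(merge_join(sort_stop(outer, 10, key), inner, key))
```

If the constraint did not hold (some of the 10 best outer rows had no join partner), the join would return fewer than 10 rows, which is exactly why the push-down is not correct in general.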

> I'd be the first to admit that the cost model needs some fine-tuning
> still. It's just a conceptual structure at this point.

We hope you are not taking our posts as criticism. We used PostgreSQL as a base for our proposal, found good results, and now we are wondering whether you are interested in continuing in this direction.
Btw, DB2 currently adopts "LIMIT" optimization techniques similar to ours.

Regards

Roberto Cornacchia


