Optimizing DISTINCT with LIMIT

From: tmp <skrald(at)amossen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Optimizing DISTINCT with LIMIT
Date: 2008-12-04 13:32:51
Message-ID: gh8m5v$7oj$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

As far as I have understood the following query
SELECT DISTINCT foo
FROM bar
LIMIT baz
is done by first sorting the input and then traversing the sorted data,
ensuring uniqueness of output and stopping when the LIMIT threshold is
reached. Furthermore, a part of the sort procedure is to traverse input
at least one time.

Now, if the input is large but the LIMIT threshold is small, this
sorting step may increase the query time unnecessarily so here is a
suggestion for optimization:
If the input is "sufficiently" large and the LIMIT threshold
"sufficiently" small, maintain the DISTINCT output by hashning while
traversing the input and stop when the LIMIT threshold is reached. No
sorting required and *at* *most* one read of input.

Use case: Websites that needs to present small samples of huge queries fast.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2008-12-04 13:48:05 Re: In-place upgrade: catalog side
Previous Message Gregory Stark 2008-12-04 13:21:37 Assertion failure in new outer/semi/anti join code