Re: Querying distinct values from a large table

From: Gregory Stark <gsstark(at)mit(dot)edu>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Igor Lobanov" <ilobanov(at)swsoft(dot)com>, "Richard Huxton" <dev(at)archonet(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-30 16:38:30
Message-ID: 87d54wmpvd.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:

> Chad,
>
> On 1/30/07 6:13 AM, "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com> wrote:
>
> > Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash,
> > I guess), there is no need to push 3M rows through a sort algorithm to only
> > shave it down to 1848 unique records.
> >
> > I am assuming this optimization just isn't implemented in PostgreSQL?
>
> Not that it helps Igor, but we've implemented single pass sort/unique,
> grouping and limit optimizations and it speeds things up to a single seqscan
> over the data, from 2-5 times faster than a typical external sort.

Fwiw I also implemented this last September when I worked on the LIMIT/SORT
case. It's blocked on the same issue that that patch is: how do we communicate
the info that only unique records are needed down the plan to the sort node?

Note that it's not just a boolean flag that we can include like the random
access flag: The Unique node could in theory have a different key than the
sort node.

Coming at this fresh now a few months later I'm thinking instead of trying to
put this intelligence into the individual nodes, an optimizer pass could run
through the tree looking for Sort nodes that can have this bit tweaked.

That doesn't quite help the LIMIT/SORT case because the limit isn't calculated
until run-time but perhaps it's possible to finesse that issue by providing
either the Limit or Sort node with pointers to other nodes.

(Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
your data distribution. It's not hard to come up with distributions where it's
1000x as fast and others where there's no speed difference.)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Bruno Wolff III 2007-01-30 16:44:35 Re: Querying distinct values from a large table
Previous Message Luke Lonergan 2007-01-30 16:34:03 Re: Querying distinct values from a large table