Re: ToDo: KNN Search should to support DISTINCT clasuse?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-23 16:25:05
Message-ID: CA+TgmoZdVZzmneJ08Wv+GK91F34co78gE66hEsbEEirA4-gA+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Don't hold your breath. There are two ways the system could implement
>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>> hashaggregate will destroy any input ordering, so there's no value in
>> using the index as input. sort and uniq requires the input to be sorted
>> by *all* the columns being distinct'ed, not just one, so again this
>> index isn't useful.
>
> We already have some bits that understand functional dependencies for
> distinct/group by don't we? Do we detect that col <-> 'foo' is
> functionally dependent on col? If so is it possible to construct a
> multicolumn index that can produce an ordering like [col <-> 'foo',
> col] which could be used to get distinct values of col in the knn
> order (since the first column is functionally dependent on the second
> it can be ignored for grouping purposes).
>
> Not that we can do this now but I wonder whether a lot of the pieces
> are already there and just need to be hooked up together.

I think the real issue is that a hash aggregate doesn't emit any rows
until the entire input is read, so it doesn't play nice with LIMIT.
In general there's no other possible strategy, because you could get
another row belonging to an existing group at any time right up
through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-10-23 16:27:18 Re: Successor of MD5 authentication, let's use SCRAM
Previous Message Robert Haas 2012-10-23 16:21:31 Re: too much pgbench init output