Re: Questions regarding distinct operation implementation

From: Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Questions regarding distinct operation implementation
Date: 2022-12-01 19:10:46
Message-ID: b10d2b78-a07e-e520-0cfc-e19f0ec685b2@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 25/11/22 11:00, Ankit Kumar Pandey wrote:
>
> On 25/11/22 02:14, David Rowley wrote:
>> On Fri, 25 Nov 2022 at 06:57, Ankit Kumar Pandey
>> <itsankitkp(at)gmail(dot)com> wrote:
>>> Please let me know any opinions on this.
>> I think if you're planning on working on this then step 1 would have
>> to be checking the SQL standard to see which set of rows it asks
>> implementations to consider for duplicate checks when deciding if the
>> transition should be performed or not.  Having not looked, I don't
>> know if this is the entire partition or just the rows in the current
>> frame.
>>
>> Depending on what you want, an alternative today would be to run a
>> subquery to uniquify the rows the way you want and then do the window
>> function stuff.
>>
>> David
> Thanks David, these are excellent pointers, I will look into SQL
> standard first and so on.
>
Hi,

Looking further into it, I am bit clear about expectations of having
distinct in Windows Aggregates (although I couldn't got hands on SQL
standard as it is not in public domain but distinct in windows aggregate
is supported by Oracle and I am using it as reference).

For table (mytable):

id, name

1, A

1, A

10, B

3, A

1, A

/select avg(distinct id) over (partition by name)/ from mytable (in
oracle db) yields:

2

2

2

2

10

From this, it is seen distinct is taken across the all rows in the
partition.

I also thought of using a subquery approach: /select avg(id) over
(partition by name) from (select distinct(id), name from mytable)/

but this obviously doesn't yield right answer because result should
contain same number of rows as input. This implies we need to find
partition first and then remove duplicates within the partition.

Can we avoid any ordering/sort until existing logic finds if value is in
frame (so as to respect any /order by/ clause given by user), and once
it is determined that tuple is in frame, skip the tuple if it is a
duplicate? If aforementioned approach is right, question is how do we
check if it is duplicate? Should we create a lookup table (as tuples
coming to advance_windowaggregate can be in arbitrary order)? Or any
other approach would be better?

Any opinion on this will be appreciated.

--
Regards,
Ankit Kumar Pandey

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-12-01 19:17:53 Re: Temporary tables versus wraparound... again
Previous Message Robert Haas 2022-12-01 19:04:00 Re: Error-safe user functions