Re: Questions regarding distinct operation implementation

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Questions regarding distinct operation implementation
Date: 2022-12-03 20:57:44
Message-ID: CAApHDvreucj2d-H_b0-f-Gg0CK-sT99a+3txUWQd6hJYrXeQcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, 4 Dec 2022 at 08:57, Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com> wrote:
> On 04/12/22 00:50, David Rowley wrote:
>> providing you can code it in such a way that you only
>> allocate one of these at once, i.e not allocate one per DISTINCT
>> aggregate all at once.
>
> I am not sure if I understand this, does it means at given time, do allocation for only one distinct aggregate
> instead of all, in case of multiple aggregates using distinct?

If you were to limit this to only working with the query you mentioned
in [1], i.e PARTITION BY without an ORDER BY, then you only need to
aggregate once per partition per aggregate and you only need to do
that once all of the tuples for the partition are in the tuplestore.
It seems to me like you could add all the records to a tuplesort and
then sort by the DISTINCT column then aggregate everything except for
consecutive duplicates. You can then aggregate any other aggregates
which share the same DISTINCT column, otherwise, you just destroy the
tuplesort and rinse and repeat for the next aggregate.

To make this work when rows can exit the window frame seems
significantly harder. Likely a hash table would be a better data
structure to remove records from, but then how are you going to spill
the hash table to disk when it reaches work_mem? As David J mentions,
it seems like you'd need a hash table with a counter to track how many
times a given value appears and only remove it from the table once
that counter reaches 0. Unsure how you're going to constrain that to
not use more than work_mem though.

Are there any other databases which support DISTINCT window aggregate
with an ORDER BY in the window clause?

David

[1] https://postgr.es/m/b10d2b78-a07e-e520-0cfc-e19f0ec685b2@gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-12-03 21:46:41 Re: Error-safe user functions
Previous Message Nathan Bossart 2022-12-03 20:16:17 Re: Generate pg_stat_get_* functions with Macros