Re: Add proper planner support for ORDER BY / DISTINCT aggregates

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Date: 2021-07-02 08:39:44
Message-ID: CAApHDvpYZwbRNurFsG+cvbUFBaFaiF6J6VWKBvWdtVsbU-Bfgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io> wrote:
> I don't know if it's acceptable, but in the case where you add both an
> aggregate with an ORDER BY clause, and another aggregate without the clause,
> the output for the unordered one will change and use the same ordering, maybe
> suprising the unsuspecting user. Would that be acceptable ?

That's a good question. There was an argument in [1] that mentions
that there might be a group of people who rely on aggregation being
done in a certain order where they're not specifying an ORDER BY
clause in the aggregate. If that group of people exists, then it's
possible they might be upset in the scenario that you describe.

I also think that it's going to be pretty hard to make significant
gains in this area if we are too scared to make changes to undefined
behaviour. You wouldn't have to look too hard in the pgsql-general
mailing list to find someone complaining that their query output is in
the wrong order on some query that does not have an ORDER BY. We
pretty much always tell those people that the order is undefined
without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER
BY aggregate people any differently. We'll be stuck forever here and
in many other areas if we're too scared to change the order of
aggregation. You could argue that something like parallelism has
changed that for people already. I think the multi-batch Hash
Aggregate code could also change this.

> I was curious about the performance implication of that additional transition,
> and could not reproduce a signifcant difference. I may be doing something
> wrong: how did you highlight it ?

It was pretty basic. I just created a table with two columns and no
index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP
BY 1; the new code will include a Sort due to lack of any index and
the old code would have done a sort inside nodeAgg.c. I imagine that
the overhead comes from the fact that in the patched version nodeAgg.c
must ask its subnode (nodeSort.c) for the next tuple each time,
whereas unpatched nodeAgg.c already has all the tuples in a tuplestore
and can fetch them very quickly in a tight loop.

David

[1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-07-02 08:47:05 Re: Commit fest manager
Previous Message Dean Rasheed 2021-07-02 08:31:41 Re: rand48 replacement