remove_useless_groupby_columns is too enthusiastic

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: remove_useless_groupby_columns is too enthusiastic
Date: 2022-07-12 17:31:13
Message-ID: 1657885.1657647073@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I looked into the complaint at [1] about the planner being much
stupider when one side of a JOIN USING is referenced than the other
side. It seemed to me that that shouldn't be happening, because
the relevant decisions are made on the basis of EquivalenceClasses
and both USING columns should be in the same EquivalenceClass.
I was right about that ... but the damage is being done far upstream
of any EquivalenceClass work. It turns out that the core of the
issue is that the query looks like

SELECT ... t1 JOIN t2 USING (x)
GROUP BY x, t2.othercol ORDER BY t2.othercol LIMIT n

In the "okay" case, we resolve "GROUP BY x" as GROUP BY t1.x.
Later on, we are able to realize that ordering by t1.x is
equivalent to ordering by t2.x (because same EquivalenceClass),
and that it's equally good to consider the GROUP BY items in
either order, and that ordering by t2.othercol, t2.x would
allow us to perform the grouping using a GroupAggregate on
data that's already sorted to meet the ORDER BY requirement.
Since there happens to be an index on t2.othercol, t2.x,
we can implement the query with no explicit sort, which wins
big thanks to the small LIMIT.

In the not-okay case, we resolve "GROUP BY x" as GROUP BY t2.x.
Then remove_useless_groupby_columns notices that x is t2's
primary key, so it figures that grouping by t2.othercol is
redundant and throws away that element of GROUP BY. Now there
is no apparent connection between the GROUP BY and ORDER BY
lists, defeating the optimizations that would lead to a good
choice of plan --- in fact, we conclude early on that that
index's sort ordering is entirely useless to the query :-(

I tried the attached quick-hack patch that just prevents
remove_useless_groupby_columns from removing anything that
appears in ORDER BY. That successfully fixes the complained-of
case, and it doesn't change any existing regression test results.
I'm not sure whether there are any cases that it makes significantly
worse.

(I also kind of wonder if the fundamental problem here is that
remove_useless_groupby_columns is being done at the wrong time,
and we ought to do it later when we have more semantic info.
But I'm not volunteering to rewrite it completely.)

Anyway, remove_useless_groupby_columns has been there since 9.6
and we've not previously seen reports of cases that it makes worse,
so this seems like a corner-case problem. Hence I wouldn't risk
back-patching this change. It seems worth considering for HEAD
though, so I'll stick it in the September CF.

regards, tom lane

[1] https://www.postgresql.org/message-id/17544-ebd06b00b8836a04%40postgresql.org

Attachment Content-Type Size
prevent-removing-ORDER-BY-columns-from-GROUP-BY-1.patch text/x-diff 902 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-07-12 17:33:52 Re: Cleaning up historical portability baggage
Previous Message Yura Sokolov 2022-07-12 17:22:57 Re: Reducing the chunk header sizes on all memory context types