Re: POC: GROUP BY optimization

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2020-06-20 14:30:10
Message-ID: 20200620143010.z3af7wa4v22h4lu4@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Sat, May 16, 2020 at 04:56:09PM +0200, Tomas Vondra wrote:
>
> So I don't think there will be a single "interesting" grouping pathkeys
> (i.e. root->group_pathkeys), but a collection of pathkeys. And we'll
> need to build grouping paths for all of those, and leave the planner to
> eventually pick the one giving us the cheapest plan ...
>
> A "brute-force" approach would be to generate all possible orderings of
> group_pathkeys, but that's probably not going to fly - it might easily
> cause an explosion in number of paths we track, etc. So we'll need to
> pick/prioritize orderings that are more likely to give us efficient
> plans, and ORDER BY seems like a good example because it means we won't
> need an explicit sort.

Yes, I see. I've already rebased the original version of patch and now
working on your suggestions. In the meantime one question:

> But there are also decisions that can be made only after we build the
> grouping paths. For example, we may have both GROUP BY and ORDER BY, and
> there is no "always correct" way to combine those. In some cases it may
> be correct to use the same pathkeys, in other cases it's better to use
> different ones (which will require an extra Sort, with additional cost).

Do I understand correctly, your idea is that in some cases it's cheaper
to use different order for GROUP BY than in ORDER BY even with an extra
sorting? In the current patch implementation there is an assumption that
says it's always better to match the order of pathkeys for both GROUP
BY/ORDER BY (which means that the only degree of freedom there is to
reorder the tail, which in turn makes both "unsorted input" and
"partially sorted input" cases from your original email essentially the
same). Can you show such an example when this assumption is invalid?

> I've only quickly skimmed the old thread, but IIRC there were two main
> challenges in getting the optimization right:
>
> 1) deciding which orderings are interesting / worth additional work
>
> I think we need to consider these orderings, in addition to the one
> specified in GROUP BY:

Yes, looks like the current patch implementation together with
preprocess_groupclause already implements this, although maybe not that
flexible.

> 2) costing the alternative orderings
>
> I think we've already discussed various ways to leverage as much
> available info as possible (extended stats, MCVs, ...) but I think the
> patch only does some of it.

Right, there were couple of ideas what to do in case of a few groups
which are too big they can invalidate current assumptions about costs.
E.g. do not apply reordering if we detected such situation, or use
"conservative" approach and take the biggest group for estimations.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2020-06-20 14:43:19 Re: Failures with wal_consistency_checking and 13~
Previous Message Amit Kapila 2020-06-20 14:12:33 Re: git.postgresql.org ok?