Re: POC: GROUP BY optimization

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2019-05-24 12:10:48
Message-ID: CA+q6zcVjCVkAhoLD6rjDoPGdfWDjQLD=N2tvfH1yU+Eq+iby9w@mail.gmail.com
Lists: pgsql-hackers

> On Fri, May 3, 2019 at 11:55 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I don't recall the exact details of the discussion, since most of it
> happened almost a year ago, but the main concern about the original
> costing approach is that it very much assumes uniform distribution.
>
> For example if you have N tuples with M groups (which are not computed
> using estimate_num_groups IIRC, and we can hardly do much better), then
> the patch assumed each group is ~N/M rows and used that for computing
> the number of comparisons for a particular sequence of ORDER BY clauses.
>
> But that may result in pretty significant errors in cases with a couple
> of very large groups.

Yes, you're right, the current version of the patch assumes a uniform
distribution of values between these M groups. After some thinking I've come
to the idea that maybe it's better not to try to find a costing that would be
acceptable for both uniform and non-uniform distributions, but simply not
reorder at all if there are any significant deviations from what seems to be
the "best case", namely when values are distributed among the groups
relatively uniformly and there are no doubts about how expensive they are to
compare.
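
Just to be sure we are on the same page about that assumption, here is how I
read it, as a minimal self-contained sketch rather than the patch's actual
code (ngroups[i] stands for the estimated number of distinct groups formed by
the first i+1 sort columns):

    /*
     * Uniform-distribution assumption: with ngroups[i] equally sized groups
     * after the first i+1 columns, the chance that two random tuples still
     * compare as equal (and therefore need one more column comparison) is
     * ~1/ngroups[i].
     */
    static double
    expected_comparisons_per_pair(const double *ngroups, int nkeys)
    {
        double      comparisons = 1.0;  /* first column is always compared */
        double      same_prefix = 1.0;  /* P(equal on all columns so far) */
        int         i;

        for (i = 0; i < nkeys - 1; i++)
        {
            same_prefix /= ngroups[i];  /* uniform: P(equal) = 1/ngroups */
            comparisons += same_prefix; /* next column gets compared too */
        }
        return comparisons;
    }

One large group breaks exactly this step: P(equal) can be much higher than
1/ngroups[i], so an estimate like the above becomes too optimistic.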

That said, on top of the current implementation it's possible to check:

* the dependency coefficient between columns (if I understand correctly, a
non-uniform distribution of values between the groups, i.e. a few large
groups, should be visible in extended statistics as a significant dependency)

* that the largest/smallest group in the MCV list doesn't differ too much
from the expected "average" group

* that the average width and comparison procost are similar across the columns

If everything fits (which I hope would be the case most of the time), apply
the reordering; otherwise use whatever order was specified (which leaves room
for manual reordering in the relatively rare remaining situations). Does that
make sense?
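
To illustrate, the gating could look roughly like this self-contained sketch
(the inputs and thresholds here are made up for illustration, they are not
from the patch):

    #include <stdbool.h>

    /*
     * Hypothetical check: reorder only when the statistics look close to the
     * uniform "best case"; otherwise keep the user-specified order.  All
     * thresholds are placeholders.
     */
    static bool
    grouping_looks_uniform(double dependency,     /* ext. stats dependency, 0..1 */
                           double max_mcv_freq,   /* largest MCV frequency */
                           double avg_group_freq, /* expected 1/ngroups frequency */
                           double procost_ratio)  /* max/min comparison procost */
    {
        /* a few large groups should show up as a strong dependency ... */
        if (dependency > 0.5)
            return false;
        /* ... or as an MCV entry much larger than the "average" group */
        if (max_mcv_freq > 3.0 * avg_group_freq)
            return false;
        /* comparison costs should be comparable across the columns */
        if (procost_ratio > 2.0)
            return false;
        return true;
    }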

> But what I think we could do is using largest possible group instead of
> the average one. So for example when estimating number of comparisons
> for columns (c1,..,cN), you could look at MCV lists for these columns
> and compute
>
> m(i) = Max(largest group in MCV list for i-th column)
>
> and then use Min(m(1), ..., m(k)) when estimating the number of
> comparisons.

I see the idea, but I'm a bit confused about how to get the largest group
from an MCV list. I mean, there is a list of most common values per column
with their frequencies, and similar data for multi-column statistics, but how
do I get the size of those groups?
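
To make the question concrete, the only naive reading I can come up with is
scaling an MCV frequency by the estimated number of rows, roughly as in the
sketch below. Is that what you mean?

    /*
     * An MCV entry only gives a frequency (a fraction of all rows), so the
     * "largest group" would have to be that fraction scaled by the estimated
     * row count of the relation.
     */
    static double
    mcv_largest_group(const double *mcv_freqs, int nvalues, double ntuples)
    {
        double      max_freq = 0.0;
        int         i;

        for (i = 0; i < nvalues; i++)
        {
            if (mcv_freqs[i] > max_freq)
                max_freq = mcv_freqs[i];
        }
        return max_freq * ntuples;  /* fraction of rows -> number of rows */
    }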
