Re: POC: GROUP BY optimization

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2018-06-06 18:04:09
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> problem 2). Index optimization was noticed by me later. But based on your
>> suggested patch's order I split the patch to index and non-index part and
>> second part depends of first one. They touch the same part of code and they
>> could not be independent
> The way I see it the patch does two different things:

> a) consider additional indexes by relaxing the pathkeys check
Yes, but not only. Patch could reorder GROUP BY clause to match existing
pathkeys which could come from index scan (suggested by patch or not) or some
another way to get ordered output.

> b) if there are no indexes, i.e. when an explicit Sort is needed, consider
> reordering the columns by ndistinct
Yes. But again, this description is a bit short. First it works after first
patch and might get some preordered leading pathkeys. Second, it tries to match
ORDER BY clause order if there is no preordered leading pathkeys from first
patch (it was introduced in v7). And third, if there is a tail of unmatched
pathkeys on previous stages then it will reorder that tail.

> In the worst case, one part will depend on the other, which is OK. It still
> allows us to commit the first part and continue working on the other one, for
> example.
Exactly it's our case. Of course, it's possible to split first patch for two
again: just suggestion of index (1.1) and reordering by existing pathkeys (1.2).
Then 1.1 will be independent but 2 still should be applied after 1.2. But patch
1.1 is rather useless.

> can decide which one is cheaper etc. The decision which paths to keep is done by
> add_path(), and it should stay like this, of course. I wasn't suggesting to move
> the logic elsewhere.
Cool, I haven't intention to modify it too.

>>  > I wonder if we should make the add_path() logic smarter to recognize when two
>>  > paths have different pathkeys but were generated to match the same grouping,
>>  > to reduce the number of paths added by this optimization. Currently we do
>> that > for each pathkeys list independently, but we're considering many more
>> pathkeys > orderings ...
>> Hm. I tend to say no.
>> select .. from t1 group by a, b
>> union
>> select .. from t2 group by a, b
>> t1 and t2 could have different set of indexes and different distribution, so
>> locally it could be cheaper to use one index (for example, one defined as (b,
>> a) and second as (a,b,c,d) - second is larger) but totally - another (example:
>> second table doesn't have (b,a) index)
> But add_path() treats each of the relations independently, why couldn't we pick
> a different index for each of the two relations?

Having of sorted output of both subselect could be cheaper that sorting one of
them even if index scan was cheaper. But it seems to me that is not deal of
suggested here optimization.

>>> For example, it seems to disregard that different data types have different
>>> comparison costs. For example comparing bytea will be far more expensive
>>> compared to int4, so it may be much more efficient to compare int4 columns
>>> first, even if there are far fewer distinct values in them.
>> as I can see cost_sort() doesn't pay attention to this details. And it should
>> be a separated patch to improve that.
> But sort also does not reorder columns.
Yes. But estimation of cost of comparing function/number of unique values in
column could be not very accurate and so planner could make a wrong choice. I
saw 2 times difference in real-world application. Again, improving sort cost
estimation is a separate task.

> Imagine you have a custom data type that is expensive for comparisons. You know
> that, so you place it at the end of GROUP BY clauses, to reduce the number of
> comparisons on that field. And then we come along and just reorder the columns,
> placing it first, because it happens to have a high ndistinct statistic. And
> there's no way to get the original behavior :-(
Hm. that it could be, but I don't know how to improve here. Current cost_sort()
will return the same cost for any columns order.

Okay, here we know an estimation of nrow, we could somehow find a comparing
function oid and find a its procost field. And then? we also need to know width
of tuple here and mostly repeat the cost_sort.

Another option is an introducing enable_groupby_reorder GUC variable although
personally I don't like an idea to add new GUC variable.

>> Agree, but I don't know how to use it here. Except, may be:
>> 1) first column - the column with bigger estimated number of groups
>> 2) second column - the pair of (first, second) with bigger estimated number of
>> groups
>> 3) third column - the triple of (first, second, third) with bigger ...
>> But seems even with that algorithm, ISTM, it could be implemented in cheaper
>> manner.
> Maybe. I do have some ideas, although I haven't tried implementing it.
Implemented, pls, have a look.

> If there's no extended statistic on the columns, you can do the current thing
> (assuming independence etc.). There's not much we can do here.

> If there's an extended statistic, you can do either a greedy search (get the
> next column with the highest ndistinct coefficient) or exhaustive search
> (computing the estimated number of comparisons).
> Another challenge is that using only the ndistinct coefficient assumes uniform
> distribution of the values. But you can have a column with 1M distinct values,
> where a single value represents 99% of the rows. And another column with 100k
> distinct values, with actual uniform distribution. I'm pretty sure it'd be more
> efficient to place the 100k column first.

Interesting. Will think, thank you

Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru

Attachment Content-Type Size
0001-opt_group_by_index-v8.patch text/x-patch 16.9 KB
0002-opt_group_by_index_and_order-v8.patch text/x-patch 10.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-06-06 18:09:22 Re: Code of Conduct plan
Previous Message Tom Lane 2018-06-06 17:52:13 Re: buildfarm vs code