Re: POC: GROUP BY optimization

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2018-06-06 13:52:52
Message-ID: 7edddb4f-9ddd-561d-2bde-47340f91d2a0@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 06/05/2018 07:56 PM, Teodor Sigaev wrote:
>> Thanks for the patch. This (missing) optimization popped-up repeatedly
>> recently, and I was planning to look into it for PG12. So now I don't
>> have to, because you've done all the hard work ;-)
> You are welcome. Actually one of out customers faced the problem with
> GROUP BY column order and exactly with reordering without any indexes,
> you mean it as problem 2). Index optimization was noticed by me later.
> But based on your suggested patch's order I split the patch to index
> and non-index part and second part depends of first one. They touch the
> same part of code and they could not be independent
>

The way I see it the patch does two different things:

a) consider additional indexes by relaxing the pathkeys check

b) if there are no indexes, i.e. when an explicit Sort is needed,
consider reordering the columns by ndistinct

Not sure why those two parts couldn't be separated. I haven't tried
splitting the patch, of course, so I may be missing something.

In the worst case, one part will depend on the other, which is OK. It
still allows us to commit the first part and continue working on the
other one, for example.

>> 1) add_path() ensures that we only keep the one cheapest path sorted
>> path for each pathkeys. This patch somewhat defeats that because it
>> considers additional pathkeys (essentially any permutation of group
>> keys) as interesting. So we may end up with more paths in the list.
> Seems, index scan here could give benefits here only if:
>   1) it's a index only scan
>   2) it's a index full (opposite to only) scan but physical order of
> heap is
>      close to logical index order (ie table is clustered)
>
> In other cases costs of disk seeking will be very high. But on this
> stage of planing we don't know that facts yet. So we couldn't make a
> good decision here and should believe in add_path() logic.
>
Not sure what you mean? Surely we do costing of the paths at this stage,
so we can decide which one is cheaper etc. The decision which paths to
keep is done by add_path(), and it should stay like this, of course. I
wasn't suggesting to move the logic elsewhere.

> > I wonder if we should make the add_path() logic smarter to recognize
> when two
> > paths have different pathkeys but were generated to match the same
> grouping,
> > to reduce the number of paths added by this optimization. Currently
> we do that > for each pathkeys list independently, but we're considering
> many more pathkeys > orderings ...
>
> Hm. I tend to say no.
> select .. from t1 group by a, b
> union
> select .. from t2 group by a, b
>
> t1 and t2 could have different set of indexes and different
> distribution, so locally it could be cheaper to use one index (for
> example, one defined as (b, a) and second as (a,b,c,d) - second is
> larger) but totally - another (example: second table doesn't have (b,a)
> index)
>

But add_path() treats each of the relations independently, why couldn't
we pick a different index for each of the two relations?

>> 2) sort reordering based on ndistinct estimates
>
>> But thinking about this optimization, I'm worried it relies on a
>> couple of important assumptions. For now those decisions could have be
>> made by the person writing the SQL query, but this optimization makes
>> that impossible. So we really need to get this right.
> Hm, sql by design should not be used that way, but, of course, it's used :(
>

Well, yes and no. I'm not worried about people relying on us to give
them some ordering - they can (and should) add an ORDER BY clause to fix
that. I'm more worried about the other stuff.

>> For example, it seems to disregard that different data types have
>> different comparison costs. For example comparing bytea will be far
>> more expensive compared to int4, so it may be much more efficient to
>> compare int4 columns first, even if there are far fewer distinct
>> values in them.
> as I can see cost_sort() doesn't pay attention to this details. And it
> should be a separated patch to improve that.
>

But sort also does not reorder columns.

Imagine you have a custom data type that is expensive for comparisons.
You know that, so you place it at the end of GROUP BY clauses, to reduce
the number of comparisons on that field. And then we come along and just
reorder the columns, placing it first, because it happens to have a high
ndistinct statistic. And there's no way to get the original behavior :-(

>> Also, simply sorting the columns by their ndistinct estimate is
>> somewhat naive, because it assumes the columns are independent.
>> Imagine for example a table with three columns:
>> So clearly, when evaluating GROUP BY a,b,c it'd be more efficient to
>> use "(a,c,b)" than "(a,b,c)" but there is no way to decide this merely
>> using per-column ndistinct values. Luckily, we now have ndistinct
>> multi-column coefficients which could be used to decide this I believe
>> (but I haven't tried).
> Agree, but I don't know how to use it here. Except, may be:
> 1) first column - the column with bigger estimated number of groups
> 2) second column - the pair of (first, second) with bigger estimated
> number of groups
> 3) third column - the triple of (first, second, third) with bigger ...
>
> But seems even with that algorithm, ISTM, it could be implemented in
> cheaper manner.
>

Maybe. I do have some ideas, although I haven't tried implementing it.

If there's no extended statistic on the columns, you can do the current
thing (assuming independence etc.). There's not much we can do here.

If there's an extended statistic, you can do either a greedy search (get
the next column with the highest ndistinct coefficient) or exhaustive
search (computing the estimated number of comparisons).

Another challenge is that using only the ndistinct coefficient assumes
uniform distribution of the values. But you can have a column with 1M
distinct values, where a single value represents 99% of the rows. And
another column with 100k distinct values, with actual uniform
distribution. I'm pretty sure it'd be more efficient to place the 100k
column first.

>
>> The real issue however is that this decision can't be made entirely
>> locally. Consider for example this:
>>
>>      explain select a,b,c, count(*) from t group by a,b,c order by c,b,a;
>>
>> Which is clearly cheaper (at least according to the optimizer) than
>> doing two separate sorts. So the optimization actually does the wrong
>> thing here, and it needs to somehow consider the other ordering
>> requirements (in this case ORDER BY) - either by generating multiple
>> paths with different orderings or some heuristics.
> Hm, thank you. I consider it is a bug of my implementation - basic idea
> was that we try to match already existing or needed order and only if we
> fail or have unmatched tail of pathkey list than we will try to find
> cheapest column order.
> Fixed in v7 (0002-opt_group_by_index_and_order-v7.patch), but may be by
> naive way: if we don't have a path pathkey first try to reorder columns
> accordingly to order by clause. Test for your is also added.
>

OK. I'll give it a try.

>
>> I'm also wondering how freely we can change the group by result
>> ordering. Assume you disable hash aggregate and parallel query -
>> currently we are guaranteed to use group aggregate that produces
>> exactly the ordering as specified in GROUP BY, but this patch removes
>> that "guarantee" and we can produce arbitrary permutation of the
>> ordering. But as I argued in other threads, such implicit guarantees
>> are really fragile, can silently break for arbitrary reasons (say,
>> parallel query will do just that) and can be easily fixed by adding a
>> proper ORDER BY. So I don't think this is an issue.
> Agree. SQL by design doesn't give a warranty of particular order without
> explicit ORDER BY clause.
>
>> The other random thought is how will/could this interact with the
>> incremental sorts patch. I know Alexander wanted to significantly
>> limit where we actually consider the incremental sort, but I'm not
>> sure if this was one of those places or not (is sure looks like a
>> place where we would greatly benefit from it).
> Seems, they should not directly interact. Patch tries to find cheapest
> column order, Alexander's patch tries to make sort cheaper - they are a
> different tasks. But I will try.
>

Thanks.

>> So to summarize this initial review - I do suggest splitting the patch
>> into two parts. One that does the index magic, and one for this
>> reordering optimization. The first part (additional indexes) seems
>> quite fairly safe, likely to get committable soon. The other part
>> (ndistinct reordering) IMHO requires more thought regarding costing
>> and interaction with other query parts.
> Thank you for review!
>

;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-06-06 13:58:16 Re: Spilling hashed SetOps and aggregates to disk
Previous Message Andrew Dunstan 2018-06-06 13:41:27 Re: buildfarm vs code