Re: POC: GROUP BY optimization

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2018-06-06 21:57:20
Message-ID: 4a76e22a-045f-a083-1bc2-84ee90fdb662@2ndquadrant.com
Lists: pgsql-hackers

On 06/06/2018 11:22 PM, Claudio Freire wrote:
> On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>>>>> For example, it seems to disregard that different data types have
>>>>>> different comparison costs. For example comparing bytea will be far
>>>>>> more expensive compared to int4, so it may be much more efficient to
>>>>>> compare int4 columns first, even if there are far fewer distinct
>>>>>> values in them.
>>>>> As far as I can see, cost_sort() doesn't pay attention to these
>>>>> details. And it should be a separate patch to improve that.
>>>>>
>>>>
>>>> But sort also does not reorder columns.
>>> Yes. But the estimated cost of the comparison function / the number
>>> of unique values in a column may not be very accurate, so the planner
>>> could make a wrong choice.
>
> ...
>>
>>>> Imagine you have a custom data type that is expensive for comparisons.
>>>> You know that, so you place it at the end of GROUP BY clauses, to
>>>> reduce the number of comparisons on that field. And then we come along
>>>> and just reorder the columns, placing it first, because it happens to
>>>> have a high ndistinct statistic. And there's no way to get the
>>>> original behavior :-(
>>> Hm, that could be, but I don't know how to improve it here. The
>>> current cost_sort() will return the same cost for any column order.
>>>
>>> Okay, here we know an estimate of nrows; we could somehow find the
>>> comparison function's OID and look up its procost field. And then? We
>>> would also need to know the tuple width here and mostly repeat
>>> cost_sort().
>>>
>>> Another option is introducing an enable_groupby_reorder GUC variable,
>>> although personally I don't like the idea of adding a new GUC variable.
>>>
>>
>> I don't like the idea of yet another GUC either, as it's a rather blunt
>> instrument. Which is why I'm suggesting splitting the patch into two
>> parts. Then we can apply the index stuff (which seems relatively
>> straightforward) and keep working on this second part.
>>
>> FWIW I think it would be useful to have a "development GUC" that would
>> allow us to enable/disable these options during development, because
>> that makes experiments much easier. But then remove them before commit.
>
> Correct me if I'm wrong, but doesn't this patch concern itself precisely
> with sort performance?
>

Yes, the second part (reordering pathkeys by ndistinct) does.

> As such, estimating sort performance correctly in the various plan
> variants being considered seems to be a very central aspect of it.
>
> This means it's pretty much time to consider the effect of operator
> cost *and* distinct values in the cost computation.
>

Yes, until now that was not really needed, because the optimizer does not
consider reordering of the pathkeys - it simply grabs either the GROUP BY
or the ORDER BY pathkeys and runs with that.

So the costing was fairly trivial; we simply do something like

comparison_cost = 2.0 * cpu_operator_cost;

sort_cost = comparison_cost * tuples * LOG2(tuples);

which essentially ignores that there might be multiple columns, or that
the columns may have sort operators with different costs.

The question is how reliable the heuristics can be. The current patch
uses just plain ndistinct, which seems rather unreliable, and I don't
have a clear idea how to improve that - we may have MCV lists for the
columns and perhaps some extended statistics, but I'm not sure how far
we can run with that.

Essentially, what we need is to estimate the number of comparisons for
each column, to compute a better comparison_cost.
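
For illustration, here's a rough self-contained sketch of the kind of
per-column estimate I mean - it's not planner code, the procost/ndistinct
arrays are made-up inputs (in reality they'd come from pg_proc and the
statistics), and it assumes uniform, independent distributions:

#include <math.h>
#include <stdio.h>

#define CPU_OPERATOR_COST 0.0025    /* default cpu_operator_cost */

static double
estimate_sort_cost(double tuples, int ncols,
                   const double *procost, const double *ndistinct)
{
    double      per_tuple_cost = 0.0;
    double      group_size = tuples;    /* rows still tied before column k */
    int         k;

    for (k = 0; k < ncols; k++)
    {
        /* probability that columns 0 .. k-1 all compared as equal */
        double      prob_reached = (k == 0) ? 1.0 : group_size / tuples;

        per_tuple_cost += 2.0 * CPU_OPERATOR_COST * procost[k] * prob_reached;

        /* expected group size shrinks by this column's ndistinct */
        group_size = fmax(group_size / ndistinct[k], 1.0);
    }

    /* same N * log2(N) shape as now, only the per-tuple cost differs */
    return per_tuple_cost * tuples * log2(tuples);
}

int
main(void)
{
    /* cheap, high-ndistinct column first; expensive type last */
    double      procost_a[] = {1.0, 10.0};
    double      ndistinct_a[] = {1000.0, 10.0};

    /* expensive, low-ndistinct column first */
    double      procost_b[] = {10.0, 1.0};
    double      ndistinct_b[] = {10.0, 1000.0};

    printf("cheap column first:     %.0f\n",
           estimate_sort_cost(1e6, 2, procost_a, ndistinct_a));
    printf("expensive column first: %.0f\n",
           estimate_sort_cost(1e6, 2, procost_b, ndistinct_b));
    return 0;
}

With these made-up numbers the first ordering comes out about 10x cheaper,
which is exactly the kind of difference we'd need the costing to see
before reordering columns.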

>
> Assuming cost_sort does its thing, it's just a matter of building the
> desired variants and picking the one with the smallest cost_sort. One
> can use heuristics to build a subset of all possible column orderings
> with a guiding principle that tries to maximize the likelihood of
> finding a better order than the one in the query (like sorting by
> ndistinct), but I'd suggest:
>
> - Always considering the sort order verbatim as given in the SQL (ie:
> what the user requests)
> - Relying on cost_sort to distinguish among them, and pick a winner, and
> - When in doubt (if cost_sort doesn't produce a clear winner), use the
> order given by the user
>

I don't see a reason not to generate all possible orderings (keeping only
the cheapest path for each pathkey variant) and let the optimizer sort it
out. If the user specified an explicit ORDER BY, we'll slap an extra Sort
at the end, increasing the cost.
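
To make that concrete, here's a toy sketch of the "generate the orderings
and let the costing decide" part - again not planner code, and
cost_for_order() is just a stand-in for whatever improved cost_sort()
estimate we end up with (the procost/ndistinct numbers are invented):

#include <stdio.h>

#define NCOLS 3

/* hypothetical per-column inputs an improved cost model would look at */
static const double procost[NCOLS] = {1.0, 10.0, 1.0};
static const double ndistinct[NCOLS] = {10.0, 1000.0, 100.0};

/*
 * Stand-in for the improved cost_sort() estimate: per-tuple comparison
 * cost, cheaper when cheap-to-compare, high-ndistinct columns go first.
 */
static double
cost_for_order(const int *order, int ncols)
{
    double      cost = 0.0;
    double      groups = 1.0;
    int         i;

    for (i = 0; i < ncols; i++)
    {
        cost += procost[order[i]] / groups;
        groups *= ndistinct[order[i]];
    }
    return cost;
}

/* recursively try every ordering, remember the cheapest one */
static void
try_orderings(int *order, int k, int ncols, int *best, double *best_cost)
{
    int         i;

    if (k == ncols)
    {
        double      cost = cost_for_order(order, ncols);

        if (cost < *best_cost)
        {
            *best_cost = cost;
            for (i = 0; i < ncols; i++)
                best[i] = order[i];
        }
        return;
    }

    for (i = k; i < ncols; i++)
    {
        int         tmp = order[k];

        order[k] = order[i];
        order[i] = tmp;
        try_orderings(order, k + 1, ncols, best, best_cost);
        tmp = order[k];
        order[k] = order[i];
        order[i] = tmp;
    }
}

int
main(void)
{
    int         order[NCOLS] = {0, 1, 2};
    int         best[NCOLS] = {0, 1, 2};
    double      best_cost = 1e300;

    try_orderings(order, 0, NCOLS, best, &best_cost);

    printf("cheapest ordering: %d %d %d (per-tuple cost %f)\n",
           best[0], best[1], best[2], best_cost);
    return 0;
}

Obviously for longer GROUP BY lists we'd want to prune the n! orderings
somehow, but for the typical handful of columns brute force is cheap.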

> Comparison cost can be approximated probabilistically as:
>
> cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
>
> Where cost_op_n is the cost of the comparison function for column N,
> and ndistinct(col_1_to_n) is an estimation of the number of distinct
> values for columns prior to N in the sort order.
>
> You can approximate ndistinct as the product of ndistinct of previous
> columns, or use extended statistics when available.
>

Sure. The trouble is this also assumes uniform distribution, which is
not always the case.
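
To illustrate with made-up numbers: the probability that two random rows
compare as equal on a column is the sum of the squared value frequencies,
and that equals 1/ndistinct only when the distribution is uniform:

#include <stdio.h>

int
main(void)
{
    int         ndistinct = 1000;
    double      uniform_tie = 1.0 / ndistinct;
    double      skewed_tie;
    int         i;

    /*
     * Skewed case: one value covers 90% of the rows, the remaining 999
     * distinct values share the other 10% evenly.
     */
    skewed_tie = 0.9 * 0.9;
    for (i = 0; i < ndistinct - 1; i++)
    {
        double      p = 0.1 / (ndistinct - 1);

        skewed_tie += p * p;
    }

    printf("tie probability, uniform assumption: %f\n", uniform_tie);
    printf("tie probability, skewed data:        %f\n", skewed_tie);
    return 0;
}

With that kind of skew the following columns get compared several hundred
times more often than the plain ndistinct estimate suggests. MCV lists
would let us correct for the heavy hitters, but only for columns where we
have them.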

> I think that should give a good enough approximation of
> ndistinct-enriched sort costs that considers operator cost
> appropriately. That operator cost is basically an average and can be
> used as a constant, so it's just a matter of passing the right
> comparison_cost to cost_sort.
>
> Even if ndistinct estimates are off, the fact that operator cost is
> involved should be a good enough tool for the user should the
> planner pick the wrong path - it's only a matter of boosting operator
> cost to let the planner know that sorting that way is expensive.
>

There are far too many "should"s in these two paragraphs.

> Prioritization of the user-provided order can be as simple as giving
> that comparison_cost a small handicap.

I see no point in doing that, and I don't recall a single place in the
planner where we do that. If the user specified ORDER BY, we'll slap an
explicit Sort on top when needed (which acts as the handicap, but in a
clear way). Otherwise we don't do such things - it'd be just plain
confusing (consider "ORDER BY a,b" vs. "ORDER BY b,c" with the same data
types, ndistinct, etc. but unexpectedly different costs). Also, what
would be a good value for the handicap?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
