Re: Columns correlation and adaptive query optimization

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Columns correlation and adaptive query optimization
Date: 2020-03-26 14:43:49
Message-ID: e0474c11-cafe-1535-813e-15dae6b48365@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:
>
> +static void
> +AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
> +
>
> This doesn't look like the right place for it, you might want to
> declare it with other functions in the starting of the file.
>
> Also, there is no description about any of the functions here,
> wouldn’t hurt having some more comments there.

Sorry, I will fix it.
Actually this patch contains of two independent parts:
first allows to use auto_explain extension to generate mutlicolumn
statistic for variables used in clauses
for which selectivity estimation gives wrong result. It affects only
auto_explain extension.

Second part allows to use multicolumn statistic for join selectivity
estimation.
As far as I know extended statistic is now actively improved:
https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistic for join selectivity is very
important and should also be addressed.
If my approach is on so good, I will be pleased for other suggestions.

>
> A few of more questions that cross my mind at this point,
>
> - have you tried measuring the extra cost we have to pay for this
> mores statistics , and also compare it with the benefit it gives in
> terms of accuracy.
Adding statistic not always leads to performance improvement but I never
observed any performance degradation caused by presence of extended
statistic.
Definitely we can manually create too many extended statistic entries
for different subsets of columns.
And it certainly increase planning time because optimizer has to
consider more alternatives.
But in practice I never noticed such slowdown.

> - I would also be interested in understanding if there are cases when
> adding this extra step doesn’t help and have you excluded them already
> or if some of them are easily identifiable at this stage...?

Unfortunately there are many cases when extended statistic can not help.
Either because optimizer is not able to use it (for example my patch
consider only cases with strict equality comparison,
but if you use predicate like "a.x=b.x and  a.y in (1,2,3)"  then
extended statistic for <x,y> can not be used.
Either because collected statistic itself is not precise enough ,
especially in case of data skews.

> - is there any limit on the number of columns for which this will
> work, or should there be any such limit...?
>
Right now there is limit for maximal number of columns used in extended
statistic: 8 columns.
But in practice I rarely see join predicates involving more than 3 columns.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2020-03-26 15:04:57 Re: error context for vacuum to include block number
Previous Message Julien Rouhaud 2020-03-26 14:40:09 Re: Patch: to pass query string to pg_plan_query()