Quick Links

Re: Columns correlation and adaptive query optimization

From:	Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To:	Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc:	Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Columns correlation and adaptive query optimization
Date:	2021-01-27 16:39:00
Message-ID:	1220e592-d50d-c015-890c-67c2e8a4e001@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 27.01.2021 8:45, Yugo NAGATA wrote:
> On Mon, 25 Jan 2021 16:27:25 +0300
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>> Hello,
>>
>> Thank you for review.
>> My answers are inside.
> Thank you for updating the patch and answering my questions.
>
>>> (2)
>>> If I understand correctly, your proposal consists of the following two features.
>>>
>>> 1. Add a feature to auto_explain that creates an extended statistic automatically
>>> if an error on estimated rows number is large.
>>>
>>> 2. Improve rows number estimation of join results by considering functional
>>> dependencies between vars in the join qual if the qual has more than one clauses,
>>> and also functional dependencies between a var in the join qual and vars in quals
>>> of the inner/outer relation.
>>>
>>> As you said, these two parts are independent each other, so one feature will work
>>> even if we don't assume the other. I wonder it would be better to split the patch
>>> again, and register them to commitfest separately.
>> I agree with you that this are two almost unrelated changes, although
>> without clausesel patch additional statistic can not improve query planning.
> I think extended statistics created by the auto_explain patch can improve query
> planning even without the clausesel patch. For example, suppose the following case:
>
> postgres=# create table t ( i int, j int);
> CREATE TABLE
> postgres=# insert into t select i/10, i/100 from generate_series(1,1000000) i;
> INSERT 0 1000000
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------
> Seq Scan on t (cost=0.00..19425.00 rows=1 width=8) (actual time=0.254..97.293 rows=10 loops=1)
> Filter: ((i = 100) AND (j = 10))
> Rows Removed by Filter: 999990
> Planning Time: 0.199 ms
> Execution Time: 97.327 ms
> (5 rows)
>
> After applying the auto_explain patch (without clausesel patch) and issuing the query,
> additional statistics were created.
>
> postgres=# select * from t where i = 100 and j = 10;
> LOG: Add statistics t_i_j
>
> Then, after analyze, the row estimation was improved.
>
> postgres=# analyze t;
> ANALYZE
> postgres=# explain analyze select * from t where i = 100 and j = 10;
> postgres=# explain analyze select * from t where i = 100 and j = 10;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------
> Seq Scan on t (cost=0.00..19425.00 rows=10 width=8) (actual time=0.255..95.347 rows=10 loops=1)
> Filter: ((i = 100) AND (j = 10))
> Rows Removed by Filter: 999990
> Planning Time: 0.124 ms
> Execution Time: 95.383 ms
> (5 rows)
>
> So, I think the auto_explain patch is useful with just that as a tool
> to detect a gap between estimate and real and adjust the plan. Also,
> the clausesel patch would be useful without the auto_explain patch
> if an appropriate statistics are created.
>
>> But I already have too many patches at commitfest.
>> May be it will be enough to spit this patch into two?
> Although we can discuss both of these patches in this thread,
> I wonder we don't have to commit them together.
>
>>> (3)
>>> + DefineCustomBoolVariable("auto_explain.suggest_only",
>>> + "Do not create statistic but just record in WAL suggested create statistics statement.",
>>> + NULL,
>>> + &auto_explain_suggest_on
>>>
>>> To imply that this parameter is involving to add_statistics_threshold, it seems
>>> better for me to use more related name like add_statistics_suggest_only.
>>>
>>> Also, additional documentations for new parameters are required.
>> Done.
> +
> + <varlistentry>
> + <term>
> + <varname>auto_explain.auto_explain.add_statistics_threshold</varname> (<type>real</type>)
> + <indexterm>
> + <primary><varname>auto_explain.add_statistics_threshold</varname> configuration parameter</primary>
> + </indexterm>
> + </term>
> + <listitem>
> + <para>
> + <varname>auto_explain.add_statistics_threshold</varname> sets the threshold for
> + actual/estimated #rows ratio triggering creation of multicolumn statistic
> + for the related columns. It can be used for adpative query optimization.
> + If there is large gap between real and estimated number of tuples for the
> + concrete plan node, then multicolumn statistic is created for involved
> + attributes. Zero value (default) disables implicit creation of multicolumn statistic.
> + </para>
> + </listitem>
>
> I wonder we need to say that this parameter has no effect unless log_analyze
> is enabled and that statistics are created only when the excution time exceeds
> log_min_duration, if these behaviors are intentional.
>
> In addition, additional statistics are created only if #rows is over-estimated
> and not if it is under-estimated. Although it seems good as a criterion for creating
> multicolumn statistic since extended statisstic is usually useful to fix over-estimation,
> I am not sure if we don't have to consider under-estimation case at all.
>
>
>>> (9)
>>> Currently, it only consider functional dependencies statistics. Can we also
>>> consider multivariate MCV list, and is it useful?
>>
>> Right now auto_explain create statistic without explicit specification
>> of statistic kind.
>> According to the documentation all supported statistics kinds should be
>> created in this case:
> Yes, auto_explain creates all kinds of extended statistics. However,
> IIUC, the clausesel patch uses only functional dependencies statistics for
> improving join, so my question was about possibility to consider MCV in the
> clausesel patch.
>
>>> (10)
>>> To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
>>> to use auto_explain for getting feedback from actual results. So, could auto_explain
>>> be a infrastructure of AQO in future? Or, do you have any plan or idea to make
>>> built-in infrastructure for AQO?
>> Sorry, I do not have answer for this question.
>> I just patched auto_explain extension because it is doing half of the
>> required work (analyze expensive statements).
>> It can be certainly moved to separate extension. In this case it will
>> party duplicate existed functionality and
>> settings of auto_explain (like statement execution time threshold). I am
>> not sure that it is good.
>> But from the other side, this my patch makes auto_explain extension to
>> do some unexpected work...
> I think that auto_explain is an extension originally for aiming to detect
> and log plans that take a long time, so it doesn't seem so unnatural for
> me to use this for improving such plans. Especially, the feature to find
> tunable points in executed plans seems useful.
>
>> Actually task of adaptive query optimization is much bigger.
>> We have separate AQO extension which tries to use machine learning to
>> correctly adjust estimations.
>> This my patch is much simpler and use existed mechanism (extended
>> statistics) to improve estimations.
> Well, this patch provide a kind of AQO as auto_explain feature, but this
> is independent of the AQO extension. Is it right?
> Anyway, I'm interested in the AQO extension, so I'll look into this, too.
>
> Regards,
> Yugo Nagata
>

I have updated documentation as you suggested and submit patch for
auto_explain extension for the next commitfest.
I will create separate thread for improving join selectivity estimation
using extended statistics.

> Well, this patch provide a kind of AQO as auto_explain feature, but this
> is independent of the AQO extension. Is it right?
Yes. The basic idea is the same, but approaches are different.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
auto_explain_create_statistic-8.patch	text/x-patch	14.0 KB

In response to

Re: Columns correlation and adaptive query optimization at 2021-01-27 05:45:17 from Yugo NAGATA

Responses

Re: Columns correlation and adaptive query optimization at 2021-03-10 02:00:25 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2021-01-27 16:41:36	Re: [PoC] Non-volatile WAL buffer
Previous Message	Anastasia Lubennikova	2021-01-27 16:32:42	Re: pg_upgrade fails with non-standard ACL