Re: Columns correlation and adaptive query optimization

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Columns correlation and adaptive query optimization
Date: 2021-01-25 13:27:25
Message-ID: 318656e2-4d5d-2865-f1ba-a5c705327ac8@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Thank you for review.
My answers are inside.

On 21.01.2021 15:30, Yugo NAGATA wrote:
> Hello,
>
> On Thu, 26 Mar 2020 18:49:51 +0300
> Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>> Attached please find new version of the patch with more comments and
>> descriptions added.
> Adaptive query optimization is very interesting feature for me, so I looked
> into this patch. Here are some comments and questions.
>
> (1)
> This patch needs rebase because clauselist_selectivity was modified to improve
> estimation of OR clauses.

Rebased version is attached.
>
> (2)
> If I understand correctly, your proposal consists of the following two features.
>
> 1. Add a feature to auto_explain that creates an extended statistic automatically
> if an error on estimated rows number is large.
>
> 2. Improve rows number estimation of join results by considering functional
> dependencies between vars in the join qual if the qual has more than one clauses,
> and also functional dependencies between a var in the join qual and vars in quals
> of the inner/outer relation.
>
> As you said, these two parts are independent each other, so one feature will work
> even if we don't assume the other. I wonder it would be better to split the patch
> again, and register them to commitfest separately.

I agree with you that this are two almost unrelated changes, although
without clausesel patch additional statistic can not improve query planning.
But I already have too many patches at commitfest.
May be it will be enough to spit this patch into two?

>
> (3)
> + DefineCustomBoolVariable("auto_explain.suggest_only",
> + "Do not create statistic but just record in WAL suggested create statistics statement.",
> + NULL,
> + &auto_explain_suggest_on
>
> To imply that this parameter is involving to add_statistics_threshold, it seems
> better for me to use more related name like add_statistics_suggest_only.
>
> Also, additional documentations for new parameters are required.

Done.

>
> (4)
> + /*
> + * Prevent concurrent access to extended statistic table
> + */
> + stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
> + slot = table_slot_create(stat_rel, NULL);
> + scan = table_beginscan_catalog(stat_rel, 2, entry);
> (snip)
> + table_close(stat_rel, AccessExclusiveLock);
> + }
>
> When I tested the auto_explain part, I got the following WARNING.
>
> WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 2)
> WARNING: buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, flags=0x83000000, refcount=1 1)
> WARNING: relcache reference leak: relation "pg_statistic_ext" not closed
> WARNING: TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still referenced
> WARNING: Snapshot reference leak: Snapshot 0x55c332c10418 still referenced
>
> To suppress this, I think we need table_endscan(scan) and
> ExecDropSingleTupleTableSlot(slot) before finishing this function.

Thank you for noticing the problem, fixed.

>
> (6)
> + elog(NOTICE, "Auto_explain suggestion: CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);
>
> We should use ereport instead of elog for log messages.

Changed.

>
> (7)
> + double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
> + if (dep != 0.0)
> + {
> + s1 *= dep + (1 - dep) * s2;
> + continue;
> + }
>
> I found the following comment of clauselist_apply_dependencies():
>
> * we actually combine selectivities using the formula
> *
> * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
>
> so, is it not necessary using the same formula in this patch? That is,
>
> s1 *= dep + (1-dep) * s2 (if s1 <= s2)
> s1 *= dep * (s2/s1) + (1-dep) * s2 (otherwise) .
Makes sense.

> (8)
> +/*
> + * Try to find dependency between variables.
> + * var: varaibles which dependencies are considered
> + * join_vars: list of variables used in other clauses
> + * This functions return strongest dependency and some subset of variables from the same relation
> + * or 0.0 if no dependency was found.
> + */
> +static double
> +var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
> +{
>
> The comment mentions join_vars but the actual argument name is clauses_vars,
> so it needs unification.

Fixed.

>
> (9)
> Currently, it only consider functional dependencies statistics. Can we also
> consider multivariate MCV list, and is it useful?

Right now auto_explain create statistic without explicit specification
of statistic kind.
According to the documentation all supported statistics kinds should be
created in this case:

/|statistics_kind|/

A statistics kind to be computed in this statistics object.
Currently supported kinds are |ndistinct|, which enables n-distinct
statistics, and |dependencies|, which enables functional dependency
statistics. If this clause is omitted, all supported statistics
kinds are included in the statistics object. For more information,
see Section 14.2.2
<https://www.postgresql.org/docs/10/planner-stats.html#PLANNER-STATS-EXTENDED>
and Section 68.2
<https://www.postgresql.org/docs/10/multivariate-statistics-examples.html>.

>
> (10)
> To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
> to use auto_explain for getting feedback from actual results. So, could auto_explain
> be a infrastructure of AQO in future? Or, do you have any plan or idea to make
> built-in infrastructure for AQO?
Sorry, I do not have answer for this question.
I just patched auto_explain extension because it is doing  half of the
required work (analyze  expensive statements).
It can be certainly moved to separate extension. In this case it will
party duplicate existed functionality and
settings of auto_explain (like statement execution time threshold). I am
not sure that it is good.
But from the other side, this my patch makes auto_explain extension to
do some unexpected work...

Actually task of adaptive query optimization is much bigger.
We have separate AQO extension which tries to use machine learning to
correctly adjust estimations.
This my patch is much simpler and use existed mechanism (extended
statistics) to improve estimations.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
auto_explain_create_statistic-7.patch text/x-patch 13.7 KB
extended_statistic_join_selectivity-7.patch text/x-patch 7.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Guillaume Lelarge 2021-01-25 13:34:02 Re: Extensions not dumped when --schema is used
Previous Message Greg Nancarrow 2021-01-25 13:14:14 Re: Parallel INSERT (INTO ... SELECT ...)