Re: Columns correlation and adaptive query optimization

From: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Zhihong Yu <zyu(at)yugabyte(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Columns correlation and adaptive query optimization
Date: 2021-03-22 10:29:36
Message-ID: 20210322192936.8753103766ab20059f1bb7cf@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Konstantin,

I tested this patch as a statistics advisor using TPC-H queries.
The used parameters are:

auto_explain.add_statistics_suggest_only = on
auto_explain.add_statistics_threshold = 0.1
auto_explain.log_analyze = on
auto_explain.log_min_duration = 0

auto_explain suggested to create a few extented statistics for some
queries, but I could not find performance improvement with the "join
selectivity estimation using extended statistics" patch. I think this
is because there are no such correlation in TPC-H dataset that the
extended statistics can help, though.

During this test, I came up with a additional comments.

1)
As found in TPC-H test, suggested extended statistics may not be useful
for improving performance. Therefore, to decide to adopt it or not, it
would be useful if we could get information about "why we require it" or
"why this is suggested" as DETAIL or HINT. For example, we may show a
line or snippet of EXPLAIN result as the reason of the suggestion.

2)
For Q21 of TPC-H, the following extended statistics were suggested.

NOTICE: Auto_explain suggestion: CREATE STATISTICS lineitem_l_commitdate_l_receiptdate ON l_commitdate, l_receiptdate FROM lineitem
NOTICE: Auto_explain suggestion: CREATE STATISTICS lineitem_l_commitdate_l_receiptdate_l_suppkey ON l_commitdate, l_receiptdate, l_suppkey FROM lineitem

The latter's target columns includes the former's, so I am not sure
we need both of them. (Which we should adopt may be up to on administrator,
though.)

3)
For Q22 of TPC-H, the following two same extended statistics were suggested.

NOTICE: Auto_explain suggestion: CREATE STATISTICS customer_c_acctbal_c_phone ON c_acctbal, c_phone FROM customer
NOTICE: Auto_explain suggestion: CREATE STATISTICS customer_c_acctbal_c_phone ON c_acctbal, c_phone FROM customer

So, when we set add_statistics_suggest_only to off, we get the following error:

ERROR: duplicate key value violates unique constraint "pg_statistic_ext_name_index"
DETAIL: Key (stxname, stxnamespace)=(customer_c_acctbal_c_phone, 2200) already exists.

Regards,
Yugo Nagata

On Sat, 20 Mar 2021 12:41:44 +0300
Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

>
>
> On 19.03.2021 20:32, Zhihong Yu wrote:
> > Hi,
> > In AddMultiColumnStatisticsForQual(),
> >
> > +   /* Loop until we considered all vars */
> > +   while (vars != NULL)
> > ...
> > +       /* Contruct list of unique vars */
> > +       foreach (cell, vars)
> >
> > What if some cell / node, gets into the else block:
> >
> > +               else
> > +               {
> > +                   continue;
> >
> > and being left in vars. Is there a chance for infinite loop ?
> > It seems there should be a bool variable indicating whether any cell
> > gets to the following:
> >
> > +           vars = foreach_delete_current(vars, cell);
> >
> > If no cell gets removed in the current iteration, the outer while loop
> > should exit.
>
> Each iteration of outer loop (while (vars != NULL))
> process variables belonging to one relation.
> We take "else continue" branch only if variable belongs to some other
> relation.
> At first iteration of foreach (cell, vars)
> variable "cols" is NULL and we always take first branch of the if.
> In other words, at each iteration of outer loop we always make some
> progress in processing "vars" list and remove some elements
> from this list. So infinite loop can never happen.
>
> >
> > Cheers
> >
> > On Fri, Mar 19, 2021 at 9:58 AM Konstantin Knizhnik
> > <k(dot)knizhnik(at)postgrespro(dot)ru <mailto:k(dot)knizhnik(at)postgrespro(dot)ru>> wrote:
> >
> >
> >
> > On 19.03.2021 12:17, Yugo NAGATA wrote:
> > > On Wed, 10 Mar 2021 03:00:25 +0100
> > > Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com
> > <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>> wrote:
> > >
> > >> What is being proposed here - an extension suggesting which
> > statistics
> > >> to create (and possibly creating them automatically) is certainly
> > >> useful, but I'm not sure I'd call it "adaptive query
> > optimization". I
> > >> think "adaptive" means the extension directly modifies the
> > estimates
> > >> based on past executions. So I propose calling it maybe "statistics
> > >> advisor" or something like that.
> > > I am also agree with the idea to implement this feature as a new
> > > extension for statistics advisor.
> > >
> > >> BTW Why is "qual" in
> > >>
> > >>    static void
> > >>    AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
> > >>
> > >> declared as "void *"? Shouldn't that be "List *"?
> > > When I tested this extension using TPC-H queries, it raised
> > segmentation
> > > fault in this function. I think the cause would be around this
> > argument.
> > >
> > > Regards,
> > > Yugo Nagata
> > >
> > Attached please find new version of the patch with
> > AddMultiColumnStatisticsForQual parameter type fix and one more fix
> > related with handling synthetic attributes.
> > I can not reproduce the crash on TPC-H queries, so if the problem
> > persists, can you please send me stack trace and may be some other
> > information helping to understand the reason of SIGSEGV?
> >
> > Thanks in advance,
> > Konstantin
> >
>

--
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message ikedamsh 2021-03-22 11:25:45 Re: About to add WAL write/fsync statistics to pg_stat_wal view
Previous Message Greg Nancarrow 2021-03-22 10:27:03 Re: Parallel INSERT (INTO ... SELECT ...)