Re: PoC/WIP: Extended statistics on expressions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-03-17 15:55:41
Message-ID: CAEZATCUooH7uPsPkMJEyf-KY_3Om7TTVo7Fo31JiXXgGCb=97g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, 7 Mar 2021 at 21:10, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> 2) ndistinct
>
> There's one thing that's bugging me, in how we handle "partial" matches.
> For each expression we track both the original expression and the Vars
> we extract from it. If we can't find a statistics matching the whole
> expression, we try to match those individual Vars, and we remove the
> matching ones from the list. And in the end we multiply the estimates
> for the remaining Vars.
>
> This works fine with one matching ndistinct statistics. Consider for example
>
> GROUP BY (a+b), (c+d)
>
> with statistics on [(a+b),c] - that is, expression and one column.

I've just been going over this example, and I think it actually works
slightly differently from how you described, though I haven't worked
out the full general implications of that.

> We parse the expressions into two GroupExprInfo
>
> {expr: (a+b), vars: [a, b]}
> {expr: (c+d), vars: [c, d]}
>

Here, I think what you actually get, in the presence of stats on
[(a+b),c] is actually the following two GroupExprInfos:

{expr: (a+b), vars: []}
{expr: (c+d), vars: [c, d]}

because of the code immediately after this comment in estimate_num_groups():

/*
* If examine_variable is able to deduce anything about the GROUP BY
* expression, treat it as a single variable even if it's really more
* complicated.
*/

As it happens, that makes no difference in this case, where there is
just this one stats object, but it does change things when there are
two stats objects.

> and the statistics matches the first item exactly (the expression). The
> second expression is not in the statistics, but we match "c". So we end
> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>
> {expr: (c+d), vars: [d]}

Right.

> Without any other statistics we estimate that as ndistinct for "d", so
> we end up with
>
> ndistinct((a+b), c) * ndistinct(d)
>
> which mostly makes sense. It assumes ndistinct(c+d) is product of the
> ndistinct estimates, but that's kinda what we've been always doing.

Yes, that appears to be what happens, and it's probably the best that
can be done with the available stats.

> But now consider we have another statistics on just (c+d). In the second
> loop we end up matching this expression exactly, so we end up with
>
> ndistinct((a+b), c) * ndistinct((c+d))

In this case, with stats on (c+d) as well, the two GroupExprInfos
built at the start change to:

{expr: (a+b), vars: []}
{expr: (c+d), vars: []}

so it actually ends up not using any multivariate stats, but instead
uses the two univariate expression stats, giving

ndistinct((a+b)) * ndistinct((c+d))

which actually seems pretty good, and gave very good estimates in the
simple test case I tried.

> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
> what we should do after the first loop is just discarding the whole
> expression and "expand" into per-variable GroupExprInfo, so in the
> second step we would not match the (c+d) statistics.

Not using the (c+d) stats would give either

ndistinct((a+b)) * ndistinct(c) * ndistinct(d)

or

ndistinct((a+b), c) * ndistinct(d)

depending on exactly how the algorithm was changed. In my test, these
both gave worse estimates, but there are probably other datasets for
which they might do better. It all depends on how much correlation
there is between (a+b) and c.

I suspect that there is no optimal strategy for handling overlapping
stats that works best for all datasets, but the current algorithm
seems to do a pretty decent job.

> Of course, maybe there's a better way to pick the statistics, but I
> think our conclusion so far was that people should just create
> statistics covering all the columns in the query, to not have to match
> multiple statistics like this.

Yes, I think that's always likely to work better, especially for
ndistinct stats, where all possible permutations of subsets of the
columns are included, so a single ndistinct stat can work well for a
range of different queries.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-03-17 16:01:38 Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Previous Message Bruce Momjian 2021-03-17 15:40:23 Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?