Re: using extended statistics to improve join estimates

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: using extended statistics to improve join estimates
Date: 2021-03-31 23:47:14
Message-ID: CALNJ-vSOptX3bF_gDE+fgb0=3FFDRAnP5QpCq7cW-cw5ua0J4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

+ * has_matching_mcv
+ * Check whether the list contains statistic of a given kind

The method name is find_matching_mcv(). It seems the method initially
returned bool but later the return type was changed.

+ StatisticExtInfo *found = NULL;

found normally is associated with bool return value. Maybe call the
variable matching_mcv or something similar.

+ /* skip items eliminated by restrictions on rel2 */
+ if (conditions2 && !conditions2[j])
+ continue;

Maybe you can add a counter recording the number of non-skipped items for
the inner loop. If this counter is 0 after the completion of one iteration,
we come out of the outer loop directly.

Cheers

On Wed, Mar 31, 2021 at 10:36 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> So far the extended statistics are applied only at scan level, i.e. when
> estimating selectivity for individual tables. Which is great, but joins
> are a known challenge, so let's try doing something about it ...
>
> Konstantin Knizhnik posted a patch [1] using functional dependencies to
> improve join estimates in January. It's an interesting approach, but as
> I explained in that thread I think we should try a different approach,
> similar to how we use MCV lists without extended stats. We'll probably
> end up considering functional dependencies too, but probably only as a
> fallback (similar to what we do for single-relation estimates).
>
> This is a PoC demonstrating the approach I envisioned. It's incomplete
> and has various limitations:
>
> - no expression support, just plain attribute references
> - only equality conditions
> - requires MCV lists on both sides
> - inner joins only
>
> All of this can / should be relaxed later, of course. But for a PoC this
> seems sufficient.
>
> The basic principle is fairly simple, and mimics what eqjoinsel_inner
> does. Assume we have a query:
>
> SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)
>
> If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
> same logic as eqjoinsel_inner and "match" them together. If the MCV list
> is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
> the general idea is the same.
>
> To demonstrate this, consider a very simple example with a table that
> has a lot of dependency between the columns:
>
> ==================================================================
>
> CREATE TABLE t (a INT, b INT, c INT, d INT);
> INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
> FROM generate_series(1,100000) s(i);
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> ALTER STATISTICS s SET STATISTICS 10000;
> ANALYZE t;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
>
> ==================================================================
>
> The results look like this:
>
> - actual rows: 100000000
> - estimated (no stats): 1003638
> - estimated (stats, 100): 100247844
> - estimated (stats, 10k): 100000000
>
> So, in this case the extended stats help quite a bit, even with the
> default statistics target.
>
> However, there are other things we can do. For example, we can use
> restrictions (at relation level) as "conditions" to filter the MCV lits,
> and calculate conditional probability. This is useful even if there's
> just a single join condition (on one column), but there are dependencies
> between that column and the other filters. Or maybe when there are
> filters between conditions on the two sides.
>
> Consider for example these two queries:
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
> WHERE t1.c < 25 AND t2.d < 25;
>
> SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
> WHERE t1.c < 25 AND t2.d > 75;
>
> In this particular case we know that (a = b = c = d) so the two filters
> are somewhat redundant. The regular estimates will ignore that, but with
> MCV we can actually detect that - when we combine the two MCV lists, we
> essentially calculate MCV (a,b,t1.c,t2.d), and use that.
>
> Q1 Q2
> - actual rows: 25000000 0
> - estimated (no stats): 62158 60241
> - estimated (stats, 100): 25047900 1
> - estimated (stats, 10k): 25000000 1
>
> Obviously, the accuracy depends on how representative the MCV list is
> (what fraction of the data it represents), and in this case it works
> fairly nicely. A lot of the future work will be about handling cases
> when it represents only part of the data.
>
> The attached PoC patch has a number of FIXME and XXX, describing stuff I
> ignored to keep it simple, possible future improvement. And so on.
>
>
> regards
>
>
> [1]
>
> https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc1b2(at)postgrespro(dot)ru
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-03-31 23:49:49 Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Previous Message 'alvherre@alvh.no-ip.org' 2021-03-31 23:20:55 Re: libpq debug log