Re: using extended statistics to improve join estimates

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: using extended statistics to improve join estimates
Date: 2021-10-06 21:03:02
Message-ID: CALNJ-vRTL-5qjYrNEDkS=xg_mbDT6y98XGgD8jrJzcPLRSMSuw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 6, 2021 at 12:33 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> attached is an improved version of this patch, addressing some of the
> points mentioned in my last message:
>
> 1) Adds a couple regression tests, testing various join cases with
> expressions, additional conditions, etc.
>
> 2) Adds support for expressions, so the join clauses don't need to
> reference just simple columns. For example, this query can benefit from
> extended statistics defined on the expressions:
>
> -- CREATE STATISTICS s1 ON (a+1), b FROM t1;
> -- CREATE STATISTICS s2 ON (a+1), b FROM t2;
>
> SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);
>
> 3) Can combine extended statistics and regular (per-column) statistics.
> The previous version required extended statistics MCV on both sides of
> the join, but adding extended statistics on both sides may be impractical
> (e.g. if one side does not have correlated columns, it's silly to have to
> add them just to make this patch happy).
>
> For example you may have extended stats on the dimension table but not
> the fact table, and the patch can still combine the two. Of course, if
> there's no MCV on either side, we can't do much.
>
> So this patch works when both sides have extended statistics MCV, or
> when one side has extended MCV and the other side regular MCV. It might
> seem silly, but the extended MCV allows considering additional baserel
> conditions (if there are any).
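
To make the one-sided case concrete, here is a minimal sketch of what I
understand the above to mean (hypothetical dim/fact tables and statistics
names, not taken from the patch or its tests):

  -- extended statistics exist only on the dimension side
  create table dim  (a int, b int, label text);
  create table fact (a int, b int, payload text);

  create statistics dim_stats on a, b from dim;
  analyze dim, fact;

  -- per the description, the extended MCV on dim can be combined with the
  -- regular per-column MCVs on fact when estimating this join
  explain select * from fact join dim on (fact.a = dim.a and fact.b = dim.b);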
>
>
> examples
> ========
>
> The tables / data are very simple, but hopefully good enough for some
> simple examples.
>
> create table t1 (a int, b int, c int);
> create table t2 (a int, b int, c int);
>
> insert into t1 select mod(i,50), mod(i,50), mod(i,50)
> from generate_series(1,1000) s(i);
>
> insert into t2 select mod(i,50), mod(i,50), mod(i,50)
> from generate_series(1,1000) s(i);
>
> analyze t1, t2;
>
> First, without extended stats (just the first line of explain analyze,
> to keep the message short):
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=31.00..106.00 rows=400 width=24)
> (actual time=5.426..22.678 rows=20000 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=28.50..160.75 rows=10000 width=24)
> (actual time=5.325..19.760 rows=10000 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
> 25 and t2.c > 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=24.50..104.75 rows=4800 width=24)
> (actual time=5.618..5.639 rows=0 loops=1)
>
>
> Now, let's create statistics:
>
> create statistics s1 on a, b, c from t1 ;
> create statistics s2 on a, b, c from t2 ;
> analyze t1, t2;
>
> and now the same queries again:
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=31.00..106.00 rows=20000 width=24)
> (actual time=5.448..22.713 rows=20000 loops=1)
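
If I am reading the numbers right, the improvement in this first query is
exactly the correlation effect: both tables have 1000 rows and 50 distinct
values per column, so the independence assumption gives
1000 * 1000 * (1/50) * (1/50) = 400 rows, whereas a and b are perfectly
correlated and the join really produces 50 groups of 20 x 20 rows = 20,000,
which the MCV-based estimate now matches.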
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=28.50..160.75 rows=10000 width=24)
> (actual time=5.317..16.680 rows=10000 loops=1)
>
>
> explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c <
> 25 and t2.c > 25;
>
> QUERY PLAN
> ------------------------------------------------------------------------
> Hash Join (cost=24.50..104.75 rows=1 width=24)
> (actual time=5.647..5.667 rows=0 loops=1)
>
>
> Those examples are a bit simplistic, but the improvements are fairly
> clear I think.
>
>
> limitations & open issues
> =========================
>
> Let's talk about the main restrictions and open issues in the current
> patch that I can think of at the moment.
>
> 1) statistics covering all join clauses
>
> The patch requires the statistics to cover all the join clauses, mostly
> because it simplifies the implementation. This means that to use the
> per-column statistics, there has to be just a single join clause.
>
> AFAICS this could be relaxed and we could use multiple statistics to
> estimate the clauses. But it'd make selection of statistics much more
> complicated, because we'd have to pick "matching" statistics on both sides
> of the join. So it seems like overkill, and most joins have very few
> conditions anyway.
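
For instance, if I follow this restriction correctly, something like the
sketch below applies (hypothetical statistics objects; assume for the sake
of the sketch that only these exist on fresh t1/t2 tables): the first join
is covered, while the second would fall back to the existing estimate.

  -- statistics covering only (a, b)
  create statistics s1_ab on a, b from t1;
  create statistics s2_ab on a, b from t2;
  analyze t1, t2;

  -- covered: every join clause references columns in the statistics
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

  -- not covered: t1.c = t2.c is outside the statistics, so (as I read it)
  -- the extended-statistics path is not used for this join
  select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b and t1.c = t2.c);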
>
>
> 2) only equality join clauses
>
> The other restriction is that at the moment this only supports simple
> equality clauses, combined with AND. So, for example, this is supported:
>
> t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))
>
> while these are not:
>
> t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))
>
> t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))
>
> t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))
>
> I'm not entirely sure these restrictions can be relaxed. It's not that
> difficult to evaluate these cases when matching items between the MCV
> lists, similarly to how we evaluate bitmaps for baserel estimates.
>
> But I'm not sure what to do about the part not covered by the MCV lists.
> The eqjoinsel() approach uses ndistinct estimates for that, but that
> only works for AND clauses, I think. How would that work for OR?
>
> Similarly, I'm not sure we can do much for non-equality conditions, but
> those are currently estimated using a default selectivity in selfuncs.c.
>
>
> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
> explain analyze
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is split
> this into (t1,t2) and (t2,t3) and estimate those. I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.
>
> But I haven't done that yet, as it requires the MCVs to be combinable
> using the join clauses (to overlap, in a way), and I'm not sure how likely
> that is in practice. In the example above it could help, but that's a bit
> of an artificial example.
>
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated - mostly a matter of
> copying and tweaking what eqjoinsel() does.
>
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,

+        conditions2 = statext_determine_join_restrictions(root, rel, mcv);
+
+        /* if the new statistics covers more conditions, use it */
+        if (list_length(conditions2) > list_length(conditions1))
+        {
+            mcv = stat;

It seems conditions2 is calculated using mcv; I wonder why mcv is then
replaced by stat (given that conditions1, which it is compared against, is
the shorter list)?

Cheers
