Re: using extended statistics to improve join estimates

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: using extended statistics to improve join estimates
Date: 2021-10-06 19:33:03
Message-ID: ff412e82-5427-3b72-cf32-4700156dc47c@enterprisedb.com

Hi,

attached is an improved version of this patch, addressing some of the
points mentioned in my last message:

1) Adds a couple of regression tests, testing various join cases with
expressions, additional conditions, etc.

2) Adds support for expressions, so the join clauses are not limited to
simple column references. For example, a query like this can benefit
from extended statistics defined on the expressions:

-- CREATE STATISTICS s1 ON (a+1), b FROM t1;
-- CREATE STATISTICS s2 ON (a+1), b FROM t2;

SELECT * FROM t1 JOIN t2 ON ((t1.a + 1) = (t2.a + 1) AND t1.b = t2.b);

3) Can combine extended statistics and regular (per-column) statistics.
The previous version required an extended statistics MCV on both sides of
the join, but adding extended statistics on both sides may be impractical
(e.g. if one side does not have correlated columns, it's silly to have to
add them just to make this patch happy).

For example, you may have extended stats on the dimension table but not
on the fact table, and the patch can still combine the two. Of course, if
there's no MCV on either side, we can't do much.

So this patch works when both sides have an extended statistics MCV, or
when one side has an extended MCV and the other a regular per-column
MCV. It might seem silly, but the extended MCV allows considering
additional baserel conditions (if there are any).
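
For illustration, a minimal sketch of that situation (the fact/dim
tables and the dim_stats name are made up for this example):

CREATE TABLE dim (a int, b int);
CREATE TABLE fact (a int, c int);

-- extended statistics (with MCV) only on the dimension side; the
-- fact side only gets the regular per-column MCV from ANALYZE
CREATE STATISTICS dim_stats ON a, b FROM dim;
ANALYZE dim, fact;

-- single join clause, so the regular MCV on "fact" is enough on that
-- side, while the extended MCV on "dim" also lets the estimate
-- consider the extra baserel condition on dim.b
SELECT * FROM fact JOIN dim ON (fact.a = dim.a) WHERE dim.b < 10;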

examples
========

The tables and data are very simple, but hopefully good enough for a
few basic examples.

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

insert into t2 select mod(i,50), mod(i,50), mod(i,50)
from generate_series(1,1000) s(i);

analyze t1, t2;

First, without extended stats (just the first line of explain analyze,
to keep the message short):

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=400 width=24)
(actual time=5.426..22.678 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.325..19.760 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a)
  where t1.c < 25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=4800 width=24)
(actual time=5.618..5.639 rows=0 loops=1)

Now, let's create statistics:

create statistics s1 on a, b, c from t1;
create statistics s2 on a, b, c from t2;
analyze t1, t2;
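
A quick sanity check that the MCV lists actually got built (the exact
contents don't matter here) is to look at the pg_stats_ext view:

select statistics_name, most_common_vals, most_common_freqs
  from pg_stats_ext
 where statistics_name in ('s1', 's2');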

and now the same queries again:

explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=31.00..106.00 rows=20000 width=24)
(actual time=5.448..22.713 rows=20000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a) where t1.c < 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=28.50..160.75 rows=10000 width=24)
(actual time=5.317..16.680 rows=10000 loops=1)

explain analyze select * from t1 join t2 on (t1.a = t2.a)
  where t1.c < 25 and t2.c > 25;

QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=24.50..104.75 rows=1 width=24)
(actual time=5.647..5.667 rows=0 loops=1)

Those examples are a bit simplistic, but I think the improvements are
fairly clear.

limitations & open issues
=========================

Let's talk about the main restrictions and open issues in the current
patch that I can think of at the moment.

1) statistics covering all join clauses

The patch requires the statistics to cover all the join clauses, mostly
because it simplifies the implementation. This means that to use the
per-column statistics, there has to be just a single join clause.

AFAICS this could be relaxed and we could use multiple statistics to
estimate the clauses. But it'd make the selection of statistics much
more complicated, because we'd have to pick "matching" statistics on
both sides of the join. So it seems like overkill, and most joins have
very few conditions anyway.
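
Roughly speaking, with the same t1/t2 tables as in the examples above,
that means:

-- a single join clause can use the regular per-column MCVs
SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a);

-- two join clauses can no longer rely on per-column statistics; the
-- statistics used here need to be extended MCVs covering both columns
SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);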

2) only equality join clauses

The other restriction is that at the moment this only supports simple
equality clauses, combined with AND. So for example this is supported:
t1 JOIN t2 ON ((t1.a = t2.a) AND (t1.b + 2 = t2.b + 1))

while these are not:

t1 JOIN t2 ON ((t1.a = t2.a) OR (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a - t2.a = 0) AND (t1.b + 2 = t2.b + 1))

t1 JOIN t2 ON ((t1.a = t2.a) AND ((t1.b = t2.b) OR (t1.c = t2.c)))

I'm not entirely sure these restrictions can be relaxed. It's not that
difficult to evaluate these cases when matching items between the MCV
lists, similarly to how we evaluate bitmaps for baserel estimates.

But I'm not sure what to do about the part not covered by the MCV lists.
The eqjoinsel() approach uses ndistinct estimates for that, but that
only works for AND clauses, I think. How would that work for OR?

Similarly, I'm not sure we can do much for non-equality conditions,
which are currently estimated using the default selectivity in
selfuncs.c.

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

explain analyze
select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but the second join
actually combines (t1,t2,t3). What the patch currently does is split it
into the pairs (t1,t3) and (t2,t3) and estimate those. I wonder if this
should instead combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.

But I haven't done that yet, as it requires the MCVs to overlap in a
way that allows combining them using the join clauses, and I'm not sure
how likely that is in practice. It could help in the example above, but
that example is a bit artificial.

4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. yet.
Adding outer and semi joins should not be complicated; it's mostly a
matter of copying and tweaking what eqjoinsel() does.
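
Just to illustrate the current scope (using the same t1/t2 tables as
above), query shapes like these still go through the existing
eqjoinsel() estimation, i.e. the patch does not affect them yet:

-- outer join
select * from t1 left join t2 on (t1.a = t2.a and t1.b = t2.b);

-- semi join
select * from t1 where exists
  (select 1 from t2 where t2.a = t1.a and t2.b = t1.b);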

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: 0001-Estimate-joins-using-extended-statistics-20211006.patch (text/x-patch, 67.3 KB)
