PATCH: use foreign keys to improve join estimates v1

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: use foreign keys to improve join estimates v1
Date: 2015-04-07 01:41:45
Message-ID: 552335D9.3090707@2ndquadrant.com
Lists: pgsql-hackers

Hi,

attached is a first version of a patch that aims to improve the cardinality
estimates of joins by matching foreign keys between the joined tables
(information the planner has ignored until now).

This significantly improves the estimates when two tables are joined on a
multi-column condition that matches a foreign key between them.

Consider, for example, this simple case:

CREATE TABLE dim (a int, b int, primary key (a,b));
CREATE TABLE fact (a int, b int, foreign key (a,b) references dim(a,b));

INSERT INTO dim SELECT i,i FROM generate_series(1,1000000) s(i);
INSERT INTO fact SELECT i,i FROM generate_series(1,1000000) s(i);

ANALYZE dim;
ANALYZE fact;

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=29425.00..51350.01 rows=1 width=16)
   Hash Cond: ((f.a = d.a) AND (f.b = d.b))
   ->  Seq Scan on fact f  (cost=0.00..14425.00 rows=1000000 width=8)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
         ->  Seq Scan on dim d  (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

which is of course completely off - the query actually produces 1M rows.
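
For the record, the actual cardinality is easy to verify:

SELECT COUNT(*) FROM fact f JOIN dim d USING (a,b);  -- returns 1000000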

In practice, underestimates like this often cause much more serious issues
in the subsequent planning steps - for example, the next join would probably
be executed as a nested loop, which makes sense with a single row but is an
awful choice with 1M rows.
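
To illustrate the shape of the problem - other_fact here is just a
hypothetical third table, not part of the example above:

CREATE TABLE other_fact (a int, b int,
                         foreign key (a,b) references dim (a,b));

EXPLAIN SELECT * FROM fact f
                 JOIN dim d USING (a,b)
                 JOIN other_fact o USING (a,b);

With the rows=1 estimate for the first join, joining other_fact on top of
it looks like an ideal nested loop candidate - and that choice turns out
terrible once the first join actually produces 1M rows.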

With the patch, the planner realizes there is a matching foreign key and
tweaks the selectivities in calc_joinrel_size_estimate() - the FK guarantees
that each (non-NULL) row in the referencing table matches exactly one row in
the referenced table, so the join should produce about rows(fact) rows, not
the tiny product of per-column selectivities:

QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=29426.25..250323877.62 rows=1000050 width=8)
   Hash Cond: ((fact.a = dim.a) AND (fact.b = dim.b))
   ->  Seq Scan on fact  (cost=0.00..14425.50 rows=1000050 width=8)
   ->  Hash  (cost=14425.50..14425.50 rows=1000050 width=8)
         ->  Seq Scan on dim  (cost=0.00..14425.50 rows=1000050 width=8)
(5 rows)

There are a few unsolved questions/issues:

(1) The current patch only does the trick when the FK matches the join
conditions perfectly - i.e. when no FK column is left uncovered by a
join condition.

I believe this might be relaxed in both directions. When the conditions
don't cover all the FK columns, we still know there's at least one
matching row (and we can estimate the number of matches). In the other
direction, when there are conditions on top of the FK columns, we can
estimate just those 'extra' conditions.
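
For illustration, using the example tables above, a query covering only
one of the two FK columns (which the current patch does not handle) would
be something like

EXPLAIN SELECT * FROM fact f JOIN dim d ON (f.a = d.a);

and the opposite case would be a join condition covering the whole FK plus
some extra clause - say, on a hypothetical extra column 'c' in both tables:

EXPLAIN SELECT * FROM fact f JOIN dim d ON (f.a = d.a AND f.b = d.b
                                        AND f.c < d.c);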

(2) Adding further conditions to the query may break the estimates again -
for example, after adding "WHERE d.a = d.b" this happens.
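
The full query, for reference:

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b) WHERE d.a = d.b;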

QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=16987.50..33931.50 rows=25 width=8)
   Hash Cond: (f.a = d.a)
   ->  Seq Scan on fact f  (cost=0.00..16925.00 rows=5000 width=8)
         Filter: (a = b)
   ->  Hash  (cost=16925.00..16925.00 rows=5000 width=8)
         ->  Seq Scan on dim d  (cost=0.00..16925.00 rows=5000 width=8)
               Filter: (a = b)
(7 rows)

One issue is that the "a=b" condition is poorly estimated due to the
correlation between the columns (which might be improved by multi-variate
stats). It does, however, also remove one of the conditions from the join
restrict list, which then only contains "f.a = d.a" and thus covers only
one of the FK columns - so the patch fails to detect the FK :-(

Not sure how to fix this - one way might be to perform the check sooner,
before the second join clause gets removed (not sure where exactly that
happens). Another option is to reconstruct the clauses somehow.
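
FWIW all the information needed to reconstruct the clauses is available in
the catalogs - at the SQL level it's simply pg_constraint (the planner would
of course use the syscache/relcache instead, this is just to illustrate what
is available):

SELECT conname,
       conrelid::regclass  AS fk_table,
       confrelid::regclass AS pk_table,
       conkey,   -- columns of the referencing table
       confkey   -- referenced columns of the PK table
  FROM pg_constraint
 WHERE contype = 'f'
   AND conrelid = 'fact'::regclass;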

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-estimation-with-fkeys-v1.patch text/x-patch 19.7 KB
