Re: PATCH: use foreign keys to improve join estimates v1

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Date: 2015-08-20 01:49:46
Message-ID: 55D5323A.9060503@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

attached is a significantly reworked patch for using the foreign keys in
selectivity estimation. The previous patch only really worked if the
clauses matched the foreign key perfectly (i.e. no additional join
clauses) - this patch attempts to relax those restrictions a bit.

This patch also significantly improves the comments - the best place to
start reading is clauselist_join_selectivity().

In general, the patch works by looking for foreign keys between the
inner and outer side of the join, but only for keys that:

(a) have more than 2 keys (as this only really helps with multi-
column foreign keys)

(b) are 'fully matched' by the join clauses, i.e. there's a clause
exactly matching each part of the foreign key

Once we have matched foreign keys (for each join), we can estimate each
of them using 1/cardinality of the referenced table and estimate the
remaining clauses (not used to match any foreign key) the old way.

example with 3 tables
---------------------
create table a (a1 int, a2 int, primary key (a1, a2));

create table b (b1 int, b2 int, primary key (b1, b2));

create table f (f1 int, f2 int, f3 int, f4 int,
foreign key (f1,f2) references a(a1,a2),
foreign key (f3,f4) references b(b1,b2));

insert into a select i, i from generate_series(1,10000) s(i);
insert into b select i, i from generate_series(1,10000) s(i);

-- 10x
insert into f select i, i, i, i from generate_series(1,10000) s(i);

analyze;

Then on current master, I get these estimates (showing just rows,
because that's what matter):

while with the patch I get this:

select * from f join a on (f1 = a1 and f2 = a2);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (rows=100000) (actual rows=100000)
Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
-> Seq Scan on f (rows=100000) (actual rows=100000)
-> Hash (rows=10000) (actual rows=10000)
Buckets: 16384 Batches: 1 Memory Usage: 519kB
-> Seq Scan on a (rows=10000) (actual rows=10000)

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f3 = b1 and f4 = b2);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (rows=100000) (actual rows=100000)
Hash Cond: ((f.f3 = b.b1) AND (f.f4 = b.b2))
-> Hash Join (rows=100000) (actual rows=100000)
Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
-> Seq Scan on f (rows=100000) (actual rows=100000)
-> Hash (rows=10000) (actual rows=10000)
Buckets: 16384 Batches: 1 Memory Usage: 519kB
-> Seq Scan on a (rows=10000) (actual rows=10000)
-> Hash (rows=10000) (actual rows=10000)
Buckets: 16384 Batches: 1 Memory Usage: 519kB
-> Seq Scan on b (rows=10000) (actual rows=10000)

So, that's pretty good.

I believe it might be possible to use even foreign keys matched only
partially (e.g. foreign key on 3 columns, but only 2 of those matched by
clauses), but I think that's a bit too much for now.

The examples above are rather trivial, and sadly it's not difficult to
break them. For example by adding a single additional join clause to the
first query does this:

select * from f join a on (f1 = a1 and f2 = a2 and f1 = a2);

QUERY PLAN
------------------------------------------------------------------------
Hash Join (rows=2) (actual rows=100000)
Hash Cond: (f.f1 = a.a1)
-> Seq Scan on f (rows=500) (actual rows=100000)
Filter: (f1 = f2)
-> Hash (rows=50) (rows=10000)
Buckets: 16384 (originally 1024) Batches: 1 (originally 1) ...
-> Seq Scan on a (rows=50) (actual rows=10000)
Filter: (a1 = a2)

This of course happens "thanks to" equality classes because the planner
is smart enough to realize that (f1=f2=a1=a2) thanks to the new clause.
I'm not sure how to fix this, and maybe this particular case would be
easier to fix using the multivariate stats (once it can estimate clauses
like a1=a2).

Similarly, the equality classes may break other examples by deriving
completely new clauses - for example assume the "f" table is defined
like this (again, with 100k rows):

create table f (f1 int, f2 int,
foreign key (f1,f2) references a(a1,a2),
foreign key (f1,f2) references b(b1,b2));

then this query

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f1 = b1 and f2 = b2);

may get planned like this:

QUERY PLAN
------------------------------------------------------------------------
Hash Join (rows=100000) (actual rows=100000)
Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
-> Seq Scan on f (rows=100000) (actual rows=100000)
-> Hash (rows=1) (actual rows=10000)
-> Hash Join (rows=1) (actual rows=10000)
Hash Cond: ((a.a1 = b.b1) AND (a.a2 = b.b2))
-> Seq Scan on a (rows=10000) (actual rows=10000)
-> Hash (rows=10000) (actual rows=10000)
-> Seq Scan on b (rows=10000) (actual rows=10000)

because the planner derived that (a1=b1 and a2=b2) by looking at both
join clauses, and joined 'a' and 'b' first. And of course, there are no
foreign keys between these two tables (effectively dimensions), so the
patch can do nothing about the selectivities.

I'm not sure how serious problem this really is in practice - those
examples are constructed to show the issue. I also can't find a good
place to address this - either by tweaking the estimates before the
equality classes are processed, or somehow after.

It however illustrates that with this patch the user would be able to
influence the planner - either intentionally or by accident. For example
what should happen if someone creates the same foreign key twice? Should
we detect it and only count it once into the selectivity, or what?

There's a bunch of similar cases mentioned in the comment before
clauselist_join_selectivity.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-estimation-with-fkeys-v2.patch text/x-diff 29.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-08-20 02:07:47 Re: DBT-3 with SF=20 got failed
Previous Message Peter Geoghegan 2015-08-20 01:21:41 Re: Our trial to TPC-DS but optimizer made unreasonable plan