Re: PATCH: use foreign keys to improve join estimates v1

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Date: 2015-09-30 01:12:11
Message-ID: CAKJS1f_K3ON13bmpyBwRd=RAPRPN_ju47vSA8GzLk_kNa=smWw@mail.gmail.com
Lists: pgsql-hackers

On 29 September 2015 at 01:59, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> On 09/27/2015 02:00 PM, David Rowley wrote:
>
>> I've been working on this again. I've put back the code that you wrote
>> for looping over each combination of relations from either side of the
>> join.
>>
>> I've also added some code to get around the problem with eclass joins
>> and the RestrictInfo having some alternative Vars that don't belong to
>> the foreign key. Basically I'm just checking if the RestrictInfo has a
>> parent_ec, and if it does, looping over the members to try to find the
>> Vars that belong to the foreign key. I've tested it with the following,
>> and it seems to work:
>>
>
> I didn't have time to look into the code yet, but this seems like an
> interesting idea.
>
>
>
>> create table a as select i as a_id1, i as a_id2, i as dummy1 from generate_series(0,999) s(i);
>> alter table a add unique (a_id1, a_id2);
>> create table b as select i as b_id1, i as b_id2 from generate_series(0,332) s(i);
>>
>> analyze a;
>> analyze b;
>>
>> alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);
>>
>> explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;
>>
>>                                                  QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------
>>  Hash Join  (cost=18.57..26.41 rows=2 width=20) (actual time=0.775..1.046 rows=333 loops=1)
>>    Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
>>    ->  Seq Scan on b  (cost=0.00..5.33 rows=333 width=8) (actual time=0.013..0.046 rows=333 loops=1)
>>    ->  Hash  (cost=18.50..18.50 rows=5 width=12) (actual time=0.737..0.737 rows=1000 loops=1)
>>          Buckets: 1024  Batches: 1  Memory Usage: 51kB
>>          ->  Seq Scan on a  (cost=0.00..18.50 rows=5 width=12) (actual time=0.014..0.389 rows=1000 loops=1)
>>                Filter: (dummy1 = a_id1)
>>
>> The non-patched version estimates 1 row. The patched version estimates
>> 2 rows, but that's due to the bad estimate on dummy1 = a_id1.
>>
>> The 2 comes from ceil(5 * 0.333).
>>
>> Perhaps you have a better test case for this?
>>
>
> I think the additional WHERE clause is needlessly confusing. I've been
> able to come up with an example - pretty much a normalized schema with a
> "main" table and auxiliary tables (referencing the main one using a FK)
> carrying additional info. So it's not unlikely to happen in practice
> (except maybe for the multi-column foreign key bit).
>
>
> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>
> CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES f(id1, id2));
> CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES f(id1, id2));
>
> INSERT INTO f SELECT i, i FROM generate_series(1,1000000) s(i);
>
> INSERT INTO d1 SELECT i, i FROM generate_series(1,100000) s(i);
> INSERT INTO d2 SELECT i, i FROM generate_series(1,300000) s(i);
>
> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
> perfectly accurately, but as soon as the query involves both of them, this
> happens:
>
> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
> JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>
>                                QUERY PLAN
> -------------------------------------------------------------------------
>  Nested Loop  (cost=3334.43..12647.57 rows=30000 width=24)
>               (actual time=221.086..1767.206 rows=100000 loops=1)
>    Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
>    ->  Hash Join  (cost=3334.00..12647.01 rows=1 width=16)
>                   (actual time=221.058..939.482 rows=100000 loops=1)
>          Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
>          ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8)
>                              (actual time=0.038..263.356 rows=300000 loops=1)
>          ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
>                    (actual time=220.721..220.721 rows=100000 loops=1)
>                Buckets: 131072  Batches: 2  Memory Usage: 2982kB
>                ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 ...)
>                                    (actual time=0.033..101.547 rows=100000 loops=1)
>    ->  Index Only Scan using f_pkey on f  (cost=0.42..0.54 rows=1 ...)
>                                           (actual time=0.004..0.004 rows=1 loops=100000)
>          Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
>          Heap Fetches: 100000
>
> Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 100000).
> I assume that's only because we find the FK only on the second join,
> with f.
>
> So it seems like a clear improvement, both compared to master and the
> previous versions of the patch.

I've been experimenting with this example. Of course, the reason why we get
the 1 row estimate on the join between d1 and d2 is that there's no foreign
key between those two relations.
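
Just to spell that out: with no FK and no useful MCV lists, eqjoinsel falls
back (as far as I understand it) to 1/max(ndistinct) per clause, i.e.
1/300000 for each of the two quals, so a direct join of the pair gets
estimated at 100000 * 300000 / 300000^2 = 0.33 rows, which clamp_row_est()
rounds up to 1. The same underestimate should be visible in isolation with:

EXPLAIN SELECT * FROM d1 JOIN d2 ON (d1.id1 = d2.id1 AND d1.id2 = d2.id2);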

The attached patch changes things so that the foreign key matching code is
better able to see foreign keys "hidden" behind eclasses. It does now in
fact detect a foreign key on d2 referencing d1, by looking for foreign keys
whose Vars belong to the same eclasses as the Vars the joinquals are built
from. (Here the join quals put f.id1, d1.id1 and d2.id1 into one
equivalence class, and the id2 columns into another, which is what makes
the match possible.) This has improved the result:

postgres=# EXPLAIN ANALYZE SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2) JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
                                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=16655.94..26066.95 rows=30000 width=24) (actual time=267.322..468.383 rows=100000 loops=1)
   Hash Cond: ((d2.id1 = f.id1) AND (d2.id2 = f.id2))
   ->  Seq Scan on d2  (cost=0.00..4328.00 rows=300000 width=8) (actual time=0.019..31.396 rows=300000 loops=1)
   ->  Hash  (cost=14666.94..14666.94 rows=100000 width=16) (actual time=266.263..266.263 rows=100000 loops=1)
         Buckets: 131072  Batches: 2  Memory Usage: 3373kB
         ->  Merge Join  (cost=9748.32..14666.94 rows=100000 width=16) (actual time=104.494..224.908 rows=100000 loops=1)
               Merge Cond: ((f.id1 = d1.id1) AND (f.id2 = d1.id2))
               ->  Index Only Scan using f_pkey on f  (cost=0.42..36214.93 rows=1000000 width=8) (actual time=0.045..35.758 rows=100001 loops=1)
                     Heap Fetches: 100001
               ->  Sort  (cost=9747.82..9997.82 rows=100000 width=8) (actual time=104.440..122.401 rows=100000 loops=1)
                     Sort Key: d1.id1, d1.id2
                     Sort Method: external sort  Disk: 2152kB
                     ->  Seq Scan on d1  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.019..9.443 rows=100000 loops=1)

The problem is that the code I added is sometimes a bit too optimistic in
finding a suitable foreign key. When performing estimates for the join
between (f,d1) <-> (d2), since the code loops over each relation making up
the set of relations on either side of the join, we find a foreign key on
'f' which references d2; this one actually exists. It then goes on and also
finds a foreign key for (d1) referencing (d2); of course, this one does not
exist and is only matched because of the eclasses. The problem here is:
which one do we use? If we multiplied the selectivity for each of these
foreign keys then we'd end up with a selectivity of
(1.0 / 1000000) * (1.0 / 300000), which is a massive underestimation.
Perhaps doing that would be perfectly valid if we had matched two genuinely
different foreign keys, but it seems wrong when we've matched the same
foreign key in both instances.
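
To put numbers on that: with the 100000 rows estimated for (f,d1) and
300000 rows in d2, the multiplied selectivity would give
100000 * 300000 * (1.0 / 1000000) * (1.0 / 300000) = 0.1 rows, clamped back
up to 1, i.e. right back to the underestimate we started with. (Just my
arithmetic, not output from the patch.)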

I've gone through a few variations on ways to handle this, and I'm a bit
stuck on what the best way is.

In the attached I've coded it to take the Min() selectivity when the same
quals are matched more than once. I know this is not correct, but since it
seems impossible to obtain an exact estimate in this case, we need to
decide on some logic which does well in the average case.
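
For what it's worth, that Min() choice is consistent with the rows=30000 at
the top of the plan above: Min(1.0 / 1000000, 1.0 / 300000) = 1.0 / 1000000,
and 100000 * 300000 * (1.0 / 1000000) = 30000. (Again, just my arithmetic
as a sanity check.)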

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
estimation-with-fkeys-v2_davidv4.patch application/octet-stream 23.2 KB
