Re: PATCH: use foreign keys to improve join estimates v1

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Date: 2015-12-24 02:45:29
Message-ID: CAB7nPqTP63KXqFZ3w2M+14y=ZEv1tU-X_fEDJo+25f-qTPaKvg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 30, 2015 at 10:12 AM, David Rowley
<david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 29 September 2015 at 01:59, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
>>
>> Hi,
>>
>> On 09/27/2015 02:00 PM, David Rowley wrote:
>>>
>>> I've been working on this again. I've put back the code that you wrote
>>> for looping over each combination of relations from either side of
>>> the join.
>>>
>>> I've also added some code to get around the problem with eclass joins
>>> where the RestrictInfo has some alternative Vars that don't belong to
>>> the foreign key. Basically I'm just checking whether the RestrictInfo has a
>>> parent_ec, and if it does, looping over the members to try to find the
>>> Vars that belong to the foreign key. I've tested it with the following,
>>> and it seems to work:
>>
>>
>> I didn't have time to look into the code yet, but this seems like an
>> interesting idea.
>>
>>
>>>
>>> create table a as select i as a_id1, i as a_id2, i as dummy1 from
>>> generate_series(0,999) s(i);
>>> alter table a add unique (a_id1, a_id2);
>>> create table b as select i as b_id1, i as b_id2 from
>>> generate_series(0,332) s(i);
>>>
>>> analyze a;
>>> analyze b;
>>>
>>> alter table b add foreign key (b_id1, b_id2) references a (a_id1, a_id2);
>>>
>>> explain analyze select * from a inner join b on a.dummy1 = b.b_id1 and
>>> a.a_id2 = b.b_id2 where a.a_id1 = a.dummy1;
>>>
>>> QUERY PLAN
>>>
>>> -----------------------------------------------------------------------------------------------------------
>>> Hash Join (cost=18.57..26.41 rows=2 width=20) (actual
>>> time=0.775..1.046 rows=333 loops=1)
>>> Hash Cond: ((b.b_id1 = a.dummy1) AND (b.b_id2 = a.a_id2))
>>> -> Seq Scan on b (cost=0.00..5.33 rows=333 width=8) (actual
>>> time=0.013..0.046 rows=333 loops=1)
>>> -> Hash (cost=18.50..18.50 rows=5 width=12) (actual
>>> time=0.737..0.737 rows=1000 loops=1)
>>> Buckets: 1024 Batches: 1 Memory Usage: 51kB
>>> -> Seq Scan on a (cost=0.00..18.50 rows=5 width=12) (actual
>>> time=0.014..0.389 rows=1000 loops=1)
>>> Filter: (dummy1 = a_id1)
>>>
>>> The non-patched version estimates 1 row. The patched version estimates 2
>>> rows, but that's due to the bad estimate on dummy1 = a_id1.
>>>
>>> The 2 comes from ceil(5 * 0.333).
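>>>
>>> That's just my reading of the numbers, by the way: the rows=5 estimate on
>>> a from the plan above, times b's 333 rows, divided by a's 1000 rows:
>>>
>>> select ceil(5 * 333 / 1000.0);  -- 1.665, which rounds up to 2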
>>>
>>> Perhaps you have a better test case for this?
>>
>>
>> I think the additional WHERE clause is needlessly confusing. I've been
>> able to come up with an example - pretty much a normalized schema with a
>> "main" table and auxiliary tables (referencing the main one using a FK)
>> carrying additional info. So not unlikely to happen in practice (except
>> maybe for the multi-column foreign key bit).
>>
>>
>> CREATE TABLE f (id1 INT, id2 INT, PRIMARY KEY (id1, id2));
>>
>> CREATE TABLE d1 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
>> f(id1, id2));
>> CREATE TABLE d2 (id1 INT, id2 INT, FOREIGN KEY (id1, id2) REFERENCES
>> f(id1, id2));
>>
>> INSERT INTO f SELECT i, i FROM generate_series(1,1000000) s(i);
>>
>> INSERT INTO d1 SELECT i, i FROM generate_series(1,100000) s(i);
>> INSERT INTO d2 SELECT i, i FROM generate_series(1,300000) s(i);
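>>
>> For reference, the pair-wise joins I mention below are simply:
>>
>> EXPLAIN SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2);
>> EXPLAIN SELECT * FROM f JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);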
>>
>> now, both pair-wise joins (f JOIN d1) and (f JOIN d2) are estimated
>> perfectly accurately, but as soon as the query involves both of them, this
>> happens:
>>
>> SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND f.id2 = d1.id2)
>> JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
>>
>> QUERY PLAN
>> -------------------------------------------------------------------------
>> Nested Loop (cost=3334.43..12647.57 rows=30000 width=24)
>> (actual time=221.086..1767.206 rows=100000 loops=1)
>> Join Filter: ((d1.id1 = f.id1) AND (d1.id2 = f.id2))
>> -> Hash Join (cost=3334.00..12647.01 rows=1 width=16)
>> (actual time=221.058..939.482 rows=100000 loops=1)
>> Hash Cond: ((d2.id1 = d1.id1) AND (d2.id2 = d1.id2))
>> -> Seq Scan on d2 (cost=0.00..4328.00 rows=300000 width=8)
>> (actual time=0.038..263.356 rows=300000 loops=1)
>> -> Hash (cost=1443.00..1443.00 rows=100000 width=8)
>> (actual time=220.721..220.721 rows=100000 loops=1)
>> Buckets: 131072 Batches: 2 Memory Usage: 2982kB
>> -> Seq Scan on d1 (cost=0.00..1443.00 rows=100000 ...)
>> (actual time=0.033..101.547 rows=100000 loops=1)
>> -> Index Only Scan using f_pkey on f (cost=0.42..0.54 rows=1 ...)
>> (actual time=0.004..0.004 rows=1 loops=100000)
>> Index Cond: ((id1 = d2.id1) AND (id2 = d2.id2))
>> Heap Fetches: 100000
>>
>> Clearly, the inner join (d1 JOIN d2) is poorly estimated (1 vs. 100000). I
>> assume that's only because we find the FK only on the second join with f.
>>
>> So it seems like a clear improvement, both compared to master and the
>> previous versions of the patch.
>
>
> I've been experimenting with this example. Of course, the reason why we get
> the 1 row estimate on the join between d1 and d2 is that there's no foreign
> key between those two relations.
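>
> For what it's worth, that's easy to confirm from the catalogs:
>
> SELECT conname, conrelid::regclass AS referencing, confrelid::regclass AS referenced
> FROM pg_constraint WHERE contype = 'f';
>
> which, with this example schema, lists only the two foreign keys referencing
> f and nothing between d1 and d2.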
>
> The attached patch changes things so that the foreign key matching code is
> better able to see foreign keys "hidden" behind eclasses. So it does now in
> fact detect a foreign key on d2 referencing d1, by looking for foreign keys
> which have Vars in the same eclasses as the ones the joinquals are built
> from. This has improved the result:
>
> postgres=# EXPLAIN ANALYZE SELECT * FROM f JOIN d1 ON (f.id1 = d1.id1 AND
> f.id2 = d1.id2) JOIN d2 ON (f.id1 = d2.id1 AND f.id2 = d2.id2);
> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=16655.94..26066.95 rows=30000 width=24) (actual
> time=267.322..468.383 rows=100000 loops=1)
> Hash Cond: ((d2.id1 = f.id1) AND (d2.id2 = f.id2))
> -> Seq Scan on d2 (cost=0.00..4328.00 rows=300000 width=8) (actual
> time=0.019..31.396 rows=300000 loops=1)
> -> Hash (cost=14666.94..14666.94 rows=100000 width=16) (actual
> time=266.263..266.263 rows=100000 loops=1)
> Buckets: 131072 Batches: 2 Memory Usage: 3373kB
> -> Merge Join (cost=9748.32..14666.94 rows=100000 width=16)
> (actual time=104.494..224.908 rows=100000 loops=1)
> Merge Cond: ((f.id1 = d1.id1) AND (f.id2 = d1.id2))
> -> Index Only Scan using f_pkey on f (cost=0.42..36214.93
> rows=1000000 width=8) (actual time=0.045..35.758 rows=100001 loops=1)
> Heap Fetches: 100001
> -> Sort (cost=9747.82..9997.82 rows=100000 width=8) (actual
> time=104.440..122.401 rows=100000 loops=1)
> Sort Key: d1.id1, d1.id2
> Sort Method: external sort Disk: 2152kB
> -> Seq Scan on d1 (cost=0.00..1443.00 rows=100000
> width=8) (actual time=0.019..9.443 rows=100000 loops=1)
>
> The problem is that the code I added is sometimes a bit too optimistic at
> finding a suitable foreign key. When performing estimates for the join
> between (f,d1) <-> (d2), since the code loops over each relation making up
> the set of relations on either side of the join, we find a foreign key on
> 'f' which references d2, and this one actually exists. It then goes on and
> also finds a foreign key from (d1) referencing (d2); of course this one does
> not exist, and is only found because of the eclasses. The problem here is:
> which one do we use? If we multiply the selectivity for each of these
> foreign keys then we'd end up with a selectivity = (1.0 / 1000000) *
> (1.0 / 300000), which is a massive underestimation. Perhaps doing this would
> be perfectly valid if the two matches were to two different foreign keys,
> but it seems wrong when we match the same foreign key in both instances.
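>
> To put numbers on that (my arithmetic based on the example, not output from
> the patch): using only the real foreign key gives 100000 * 300000 *
> (1.0 / 1000000) = 30000, which is the estimate shown in the plan above,
> whereas multiplying both selectivities gives 0.1, which would be clamped to
> 1 row:
>
> SELECT 100000::numeric * 300000 * (1.0 / 1000000)                  AS one_fk,
>        100000::numeric * 300000 * (1.0 / 1000000) * (1.0 / 300000) AS both_fks;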
>
> I've gone through a few variations on ways to handle this, and I'm a bit
> stuck on what the best way is.
>
> In the attached I've coded it to take the Min() selectivity when the same
> quals are matched more than once. I know this is not correct, but since it
> seems impossible to obtain an exact estimate in this case, we'd need to
> decide on some logic which does well in the average case.
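>
> Applying that to the numbers above (again my arithmetic): Min(1.0 / 1000000,
> 1.0 / 300000) picks 1.0 / 1000000, so the estimate comes out the same as if
> only the real foreign key had been matched, i.e. the 30000 rows in the plan
> above:
>
> SELECT 100000::numeric * 300000 * least(1.0 / 1000000, 1.0 / 300000) AS est_rows;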

Is there still interest in this patch? The last message of this thread
added a new version of the patch, but the patch has been sitting in
"Waiting on author" state for a couple of months. Guessing that the
status was incorrect, I have moved it to the next CF.
--
Michael
