Re: PATCH: use foreign keys to improve join estimates v1

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: use foreign keys to improve join estimates v1
Date: 2015-06-08 07:22:20
Message-ID: CAKJS1f91RSjkY571XL09nokjuK849ALBrCORLSAh=dMDc7bJFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19 May 2015 at 11:08, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> On 05/17/15 14:31, David Rowley wrote:
>
>>
>> I think if you find any quals that are a part of *any* foreign key
>> between the 2 join tables, then you should be never assume these quals
>> to reduce the number of rows. I believe this should be the case even if
>> you don't fully match all columns of the foreign key.
>>
>> If we modify your example a little, let's say your foreign key between
>> fact and dim is made from 3 columns (a,b,c)
>>
>> If we do:
>>
>> EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);
>>
>> Then we should always (under normal circumstances) find at least one
>> matching row, although in this case since the join qual for c is
>> missing, we could find more than 1 matching row.
>>
>> Without digging too deep here, I'd say that the best way to do this
>> would be to either have calc_joinrel_size_estimate() build a list of
>> restrictinfo items of all quals that are NOT part of any foreign key and
>> pass that trimmed list down to clauselist_selectivity() for selectivity
>> estimates. Or perhaps a better way would be just to
>> teach clauselist_selectivity() about foreign keys. Likely
>> clauselist_selectivity() would just have to skip over RestrictInfo items
>> that are part of a foreign key.
>>
>
> I'm not particularly happy about the way the current patch tweaks the
> selectivity, but I think simply removing the clauses is not going to work,
> because that implies the (removed) conditions have selectivity 1.0 (so the
> estimate would match a cartesian product). So we really need to estimate
> the selectivity, we can't just throw them away.
>
>
Of course I should have said 1.0 / <estimated rows>

> And that's essentially what the current patch does - it realizes the
> clauses match a FK, and then computes the estimate as 1/N where N is the
> cardinality of the table with PK.
>
> Another thing is that there may be multiple "candidate" foreign keys, and
> we can't just remove clauses matching at least one of them. Imagine e.g. a
> (somewhat artificial) example:
>
> CREATE TABLE parent (
> a INT,
> b INT,
> c INT,
> d INT,
> PRIMARY KEY (a,b),
> UNIQUE (c,d)
> );
>
> CREATE TABLE child (
> a INT,
> b INT,
> c INT,
> d INT,
> FOREIGN KEY (a,b) REFERENCES parent (a,b),
> FOREIGN KEY (c,d) REFERENCES parent (c,b)
> );
>
> and a join on (a,b,c,d). We can't just discard all the conditions, because
> (a,b) and (c,d) may 'mismatch'. We know (a,b) and (c,d) matches about 1/N
> of the 'parent' table, but we don't know selectivity for the whole join
> condition.
>

Ok how about we ignore partial foreign key matches and just get things
working when additional quals exist that are not a part of the foreign key.

I think here it would just be a matter of just deciding on which foreign
key to use the quals for and which ones to estimate the old fashioned way.

How about we have a function:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist) /* XXX
think of a better name */

which will be called from within calc_joinrel_size_estimate()

This function will simply take the list that's currently being sent off
to clauselist_selectivity() and it would search for the biggest possible
subset of foreign key columns from within that list. The returned list
would be what we would pass to clauselist_selectivity(), so if nothing
matched we'd just do the normal thing.

Let's say we have a restrictlist like in the join you describe above...
I'll use p for parent, and c for the child table:

{p.a = c.a}, {p.b = c.b}, {p.c = c.c}, {p.d = c.d}

get_non_foreign_key_quals would parse that list looking for foreign keys
going in either direction from both of these relations. It would look for
the "longest chain" of quals that match a foreign key and return the quals
which couldn't be matched. This "couldn't match" list would be what would
be returned to clauselist_selectivity(), the return value from that
function would have to be multiplied with.... I think 1.0 / min(<est inner
rows>, <est outer rows>). Though I've not quite gotten my head around how
outer joins change that, but I think that's ok for inner... Perhaps the
divide depends on which relation the fk is on... I need to think a bit more
to get my head around that...

I think you could implement this by generating a List of Bitmapset, pseudo
code along the lines of:

List *
get_non_foreign_key_quals(PlannerInfo *root, List *restrictlist) /* XXX
think of a better name */
{
List *fkey_matches;
foreach (foreign key in list)
{
Bitmapset matches;
if (foreign key is not for this rel)
continue;

foreach (qual in fkey)
{
int which_qual = 1;
foreach (qual in restrict list)
{
if (fkey qual matches ri qual)
matches = bmp_add_member(matches, which_qual);
which_qual++;
}
}
fkey_matches = lappend(fkey_matches, matches);
}

int longest_match = -1
int longest_match_idx;
int idx = 0;
foreach (match in fkey_matches)
{
int nmembers = bms_num_members(match);
/* only change the match it beats a previous match
* as we want to take the first longest match */
if (nmembers > longest_match)
{
longest_match = nmembers;
longest_match_idx = idx;
}
idx++;
}

if (longest_match == list_length(restrictlist))
return NULL; /* everything matches a fkey */
else
{
List *non_fk_list;
int which_qual = 1;
foreach (qual in restrict_list)
{
if (!bms_is_member(longest_match, which_qual))
non_fk_list = lappend(non_fk_list, qual);
which_qual++;
}
return non_fk_list;
}
}

Now in your example there's two foreign keys both of which will match 2
different quals each, this means that we have two different equally long
chains. I don't propose we do anything more complex than take the first of
the equal length chains. Although perhaps we'll want something stable...
perhaps in order of constraint name or something, but for a first cut I
highly doubt this matters. (See qsort() at the bottom
of CheckConstraintFetch() for example of what I mean)

> And it may be more complex, e.g. there may be two overlapping FKeys, e.b.
> (a,b) and (b,c) - how do you split that?
>
> But this may be an overkill (multiple multi-column FK join), although if
> we could handle that reasonably, then why not ... someone out there
> certainly has schema like that ;-)
>
> What I think is somewhat more realistic is conditions matching only a
> subset of FK columns - for example with a FK on (a,b) the join may only use
> (a), for example. Then again - we can't just discard that, we need to
> estimate that (and use it to compute the actual selectivity).
>
> I agree that if we want to do anything fancy with this, we will have to
> teach clauselist_selectivity() about foreign keys.
>
>
A few other things:

+ /* TODO Can we do something like (hasindex) here? Is it necessary? */
+ if (true)

I wondered about this too in my join removals stuff. I ended up adding a
pg_class flag for this. This was Tom's response:

http://www.postgresql.org/message-id/18703.1410450446@sss.pgh.pa.us

Note that the unsetting of relhaspkey is not so great. It only happens to
get unset if the primary key is dropped and no other indexes of any type
exist on the relation and the table is vacuumed. So I think I understand
why Tom didn't want more flags that can end up being wrongly set.

+bool
+rel_has_unique_index(PlannerInfo *root, RelOptInfo *rel)

root is not used by this function.

+ for (i = 0; i < fkinfo->nkeys; i++)
+ match &= matches[i];
+
+ break;

This looks wrong. Why use break? This seems to be broken if there's more
than 1 foreign key between the two relations, as if the first of the two
does not match then it looks like you'll just return false... I've not
tested but that's the way the code seems to read.

Something like:

+ match = true;
+
+ for (i = 0; i < fkinfo->nkeys; i++)
+ {
+ if (!matches[i])
+ {
+ matched = false;
+ break;
+ }
+ }
+
+ /* each fkey column has been matched to the correct index column */
+ if (matched)
+ return true;
+ } /* foreach (lc, rel->fkeylist) */
+
+ return false; /* no matching fkey found */

With this we'll try the next fkey in the list if we didn't find a match.

Also I'm pretty sure it's just your temporary workings... but
in join_matches_fkey() you're returning 0, 1, or 2 from a bool function.
perhaps this won't be required anymore if my get_non_foreign_key_quals()
idea works out.

Thoughts?

Regards

David Rowley

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Cédric Villemain 2015-06-08 07:45:06 Re: checkpointer continuous flushing
Previous Message Andres Freund 2015-06-08 06:52:57 Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1