Re: Performance improvement for joins where outer side is unique

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2015-12-16 22:40:47
Message-ID: CAKJS1f_i7dGh9ytc0DuaX3JJaN7Gp9AGXNbdtWjj0PkPzNzUeg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 17 December 2015 at 05:02, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> 0) I know the patch does not tweak costing - any plans in this
>
direction? Would it be possible to simply use the costing used by
> semijoin?
>
>
Many thanks for looking at this.

The patch does tweak the costings so that unique joins are costed in the
same way as semi joins. It's a very small change, so easy to miss.

For example, see the change in initial_cost_nestloop()

- if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+ if (jointype == JOIN_SEMI ||
+ jointype == JOIN_ANTI ||
+ unique_inner)
{

Also see the changes in final_cost_nestloop() and final_cost_hashjoin()

1) nodeHashjoin.c (and other join nodes)
>
> I've noticed we have this in the ExecHashJoin() method:
>
> /*
> * When the inner side is unique or we're performing a
> * semijoin, we'll consider returning the first match, but
> * after that we're done with this outer tuple.
> */
> if (node->js.unique_inner)
> node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> That seems a bit awkward because the comment speaks about unique joins
> *OR* semijoins, but the check only references unique_inner. That of course
> works because we do this in ExecInitHashJoin():
>
> switch (node->join.jointype)
> {
> case JOIN_SEMI:
> hjstate->js.unique_inner = true;
> /* fall through */
>
> Either some semijoins may not be unique - in that case the naming is
> misleading at best, and we either need to find a better name or simply
> check two conditions like this:
>
> if (node->js.unique_inner || node->join.type == JOIN_SEMI)
> node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> Or all semijoins are unique joins, and in that case the comment may need
> rephrasing. But more importantly, it begs the question why we're detecting
> this in the executor and not in the planner? Because if we detect it in
> executor, we can't use this bit in costing, for example.
>
>
The reason not to detect that in the planner is simply that unique_join is
meant to mean that the relation on the inner side of the join will at most
contain a single Tuple which matches each outer relation tuple, based on
the join condition. I've added no detection for this in semi joins, and I'd
rather not go blindly setting the flag to true in the planner as it simply
may not be true for the semi join. At the moment that might not matter as
we're only using the unique_join flag as an optimization in the join nodes,
but I'd prefer not to do this as its likely we'll want to do more with this
flag later, and I'd rather we keep the meaning well defined. You might
argue that this is also true for semi joins, but if down the road somewhere
we want to perform some checks on the inner relation before the join takes
place, and in that case the Tuples of the relation might not have the same
properties we claim they do.

But you're right that reusing the flag in the join nodes is not ideal, and
the comment is not that great either. I'd really rather go down the path of
either renaming the variable, or explaining this better in the comment. It
seems unnecessary to check both for each tuple being joined. I'd imagine
that might add a little overhead to joins which are not unique.

How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor
state, although I've not thought of a name yet that I don't find overly
verbose: unique_inner_or_semi_join, match_first_tuple_only.

2) analyzejoins.c
>
> I see that this code in join_is_removable()
>
> innerrel = find_base_rel(root, innerrelid);
>
> if (innerrel->reloptkind != RELOPT_BASEREL)
> return false;
>
> was rewritten like this:
>
> innerrel = find_base_rel(root, innerrelid);
>
> Assert(innerrel->reloptkind == RELOPT_BASEREL);
>
> That suggests that previously the function expected cases where reloptkind
> may not be RELOPT_BASEREL, but now it'll error out int such cases. I
> haven't noticed any changes in the surrounding code that'd guarantee this
> won't happen, but I also haven't been able to come up with an example
> triggering the assert (haven't been trying too hard). How do we know the
> assert() makes sense?
>
>
I'd have changed this as this should be covered by the if
(!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only sets
is_unique_join to true if specialjoin_is_unique_join() returns true and
that function tests reloptkind to ensure its set to RELOPT_BASEREL and
return false if the relation is not. Perhaps what is missing is a comment
to explain the function should only be used on RELOPT_BASEREL type
relations.

>
> 3) joinpath.c
>
> - either "were" or "been" seems redundant
>
> /* left joins were already been analyzed for uniqueness in
> mark_unique_joins() */
>
>
Good catch. Fixed

>
> 4) analyzejoins.c
>
> comment format broken
>
> /*
> * mark_unique_joins
> Analyze joins in order to determine if their inner side is unique
> based
> on the join condition.
> */
>
>
Fixed

> 5) analyzejoins.c
>
> missing line before the comment
>
> }
> /*
> * rel_supports_distinctness
> * Returns true if rel has some properties which can prove
> the relation
> * to be unique over some set of columns.
> *
>

Fixed

I've attached updated patches.

Thanks again for the review!

--
David Rowley http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
unique_joins_17-12-2015_d7b399e.patch application/octet-stream 75.9 KB
unique_joins_17-12-2015_delta.patch application/octet-stream 4.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-12-17 00:01:40 Re: Remove array_nulls?
Previous Message Simon Riggs 2015-12-16 22:28:15 Re: Reducing ClogControlLock contention