Re: Performance improvement for joins where outer side is unique

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2015-12-17 00:17:17
Message-ID: 5671FF0D.2010403@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 12/16/2015 11:40 PM, David Rowley wrote:
> On 17 December 2015 at 05:02, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com
> <mailto:tomas(dot)vondra(at)2ndquadrant(dot)com>> wrote:
>
> 0) I know the patch does not tweak costing - any plans in this
>
> direction? Would it be possible to simply use the costing used by
> semijoin?
>
>
> Many thanks for looking at this.
>
> The patch does tweak the costings so that unique joins are costed in the
> same way as semi joins. It's a very small change, so easy to miss.

Thanks. I missed that bit somehow.

>
> For example, see the change in initial_cost_nestloop()
>
> - if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
> + if (jointype == JOIN_SEMI ||
> + jointype == JOIN_ANTI ||
> + unique_inner)
> {
>
> Also see the changes in final_cost_nestloop() and final_cost_hashjoin()
>
>
> 1) nodeHashjoin.c (and other join nodes)
>
> I've noticed we have this in the ExecHashJoin() method:
>
> /*
> * When the inner side is unique or we're performing a
> * semijoin, we'll consider returning the first match, but
> * after that we're done with this outer tuple.
> */
> if (node->js.unique_inner)
> node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> That seems a bit awkward because the comment speaks about unique
> joins *OR* semijoins, but the check only references unique_inner.
> That of course works because we do this in ExecInitHashJoin():
>
> switch (node->join.jointype)
> {
> case JOIN_SEMI:
> hjstate->js.unique_inner = true;
> /* fall through */
>
> Either some semijoins may not be unique - in that case the naming is
> misleading at best, and we either need to find a better name or
> simply check two conditions like this:
>
> if (node->js.unique_inner || node->join.type == JOIN_SEMI)
> node->hj_JoinState = HJ_NEED_NEW_OUTER;
>
> Or all semijoins are unique joins, and in that case the comment may
> need rephrasing. But more importantly, it begs the question why
> we're detecting this in the executor and not in the planner? Because
> if we detect it in executor, we can't use this bit in costing, for
> example.
>
>
> The reason not to detect that in the planner is simply that unique_join
> is meant to mean that the relation on the inner side of the join will at
> most contain a single Tuple which matches each outer relation tuple,
> based on the join condition. I've added no detection for this in semi
> joins, and I'd rather not go blindly setting the flag to true in the
> planner as it simply may not be true for the semi join. At the moment
> that might not matter as we're only using the unique_join flag as an
> optimization in the join nodes, but I'd prefer not to do this as its
> likely we'll want to do more with this flag later, and I'd rather we
> keep the meaning well defined. You might argue that this is also true
> for semi joins, but if down the road somewhere we want to perform some
> checks on the inner relation before the join takes place, and in that
> case the Tuples of the relation might not have the same properties we
> claim they do.
>
> But you're right that reusing the flag in the join nodes is not ideal,
> and the comment is not that great either. I'd really rather go down the
> path of either renaming the variable, or explaining this better in the
> comment. It seems unnecessary to check both for each tuple being joined.
> I'd imagine that might add a little overhead to joins which are not unique.

I'd be very surprised it that had any measurable impact.

> How about changing the comment to:
>
> /*
> * In the event that the inner side has been detected to be
> * unique, as an optimization we can skip searching for any
> * subsequent matching inner tuples, as none will exist.
> * For semijoins unique_inner will always be true, although
> * in this case we don't match another inner tuple as this
> * is the required semi join behavior.
> */
>
> Alternatively or additionally we can rename the variable in the executor
> state, although I've not thought of a name yet that I don't find overly
> verbose: unique_inner_or_semi_join, match_first_tuple_only.

I'd go with match_first_tuple_only.

>
>
> 2) analyzejoins.c
>
> I see that this code in join_is_removable()
>
> innerrel = find_base_rel(root, innerrelid);
>
> if (innerrel->reloptkind != RELOPT_BASEREL)
> return false;
>
> was rewritten like this:
>
> innerrel = find_base_rel(root, innerrelid);
>
> Assert(innerrel->reloptkind == RELOPT_BASEREL);
>
> That suggests that previously the function expected cases where
> reloptkind may not be RELOPT_BASEREL, but now it'll error out int
> such cases. I haven't noticed any changes in the surrounding code
> that'd guarantee this won't happen, but I also haven't been able to
> come up with an example triggering the assert (haven't been trying
> too hard). How do we know the assert() makes sense?
>
>
> I'd have changed this as this should be covered by the if
> (!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only
> sets is_unique_join to true if specialjoin_is_unique_join() returns true
> and that function tests reloptkind to ensure its set to RELOPT_BASEREL
> and return false if the relation is not. Perhaps what is missing is a
> comment to explain the function should only be used on RELOPT_BASEREL
> type relations.

Yeah, I think it'd be good to document this contract somewhere. Either
in the comment before the function, or perhaps right above the Assert().

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2015-12-17 01:46:01 Re: Multi-tenancy with RLS
Previous Message Robert Haas 2015-12-17 00:01:40 Re: Remove array_nulls?