Re: Performance improvement for joins where outer side is unique

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Performance improvement for joins where outer side is unique
Date: 2015-12-16 16:02:29
Message-ID: 56718B15.6060507@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 12/16/2015 01:27 AM, David Rowley wrote:
>
> I've attached a rebased patch against current master as there were
> some conflicts from the recent changes to LATERAL join.

Thanks. I've looked at the rebased patch and have a few minor comments.

0) I know the patch does not tweak costing - any plans in this
direction? Would it be possible to simply use the costing used by
semijoin?

1) nodeHashjoin.c (and other join nodes)

I've noticed we have this in the ExecHashJoin() method:

/*
* When the inner side is unique or we're performing a
* semijoin, we'll consider returning the first match, but
* after that we're done with this outer tuple.
*/
if (node->js.unique_inner)
node->hj_JoinState = HJ_NEED_NEW_OUTER;

That seems a bit awkward because the comment speaks about unique joins
*OR* semijoins, but the check only references unique_inner. That of
course works because we do this in ExecInitHashJoin():

switch (node->join.jointype)
{
case JOIN_SEMI:
hjstate->js.unique_inner = true;
/* fall through */

Either some semijoins may not be unique - in that case the naming is
misleading at best, and we either need to find a better name or simply
check two conditions like this:

if (node->js.unique_inner || node->join.type == JOIN_SEMI)
node->hj_JoinState = HJ_NEED_NEW_OUTER;

Or all semijoins are unique joins, and in that case the comment may need
rephrasing. But more importantly, it begs the question why we're
detecting this in the executor and not in the planner? Because if we
detect it in executor, we can't use this bit in costing, for example.

FWIW the misleading naming was actually mentioned in the thread by
Horiguchi-san, but I don't see any response to that (might have missed
it though).

2) analyzejoins.c

I see that this code in join_is_removable()

innerrel = find_base_rel(root, innerrelid);

if (innerrel->reloptkind != RELOPT_BASEREL)
return false;

was rewritten like this:

innerrel = find_base_rel(root, innerrelid);

Assert(innerrel->reloptkind == RELOPT_BASEREL);

That suggests that previously the function expected cases where
reloptkind may not be RELOPT_BASEREL, but now it'll error out int such
cases. I haven't noticed any changes in the surrounding code that'd
guarantee this won't happen, but I also haven't been able to come up
with an example triggering the assert (haven't been trying too hard).
How do we know the assert() makes sense?

3) joinpath.c

- either "were" or "been" seems redundant

/* left joins were already been analyzed for uniqueness in
mark_unique_joins() */

4) analyzejoins.c

comment format broken

/*
* mark_unique_joins
Analyze joins in order to determine if their inner side is unique based
on the join condition.
*/

5) analyzejoins.c

missing line before the comment

}
/*
* rel_supports_distinctness
* Returns true if rel has some properties which can prove the relation
* to be unique over some set of columns.
*

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-12-16 16:21:53 Re: fix for readline terminal size problems when window is resized with open pager
Previous Message Stas Kelvich 2015-12-16 14:31:33 Re: Cube extension kNN support