Re: [HACKERS] PoC: full merge join on comparison clause

From: Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: [HACKERS] PoC: full merge join on comparison clause
Date: 2018-07-10 16:36:56
Message-ID: 1a4bac9a-640d-31b8-fb6f-cc4bf36abbf9@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

I tried to fix the things you mentioned and improve the comments. Among
other changes, there is now a description of how merge join works with
inequalities at the top of nodeMergejoin.c. It also explains why we only
support one inequality clause.

Some particular points:

On 07/06/2018 04:01 PM, Ashutosh Bapat wrote:
> - StrategyNumber opstrategy = mergestrategies[iClause];
> + StrategyNumber sort_strategy = mergestrategies[iClause];
> - int op_strategy;
> + int join_strategy;
> I don't see a reason why should we change the name of variable here. These are
> operator strategies and there's no need to change their names. The name change
> is introducing unnecessary diffs.

These variables have different meaning but their names differ only with
an underscore. When I had to change this function, I made mistakes
because of this. I'd keep the descriptive names to avoid further
confusion. Should this be a separate patch?

> I think we should write a wrapper around MJCompare which interprets the result rather
> than changing MJCompare itself. OR at least change the name of MJCompare.

Renamed the function to MJTestTuples to reflect that it decides whether
we join tuples or advance either side.

> - * for get_mergejoin_opfamilies().
> + * for get_equality_opfamilies().
>
> I think we should leave mergejoin word in there or at least indicate that these
> are btree opfamilies so that we don't confuse it with hash equality operator
> families.

Renamed these to get_btree_equality_opfamilies() and
get_btree_comparison_opfamilies().

> + if (parent->mj_Ineq_Present)
> + elog(ERROR, "inequality mergejoin clause must be the last one");
> +
>
> IIUC, this never happens. If it really happens, we have created a path which
> can not be used practically. That should never happen. It will help to add a
> comment here clarifying that situation.

This is just a cross-check for the planner. Added a comment. We should
probably use a separate error code for internal errors as opposed to
user errors, but I'm not sure if we have one, I see just elog(ERROR)
being used everywhere.

>
> + bool have_inequality = have_inequality_mergeclause(mergeclauses);
>
> There will be many paths created with different ordering of pathkeys. So,
> instead of calling have_inequality_mergeclause() for each of those paths, it's
> better to save this status in the path itself when creating the path.

I removed this function altogether, because we can just check the last
merge clause. When we cost the path, we already have a proper
mergejoinable list of clauses, so if there is an inequality clause, it's
the last one.

> /* Make a pathkey list with this guy first */
> if (l != list_head(all_pathkeys))
> + {
> + if (have_inequality && l == list_tail(all_pathkeys))
> + /* Inequality merge clause must be the last, we can't
> move it */
> + break;
> +
>
> I am kind of baffled by this change. IIUC the way we create different orderings
> of pathkeys here, we are just rotating the pathkeys in circular order. This
> means there is exactly one ordering of pathkeys where the pathkey corresponding
> to the inequality clause is the last one.

This code does not rotate the pathkeys circularly, but puts each of them
in the first position, and keeps the rest in the original order.
Say, if we have three equality pathkeys, and one inequality pathkey at
the end (let's denote them as E1, E2, E3, IE), the permutations it tries
will be like this:
E1 E2 E3 IE
E2 E1 E3 IE
E3 E1 E2 IE
Does this sound right?

> /* Might have no mergeclauses */
> if (nClauses == 0)
> return NIL;
>
> + {
> + List *ineq_clauses = find_inequality_clauses(mergeclauses);
> +
> + if (list_length(ineq_clauses) > 1)
> + return NIL;
>
> Without this patch, when there is an inequality clause with FULL JOIN, we will
> not create a merge join path because select_mergejoin_clauses() will set
> mergejoin_allowed to false. This means that we won't call
> sort_inner_and_outer(). I think this patch also has to do the same i.e. when
> there are more than one inequality clauses, select_mergejoin_clauses() should
> set mergejoin_allowed to false in case of a FULL JOIN since merge join
> machinary won't be able to handle that case.
>
> If we do that, we could arrange extra.mergeclause_list such that the inequality
> clause is always at the end thus finding inequality clause would be easy.

I changed select_mergejoin_clauses() to filter multiple inequality
clauses and disable join if needed. Now we can use extra inequalities as
join filter, if it's not full join. I didn't reorder
extra.mergeclause_list there, because this order is ignored later.
select_outer_pathkeys_for_merge() chooses the order of pathkeys using
some heuristics, and then find_mergeclauses_for_outer_pathkeys()
reorders the clauses accordingly.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0002-Inequality-merge-join-v9.patch text/x-patch 48.8 KB
0001-Preparatory-refactoring-v9.patch text/x-patch 25.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2018-07-10 16:37:49 Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Previous Message Dilip Kumar 2018-07-10 15:44:06 Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX