From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Improve hash join's handling of tuples with null join keys |
Date: | 2025-08-14 02:36:06 |
Message-ID: | 544D7C83-CECE-44E7-B5D7-530E9318D231@gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> On Aug 13, 2025, at 17:16, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
> I downloaded the patch and tested all join types: inner, left, right, full, semi and anti. Basically my tests all passed. However, I didn't test any case of parallel query.
>
> I have two nit comments:
>
> 1. In hashjoin.h, line 76-78, the added comment says "(In the unlikely but supported case of a non-strict join operator, we treat null keys as normal data.)". But I don't see where non-strict join is handled. So, how this patch impact non-strict joins?
>
I take back this comment, and I get a new comment related.
@@ -1015,11 +1144,19 @@ ExecHashJoinOuterGetTuple(PlanState *outerNode,
if (!isnull)
{
+ /* normal case with a non-null join key */
/* remember outer relation is not empty for possible rescan */
hjstate->hj_OuterNotEmpty = true;
return slot;
}
+ else if (hjstate->hj_KeepNullTuples)
+ {
+ /* null join key, but we must save tuple to be emitted later */
+ if (hjstate->hj_NullOuterTupleStore == NULL)
+ hjstate->hj_NullOuterTupleStore = ExecHashBuildNullTupleStore(hashtable);
+ tuplestore_puttupleslot(hjstate->hj_NullOuterTupleStore, slot);
+ }
When an outer tuple contains null join key, without this patch, “isnull” flag is false, so the tuple will still be returned, and for an outer join, the tuple slot will be returned to parent node immediately.
With this patch, “isnull” now becomes true because of the change of strict op. Then the outer null join key tuple must be stored in a tuplestore. When an outer table contains a lot of null join key tuples, then the tuplestore could bump to very large, in that case, it would be hard to say this patch really benefits.
I am think that, can we only do the first half of this patch? Only putting inner table’s null join key tuple into a tuplestore. So that inner hash table’s performance gets improved, and outer table’s logic keeps the same, then overall this patch makes a pure improvement without the potential memory burden from outerNullTupleStore.
—————————
I also got an idea for improving the hash logic.
/*
* If the outer relation is completely empty, and it's not
* right/right-anti/full join, we can quit without building
* the hash table. However, for an inner join it is only a
* win to check this when the outer relation's startup cost is
* less than the projected cost of building the hash table.
* Otherwise it's best to build the hash table first and see
* if the inner relation is empty. (When it's a left join, we
* should always make this check, since we aren't going to be
* able to skip the join on the strength of an empty inner
* relation anyway.)
*/
if (HJ_FILL_INNER(node))
{
/* no chance to not build the hash table */
node->hj_FirstOuterTupleSlot = NULL;
}
Based on this patch, if we are doing a left join, and outer table is empty, then all tuples from the inner table should be returned. In that case, we can skip building a hash table, instead, we can put all inner table tuples into hashtable.innerNullTupleStore. Building a tuplestore should be cheaper than building a hash table, so this way makes a little bit more performance improvement.
Regards,
Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/
From | Date | Subject | |
---|---|---|---|
Next Message | jian he | 2025-08-14 02:47:08 | Re: ALTER DOMAIN ADD NOT NULL NOT VALID |
Previous Message | Ziga | 2025-08-14 02:29:28 | Re: Retail DDL |