Re: Improve hash join's handling of tuples with null join keys

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve hash join's handling of tuples with null join keys
Date: 2025-06-02 16:47:29
Message-ID: 496221.1748882849@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Tue, May 6, 2025 at 12:12 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> On 5/6/25 01:11, Tom Lane wrote:
>>> The attached patch is a response to the discussion at [1], where
>>> it emerged that lots of rows with null join keys can send a hash
>>> join into too-many-batches hell, if they are on the outer side
>>> of the join so that they must be null-extended not just discarded.

> Good idea. I haven't reviewed it properly, but one observation is
> that trapping the null-keys tuples in per-worker tuple stores creates
> unfairness. That could be fixed by using a SharedTuplestore instead,
> but unfortunately SharedTuplestore always spills to disk at the
> moment, so maybe I should think about how to give it some memory for
> small sets like regular Tuplestore. Will look more closely after
> Montreal.

Hmm ... I'm unpersuaded that "fairness" is an argument for adding
overhead to the processing of these tuples. It's very hard to see
how shoving them into a shared tuplestore can beat not shoving them
into a shared tuplestore. But if you want to poke at that idea,
feel free.

In the meantime, I noticed that my patch was intermittently failing
in CI, and was able to reproduce that locally. It turns out I'd
missed the point that we might accumulate some null-keyed tuples
into local tuplestores during a parallel HJ_BUILD_HASHTABLE step.
Ordinarily that doesn't matter because we'll dump them anyway at
conclusion of the first batch. But with the right timing, we might
collect some tuples and yet, by the time we're ready to process a
batch, there are none left to do. Then the state machine fell out
without ever dumping those tuples. (For some reason this is way
easier to reproduce under FreeBSD than Linux --- scheduler quirk
I guess.)

v2 attached fixes that, and improves some comments.

regards, tom lane

Attachment Content-Type Size
v2-0001-Improve-hash-join-s-handling-of-tuples-with-null-.patch text/x-diff 36.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2025-06-02 16:50:42 Re: Vacuum statistics
Previous Message Jeff Davis 2025-06-02 16:45:55 Re: pg_upgrade: warn about roles with md5 passwords