From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Improve hash join's handling of tuples with null join keys |
Date: | 2025-06-02 16:47:29 |
Message-ID: | 496221.1748882849@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> On Tue, May 6, 2025 at 12:12 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> On 5/6/25 01:11, Tom Lane wrote:
>>> The attached patch is a response to the discussion at [1], where
>>> it emerged that lots of rows with null join keys can send a hash
>>> join into too-many-batches hell, if they are on the outer side
>>> of the join so that they must be null-extended not just discarded.
> Good idea. I haven't reviewed it properly, but one observation is
> that trapping the null-keys tuples in per-worker tuple stores creates
> unfairness. That could be fixed by using a SharedTuplestore instead,
> but unfortunately SharedTuplestore always spills to disk at the
> moment, so maybe I should think about how to give it some memory for
> small sets like regular Tuplestore. Will look more closely after
> Montreal.
Hmm ... I'm unpersuaded that "fairness" is an argument for adding
overhead to the processing of these tuples. It's very hard to see
how shoving them into a shared tuplestore can beat not shoving them
into a shared tuplestore. But if you want to poke at that idea,
feel free.
In the meantime, I noticed that my patch was intermittently failing
in CI, and was able to reproduce that locally. It turns out I'd
missed the point that we might accumulate some null-keyed tuples
into local tuplestores during a parallel HJ_BUILD_HASHTABLE step.
Ordinarily that doesn't matter because we'll dump them anyway at
conclusion of the first batch. But with the right timing, we might
collect some tuples and yet, by the time we're ready to process a
batch, there are none left to do. Then the state machine fell out
without ever dumping those tuples. (For some reason this is way
easier to reproduce under FreeBSD than Linux --- scheduler quirk
I guess.)
v2 attached fixes that, and improves some comments.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Improve-hash-join-s-handling-of-tuples-with-null-.patch | text/x-diff | 36.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Alena Rybakina | 2025-06-02 16:50:42 | Re: Vacuum statistics |
Previous Message | Jeff Davis | 2025-06-02 16:45:55 | Re: pg_upgrade: warn about roles with md5 passwords |