Re: Use merge-based matching for MCVs in eqjoinsel

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Date: 2025-11-17 15:17:25
Message-ID: f0a2c63a-5776-462a-b9b7-434158297a9d@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 14.11.2025 01:21, Tom Lane wrote:
> This didn't look very much like the refactorization that I had in
> mind: I thought we should have one copy of the matching code, not two.
> Also, after looking closer at your patch I realized you were just
> punting for cross-type comparison operators, which I felt was kind
> of sad. It's a little bit tricky to get simplehash.h to go along
> with cross-type hashing, because it wants to use just one hash and
> one equality function. But since those are interface routines we
> are going to supply anyway, we can make them deal with the insert
> and lookup cases differently.

I had considered the cross-type comparison operators and I didn’t see a
clean way to support them, so I intentionally excluded cross-type cases
from hash probing. Your suggestion to switch the hash function in probe
mode is clearly a more correct approach than simply rejecting those
cases. Thanks for the explanation.

>
> So after a bit of hacking I ended up with the attached. I split up
> the refactorization into several steps to make it easier to review.
> (But I'd anticipate squashing these into one commit in the end,
> so I didn't spend a lot of time on the commit messages.)

I reviewed patches 0002-0004 with the refactoring, and I think the
overall approach is excellent. However, I noticed one issue: in
eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead
to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before
the refactoring it was initialized by zero.

> Also, 0001 in this series is not meant to be committed; what it
> does is to add some debug logging to ease comparing runtimes of
> different versions of eqjoinsel. I was able to use that to
> convince myself that the refactoring steps didn't cost anything
> meaningful in performance. Perhaps we could use it to investigate
> the right hashing threshold more carefully, too.

With 0001 patch I tested the selectivity calculation time for SEMI JOIN
after applying patches 0002-0004, and the time was cut in half. Thank
you for the work on that.

>
> There are still a couple of XXX comments in the attached, denoting
> loose ends to look at. In particular, I wondered whether the
> hash threshold check
>
> if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)
>
> should use Max() instead --- that is, it might be safer to hash
> if either MCV list is long. Or, holding one's head at a different
> angle, perhaps the sum of the list lengths should be what's checked?
>
> regards, tom lane
>

Hmm… using the sum actually seems like a good idea for me. It may
provide a smoother switch-over point between the two MCV-scanning
algorithms when both list lengths are below the threshold. But this
definitely needs to be validated by measuring different MCV lengths
below the threshold using the 0001 patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Geier 2025-11-17 15:25:35 Re: Use merge-based matching for MCVs in eqjoinsel
Previous Message Robert Haas 2025-11-17 15:09:36 Re: pg_plan_advice