| From: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Use merge-based matching for MCVs in eqjoinsel |
| Date: | 2025-11-17 15:17:25 |
| Message-ID: | f0a2c63a-5776-462a-b9b7-434158297a9d@tantorlabs.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 14.11.2025 01:21, Tom Lane wrote:
> This didn't look very much like the refactorization that I had in
> mind: I thought we should have one copy of the matching code, not two.
> Also, after looking closer at your patch I realized you were just
> punting for cross-type comparison operators, which I felt was kind
> of sad. It's a little bit tricky to get simplehash.h to go along
> with cross-type hashing, because it wants to use just one hash and
> one equality function. But since those are interface routines we
> are going to supply anyway, we can make them deal with the insert
> and lookup cases differently.
I had considered the cross-type comparison operators and I didn’t see a
clean way to support them, so I intentionally excluded cross-type cases
from hash probing. Your suggestion to switch the hash function in probe
mode is clearly a more correct approach than simply rejecting those
cases. Thanks for the explanation.
>
> So after a bit of hacking I ended up with the attached. I split up
> the refactorization into several steps to make it easier to review.
> (But I'd anticipate squashing these into one commit in the end,
> so I didn't spend a lot of time on the commit messages.)
I reviewed patches 0002-0004 with the refactoring, and I think the
overall approach is excellent. However, I noticed one issue: in
eqjoinsel_semi() the variable 'nmatches' is not initialized and can lead
to undefined behavior when clamped_nvalues2 == sslot2->nvalues. Before
the refactoring it was initialized by zero.
> Also, 0001 in this series is not meant to be committed; what it
> does is to add some debug logging to ease comparing runtimes of
> different versions of eqjoinsel. I was able to use that to
> convince myself that the refactoring steps didn't cost anything
> meaningful in performance. Perhaps we could use it to investigate
> the right hashing threshold more carefully, too.
With 0001 patch I tested the selectivity calculation time for SEMI JOIN
after applying patches 0002-0004, and the time was cut in half. Thank
you for the work on that.
>
> There are still a couple of XXX comments in the attached, denoting
> loose ends to look at. In particular, I wondered whether the
> hash threshold check
>
> if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)
>
> should use Max() instead --- that is, it might be safer to hash
> if either MCV list is long. Or, holding one's head at a different
> angle, perhaps the sum of the list lengths should be what's checked?
>
> regards, tom lane
>
Hmm… using the sum actually seems like a good idea for me. It may
provide a smoother switch-over point between the two MCV-scanning
algorithms when both list lengths are below the threshold. But this
definitely needs to be validated by measuring different MCV lengths
below the threshold using the 0001 patch.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | David Geier | 2025-11-17 15:25:35 | Re: Use merge-based matching for MCVs in eqjoinsel |
| Previous Message | Robert Haas | 2025-11-17 15:09:36 | Re: pg_plan_advice |