Re: Use merge-based matching for MCVs in eqjoinsel

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Date: 2025-11-13 22:21:54
Message-ID: 3026409.1763072514@sss.pgh.pa.us
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> writes:
> Good point. I addressed this in a separate patch: eqjoinsel_inner() now
> saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can
> reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list
> on the RHS is clamped, we still recompute locally. If you have a cleaner
> idea for how to share these values between the two functions without
> passing them explicitly, I’d be happy to consider it.

This didn't look very much like the refactorization that I had in
mind: I thought we should have one copy of the matching code, not two.
Also, after looking closer at your patch I realized you were just
punting for cross-type comparison operators, which I felt was kind
of sad. It's a little bit tricky to get simplehash.h to go along
with cross-type hashing, because it wants to use just one hash and
one equality function. But since those are interface routines we
are going to supply anyway, we can make them deal with the insert
and lookup cases differently.

So after a bit of hacking I ended up with the attached. I split up
the refactorization into several steps to make it easier to review.
(But I'd anticipate squashing these into one commit in the end,
so I didn't spend a lot of time on the commit messages.)

Also, 0001 in this series is not meant to be committed; what it
does is to add some debug logging to ease comparing runtimes of
different versions of eqjoinsel. I was able to use that to
convince myself that the refactoring steps didn't cost anything
meaningful in performance. Perhaps we could use it to investigate
the right hashing threshold more carefully, too.

There are still a couple of XXX comments in the attached, denoting
loose ends to look at. In particular, I wondered whether the
hash threshold check

if (Min(sslot1.nvalues, sslot2.nvalues) >= EQJOINSEL_MCV_HASH_THRESHOLD)

should use Max() instead --- that is, it might be safer to hash
if either MCV list is long. Or, holding one's head at a different
angle, perhaps the sum of the list lengths should be what's checked?

regards, tom lane

Attachment Content-Type Size
v5-0001-Add-some-test-scaffolding-to-join_selectivity.patch text/x-diff 2.4 KB
v5-0002-Factor-out-duplicative-code-in-eqjoinsel_inner-eq.patch text/x-diff 8.9 KB
v5-0003-Rethink-eqjoinsel-s-handling-of-reversed-joins.patch text/x-diff 6.3 KB
v5-0004-Share-more-work-between-eqjoinsel_inner-and-eqjoi.patch text/x-diff 12.3 KB
v5-0005-Use-hashing-to-avoid-O-N-2-matching-work-in-eqjoi.patch text/x-diff 14.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2025-11-13 22:36:06 Re: Performance issues with parallelism and LIMIT
Previous Message Sami Imseih 2025-11-13 22:02:24 Re: pgsql: Drop unnamed portal immediately after execution to completion