Re: Use merge-based matching for MCVs in eqjoinsel

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Date: 2025-11-10 14:20:42
Message-ID: f8e6b122-b598-4adb-b91e-89fb3cdc1db5@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the detailed feedback!

On 04.11.2025 00:55, Tom Lane wrote:

> Hmm. Those results sure look like there is a performance regression
> up to at least 100 MCVs ... not a large one, but consistently a few
> percent. That's a bit sad for a patch purporting to improve
> performance. It looks to me like perhaps we should stick to the old
> algorithm up to 100 or possibly even more MCVs. Certainly the
> threshold needs to be higher than 1, as you have it now.

I re-ran the benchmark on JOB with a threshold of 100.Here are the
updated results:

default_statistics_target | Planner Speedup (×) | Planner Before (ms) |
Planner After (ms)
--------------------------------------------------------------------------------
1                         | 1.00                | 2320.412            |
2318.377
5                         | 0.99                | 2335.894            |
2360.890
10                        | 1.00                | 2350.612            |
2347.154
25                        | 1.01                | 2365.977            |
2342.312
50                        | 0.99                | 2381.554            |
2405.262
75                        | 1.00                | 2396.481            |
2399.828
100                       | 1.00                | 2410.367            |
2412.456
1000                      | 1.11                | 2850.853            |
2564.303
2500                      | 2.04                | 5571.688            |
2731.545
5000                      | 6.05                | 18850.084           |
3114.692
7500                      | 11.96               | 39160.898           |
3273.688
10000                     | 19.04               | 71334.113           |
3745.955

> I eyeballed the patch itself very briefly, and have a couple
> quick comments:
>
> * Is hash_msv a typo for hash_mcv? If not, maybe spell out what
> it's supposed to mean.

Yes, that was a typo — fixed.

> * The patch would be easier to read if it didn't reindent a couple
> large chunks of existing code. Can we change the factorization
> to avoid that? If not, I'd recommend submitting without that
> reindentation, and reminding the committer to reindent at the last
> moment.

Fixed as well. I’ve removed all reindentation changes. I will keep that
in mind for future submissions.

> * The calculation loops in eqjoinsel_inner and eqjoinsel_semi
> are not identical, which makes it look quite weird to be
> writing just one function that conditionally replaces both.
> I wonder if we should refactor to have just one copy (and
> just eat the extra cycles of calculating matchprodfreq).

Agreed. I’ve dropped the attempt to merge them into a single function.

>
> * In fact ... I wonder if we should try harder to not do essentially
> identical calculations twice, remembering that eqjoinsel_semi is
> always used alongside eqjoinsel_inner. (Of course, we could only do
> that if clamped_nvalues2 is the same as sslot2->nvalues, but that's
> frequently true.) I think the reason it's duplicative right now
> is that we regarded this semijoin calculation as an experiment and
> so didn't want to invest a lot of effort in it ... but this patch
> is exactly a lot of effort, so maybe it's time to deal with that
> unfinished business.
>
> regards, tom lane

Good point. I addressed this in a separate patch: eqjoinsel_inner() now
saves matchfreq1, matchfreq2, nmatches so that eqjoinsel_semi() can
reuse them when (clamped_nvalues2 == sslot2->nvalues). If the MCV list
on the RHS is clamped, we still recompute locally. If you have a cleaner
idea for how to share these values between the two functions without
passing them explicitly, I’d be happy to consider it.

I’m attaching two patches:
1. v4-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch -
removes redundant MCV matching for semi/anti joins;
2. v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch -
adds hash-based MCV matching with a configurable threshold and includes
fixes based on your comments.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment Content-Type Size
v1-0001-Avoid-duplicate-MCV-matching-in-eqjoinsel_semi-an.patch text/x-patch 5.2 KB
v4-0002-Optimize-MCV-matching-in-eqjoinsel_inner-and-eqjo.patch text/x-patch 11.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2025-11-10 14:21:13 Re: Minor adjustment: Update the range of the commit_siblings parameter.
Previous Message Heikki Linnakangas 2025-11-10 14:17:49 Re: Move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h