Re: Use merge-based matching for MCVs in eqjoinsel

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Use merge-based matching for MCVs in eqjoinsel
Date: 2025-09-08 10:56:07
Message-ID: 412007b6-a324-44f5-91ed-c869004854ec@gmail.com
Lists: pgsql-hackers

Hi Ilia!

> I have read all the previous messages - and yes, you are right. I don’t
> know why I didn’t consider using a hash table approach initially. Your
> idea makes a lot of sense.

Your merge-based solution would still be beneficial on top of that, for
cases where the data type is not hashable. But I think that's overkill
for a v1; I would start with the hash-based version.
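
For reference, below is a minimal, self-contained sketch of the
hash-based idea, not the actual patch: int64 stands in for Datum, and a
toy open-addressing table stands in for PostgreSQL's hash machinery and
the type's hash opclass. Building a table over one MCV list and probing
it with the other turns the O(n1 * n2) nested loop into O(n1 + n2):

/*
 * Toy sketch of hash-based MCV matching. MCV values within one list
 * are distinct by construction, so inserts never see duplicates.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t
hash_int64(int64_t v)
{
    /* splitmix64-style mixer; stands in for the type's hash function */
    uint64_t    h = (uint64_t) v;

    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    return (uint32_t) h;
}

static void
match_mcvs_hashed(const int64_t *mcv1, int n1,
                  const int64_t *mcv2, int n2,
                  bool *hasmatch1, bool *hasmatch2)
{
    uint32_t    nslots = 8;
    uint32_t    mask;
    int        *slots;

    /* size the table to at least 2 * n1 slots, power of two */
    while (nslots < (uint32_t) (2 * n1))
        nslots <<= 1;
    mask = nslots - 1;

    slots = malloc(nslots * sizeof(int));
    for (uint32_t s = 0; s < nslots; s++)
        slots[s] = -1;          /* -1 marks an empty slot */

    /* build: insert indexes of the first MCV list */
    for (int i = 0; i < n1; i++)
    {
        uint32_t    pos = hash_int64(mcv1[i]) & mask;

        while (slots[pos] != -1)
            pos = (pos + 1) & mask;     /* linear probing */
        slots[pos] = i;
    }

    /* probe: O(n1 + n2) overall instead of the O(n1 * n2) nested loop */
    for (int j = 0; j < n2; j++)
    {
        uint32_t    pos = hash_int64(mcv2[j]) & mask;

        while (slots[pos] != -1)
        {
            if (mcv1[slots[pos]] == mcv2[j])
            {
                hasmatch1[slots[pos]] = true;
                hasmatch2[j] = true;
                break;
            }
            pos = (pos + 1) & mask;
        }
    }

    free(slots);
}

int
main(void)
{
    int64_t     mcv1[] = {1, 2, 3, 5, 8};
    int64_t     mcv2[] = {2, 3, 4, 8};
    bool        hasmatch1[5] = {false};
    bool        hasmatch2[4] = {false};

    match_mcvs_hashed(mcv1, 5, mcv2, 4, hasmatch1, hasmatch2);

    for (int i = 0; i < 5; i++)
        printf("mcv1[%d] = %lld  matched = %d\n",
               i, (long long) mcv1[i], (int) hasmatch1[i]);
    return 0;
}

In the real code the equality check in the probe loop would of course
still go through the operator's equality function, since hash
collisions have to be resolved with the actual operator.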

> To evaluate it, I ran benchmarks on JOB with three variants:
>
> $ ./benchmark.sh master
> $ ./benchmark.sh merge
> $ ./benchmark.sh hash
>
> I compared total planning time across all 113 queries.

Was this built with optimizations enabled? How did you extract the
planning time?

>
> $ python3 planning_time.py master hash
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> ------------------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1886.969
> 1000                      | 1.09                | 2286.922            | 2100.099
> 2500                      | 1.94                | 4647.167            | 2400.711
> 5000                      | 6.15                | 17964.779           | 2919.914
> 7500                      | 10.58               | 38622.443           | 3650.375
> 10000                     | 16.33               | 69538.085           | 4257.864
>
> $ python3 planning_time.py master merge
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
> ------------------------------------------------------------------------------------------
> 100                       | 1.00                | 1892.627            | 1898.622
> 1000                      | 1.12                | 2286.922            | 2033.553
> 2500                      | 1.92                | 4647.167            | 2423.552
> 5000                      | 5.94                | 17964.779           | 3025.739
> 7500                      | 10.48               | 38622.443           | 3684.262
> 10000                     | 16.72               | 69538.085           | 4159.418

I would have expected the delta between the "merge" and "hash" variants
to be bigger, especially for default_statistics_target=10000; my own
small test showed as much. Any idea why that's not showing up in your
results?

> Based on these results, I’d prefer the hash lookup implementation, so I
> think it makes sense to improve your patch further and bring it into
> good shape. Shall I take care of that, or would you prefer to do it
> yourself?

Process-wise, I think it's best if you review my code and I make the
changes.

Could you, as part of your review, test tables with just a few MCVs to
make sure we're not regressing "small" cases? For now I'm only bailing
out if one of the two MCV lists has just a single value. I expect the
gains from fine-tuning this threshold to not be measurable, but let's
double-check.
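
For concreteness, this is roughly the shape of that bail-out; the
threshold constant and both helper names are made up for illustration,
and only the single-value case reflects what the patch currently does:

/* hypothetical guard before the MCV matching step */
#define MCV_HASH_MIN_NVALUES    2   /* tune based on the "small" case tests */

if (sslot1->nvalues < MCV_HASH_MIN_NVALUES ||
    sslot2->nvalues < MCV_HASH_MIN_NVALUES)
    match_mcvs_nested_loop(sslot1, sslot2, hasmatch1, hasmatch2); /* O(n1 * n2) */
else
    match_mcvs_hashed(sslot1, sslot2, hasmatch1, hasmatch2);      /* O(n1 + n2) */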

--
David Geier
