| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Use merge-based matching for MCVs in eqjoinsel |
| Date: | 2025-11-17 15:28:23 |
| Message-ID: | 3a4211f6-c6d6-4ff3-a5c5-231179e4e519@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Ilia!
On 16.09.2025 17:52, Ilia Evdokimov wrote:
> Hi hackers,
>
> On 10.09.2025 16:56, Ilia Evdokimov wrote:
>> Unfortunately, the JOB benchmark does not contain semi join nodes.
>> However, TPC-DS does. I'll look for the queries with slowest planner
>> times there and check them.
>>
>> I'll need some time to check both join and semi join cases with small
>> and large default_statistics_target. I'll share the results later.
>
> JOIN
> ==============================
>
> I’ve benchmarked the new implementation of eqjoinsel() with different
> values of default_statistics_target. On small targets (1, 5, 10, 25, 50,
> 75, 100) the results are all within statistical noise, and I did not
> observe any regressions. In my view, it’s reasonable to keep the current
> condition that the hash table is not used for default_statistics_target
> = 1. Raising that threshold does not seem useful.
>
> Here are the results for JOB queries (where the effect of semi join is
> not visible due to different data distributions):
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) |
> Planner After (ms)
> ------------------------------------------------------------------------------------------
> 1 | 1.00 | 1846.643 |
> 1847.409
> 5 | 1.00 | 1836.391 |
> 1828.318
> 10 | 0.95 | 1841.750 |
> 1929.722
> 25 | 0.99 | 1873.172 |
> 1890.741
> 50 | 0.98 | 1869.897 |
> 1898.470
> 75 | 1.02 | 1969.368 |
> 1929.521
> 100 | 0.97 | 1857.890 |
> 1921.207
> 1000 | 1.14 | 2279.700 |
> 1997.102
> 2500 | 1.78 | 4682.658 |
> 2636.202
> 5000 | 6.45 | 15943.696 |
> 2471.242
> 7500 | 12.45 | 34350.855 |
> 2758.565
> 10000 | 20.52 | 62519.342 |
> 3046.819
>
Good that we've confirmed that.
> SEMI JOIN
> ==============================
>
> Unfortunately, in TPC-DS it is not possible to clearly see improvements
> for semi joins. To address this, I designed a synthetic example where
> the data distribution forces the loop to run fully, without exiting
> early, which makes the effect on semi joins more visible. In this setup,
> I also ensured that the length of the MCV array is equal to the chosen
> default_statistics_target.
>
> CREATE TABLE t1 AS
> SELECT CASE
> WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1
> ELSE (g % 1000000) + 10000
> END AS id
> FROM generate_series(1, 3000000) g;
>
> CREATE TABLE t2 AS
> SELECT CASE
> WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001
> ELSE (g % 1000000) + 20000
> END AS id
> FROM generate_series(1, 3000000) g;
>
> ANALYZE t1, t2;
>
> The results of the query are:
>
> SELECT * FROM t1
> WHERE id IN (SELECT id FROM t2);
>
> default_statistics_target | Planner Speedup (×) | Planner Before (ms) |
> Planner After (ms)
> ------------------------------------------------------------------------------------------
> 1 | 1.12 | 1.191 |
> 1.062
> 5 | 1.02 | 0.493 |
> 0.481
> 10 | 0.92 | 0.431 |
> 0.471
> 25 | 1.27 | 0.393 |
> 0.309
> 50 | 1.04 | 0.432 |
> 0.416
> 75 | 0.96 | 0.398 |
> 0.415
> 100 | 0.95 | 0.450 |
> 0.473
> 1000 | 9.42 | 6.742 |
> 0.716
> 2500 | 19.15 | 21.621 |
> 1.129
> 5000 | 46.74 | 85.667 |
> 1.833
> 7500 | 73.26 | 194.806 |
> 2.659
> 10000 | 107.95 | 349.981 |
> 3.242
>
That's some decent speedups, considering that it's planning time.
Thanks for testing the code!
--
David Geier
| From | Date | Subject | |
|---|---|---|---|
| Next Message | David Geier | 2025-11-17 15:38:43 | Re: tuple radix sort |
| Previous Message | David Geier | 2025-11-17 15:25:35 | Re: Use merge-based matching for MCVs in eqjoinsel |