From: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
---|---|
To: | David Geier <geidav(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Use merge-based matching for MCVs in eqjoinsel |
Date: | 2025-09-16 15:52:47 |
Message-ID: | c3dbf2ab-d72d-4033-822a-60ad8023f499@tantorlabs.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi hackers,
On 10.09.2025 16:56, Ilia Evdokimov wrote:
> Unfortunately, the JOB benchmark does not contain semi join nodes.
> However, TPC-DS does. I'll look for the queries with slowest planner
> times there and check them.
>
> I'll need some time to check both join and semi join cases with small
> and large default_statistics_target. I'll share the results later.
JOIN
==============================
I’ve benchmarked the new implementation of eqjoinsel() with different
values of default_statistics_target. On small targets (1, 5, 10, 25, 50,
75, 100) the results are all within statistical noise, and I did not
observe any regressions. In my view, it’s reasonable to keep the current
condition that the hash table is not used for default_statistics_target
= 1. Raising that threshold does not seem useful.
Here are the results for JOB queries (where the effect of semi join is
not visible due to different data distributions):
default_statistics_target | Planner Speedup (×) | Planner Before (ms) |
Planner After (ms)
------------------------------------------------------------------------------------------
1 | 1.00 | 1846.643 |
1847.409
5 | 1.00 | 1836.391 |
1828.318
10 | 0.95 | 1841.750 |
1929.722
25 | 0.99 | 1873.172 |
1890.741
50 | 0.98 | 1869.897 |
1898.470
75 | 1.02 | 1969.368 |
1929.521
100 | 0.97 | 1857.890 |
1921.207
1000 | 1.14 | 2279.700 |
1997.102
2500 | 1.78 | 4682.658 |
2636.202
5000 | 6.45 | 15943.696 |
2471.242
7500 | 12.45 | 34350.855 |
2758.565
10000 | 20.52 | 62519.342 |
3046.819
SEMI JOIN
==============================
Unfortunately, in TPC-DS it is not possible to clearly see improvements
for semi joins. To address this, I designed a synthetic example where
the data distribution forces the loop to run fully, without exiting
early, which makes the effect on semi joins more visible. In this setup,
I also ensured that the length of the MCV array is equal to the chosen
default_statistics_target.
CREATE TABLE t1 AS
SELECT CASE
WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1
ELSE (g % 1000000) + 10000
END AS id
FROM generate_series(1, 3000000) g;
CREATE TABLE t2 AS
SELECT CASE
WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001
ELSE (g % 1000000) + 20000
END AS id
FROM generate_series(1, 3000000) g;
ANALYZE t1, t2;
The results of the query are:
SELECT * FROM t1
WHERE id IN (SELECT id FROM t2);
default_statistics_target | Planner Speedup (×) | Planner Before (ms) |
Planner After (ms)
------------------------------------------------------------------------------------------
1 | 1.12 | 1.191 |
1.062
5 | 1.02 | 0.493 |
0.481
10 | 0.92 | 0.431 |
0.471
25 | 1.27 | 0.393 |
0.309
50 | 1.04 | 0.432 |
0.416
75 | 0.96 | 0.398 |
0.415
100 | 0.95 | 0.450 |
0.473
1000 | 9.42 | 6.742 |
0.716
2500 | 19.15 | 21.621 |
1.129
5000 | 46.74 | 85.667 |
1.833
7500 | 73.26 | 194.806 |
2.659
10000 | 107.95 | 349.981 |
3.242
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com
From | Date | Subject | |
---|---|---|---|
Next Message | Shirisha Shirisha | 2025-09-16 16:37:48 | Re: Proposal to allow DELETE/UPDATE on partitioned tables with unsupported foreign partitions |
Previous Message | Bruce Momjian | 2025-09-16 15:49:11 | Re: PG 18 release notes draft committed |