| From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq |
| Date: | 2026-02-25 12:19:06 |
| Message-ID: | 943fbefa-6514-42ba-a4c6-92a67ab542c3@app.fastmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> This bug seems to sometimes cause the wrong table, the larger table, to
> be hashed in a Hash Join, and the smaller table to be used for probing.
...
> The fix causes quite a lot of plans in
> src/test/regress/expected/partition_join.out to change, which makes me a
> bit worried I might have misunderstood something here. I haven't
> verified if all the new plans are improvements, I just copied the result
> file to the expected dir.
I've now investigated all the plan changes in
src/test/regress/expected/partition_join.out
due to this fix, and now feel confident this is a bug,
and that the bug fix is correct.
To benchmark, I beefed up the populated data in partition_join.sql
by x10000, e.g:
-CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
-CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
-CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
-INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 599) i WHERE i % 2 = 0;
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (2500000);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (5000000) TO (6000000);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (2500000) TO (5000000);
+INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 5999999) i WHERE i % 2 = 0;
I then measured all queries that produced a different plan,
using EXPLAIN ANALYZE, here are the results:
joel=# SELECT COUNT(*), COUNT(*) FILTER (WHERE execution_time_head > execution_time_patch) AS faster, COUNT(*) FILTER (WHERE execution_time_head < execution_time_patch) AS slower FROM partition_join_bench;
count | faster | slower
-------+--------+--------
27 | 16 | 11
(1 row)
joel=# SELECT SUM(execution_time_head) AS total_execution_time_master, SUM(execution_time_patch) AS total_execution_time_patch, 1-SUM(execution_time_patch)/SUM(execution_time_head) AS improvement FROM partition_join_bench;
total_execution_time_master | total_execution_time_patch | improvement
-----------------------------+----------------------------+------------------------
3577.826 | 2892.280 | 0.19160965345995026030
(1 row)
joel=# SELECT execution_time_head, execution_time_patch, execution_time_head-execution_time_patch AS diff FROM partition_join_bench ORDER BY 3;
execution_time_head | execution_time_patch | diff
---------------------+----------------------+---------
128.481 | 170.469 | -41.988
59.927 | 84.131 | -24.204
63.928 | 87.188 | -23.260
57.315 | 78.443 | -21.128
65.779 | 84.669 | -18.890
65.456 | 81.128 | -15.672
57.349 | 72.832 | -15.483
63.383 | 77.267 | -13.884
60.248 | 73.359 | -13.111
61.173 | 67.388 | -6.215
67.052 | 69.475 | -2.423
79.368 | 78.874 | 0.494
61.533 | 56.617 | 4.916
108.781 | 92.301 | 16.480
124.661 | 98.540 | 26.121
146.671 | 117.109 | 29.562
112.973 | 79.949 | 33.024
119.745 | 82.465 | 37.280
145.449 | 99.523 | 45.926
239.796 | 166.813 | 72.983
228.056 | 154.956 | 73.100
225.025 | 145.068 | 79.957
261.493 | 173.595 | 87.898
245.301 | 157.054 | 88.247
239.626 | 149.158 | 90.468
243.589 | 147.587 | 96.002
245.668 | 146.322 | 99.346
(27 rows)
In total the improvement is about 20%.
The faster queries are due to swapping the build/probe side,
so that the planner hash the smaller filtered side instead
of the larger unfiltered side.
The slower queries are due to fixing the hash join cost estimate,
which makes the makes hash join look cheaper than nested loop.
But at this data scale, nested loop is still a win for such cases.
I benchmarked just in case I'd missed something.
These results makes me confident we have a bug,
and that this fix is correct.
/Joel
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Aleksander Alekseev | 2026-02-25 12:26:21 | Passing mac-addresses by value? |
| Previous Message | Jakub Wartak | 2026-02-25 11:57:19 | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |