Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
Date: 2026-02-25 12:19:06
Message-ID: 943fbefa-6514-42ba-a4c6-92a67ab542c3@app.fastmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> This bug seems to sometimes cause the wrong table, the larger table, to
> be hashed in a Hash Join, and the smaller table to be used for probing.
...
> The fix causes quite a lot of plans in
> src/test/regress/expected/partition_join.out to change, which makes me a
> bit worried I might have misunderstood something here. I haven't
> verified if all the new plans are improvements, I just copied the result
> file to the expected dir.

I've now investigated all the plan changes in

src/test/regress/expected/partition_join.out

due to this fix, and now feel confident this is a bug,
and that the bug fix is correct.

To benchmark, I beefed up the populated data in partition_join.sql
by x10000, e.g:

-CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
-CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
-CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
-INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 599) i WHERE i % 2 = 0;
+CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (2500000);
+CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (5000000) TO (6000000);
+CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (2500000) TO (5000000);
+INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 5999999) i WHERE i % 2 = 0;

I then measured all queries that produced a different plan,
using EXPLAIN ANALYZE, here are the results:

joel=# SELECT COUNT(*), COUNT(*) FILTER (WHERE execution_time_head > execution_time_patch) AS faster, COUNT(*) FILTER (WHERE execution_time_head < execution_time_patch) AS slower FROM partition_join_bench;
count | faster | slower
-------+--------+--------
27 | 16 | 11
(1 row)

joel=# SELECT SUM(execution_time_head) AS total_execution_time_master, SUM(execution_time_patch) AS total_execution_time_patch, 1-SUM(execution_time_patch)/SUM(execution_time_head) AS improvement FROM partition_join_bench;
total_execution_time_master | total_execution_time_patch | improvement
-----------------------------+----------------------------+------------------------
3577.826 | 2892.280 | 0.19160965345995026030
(1 row)

joel=# SELECT execution_time_head, execution_time_patch, execution_time_head-execution_time_patch AS diff FROM partition_join_bench ORDER BY 3;
execution_time_head | execution_time_patch | diff
---------------------+----------------------+---------
128.481 | 170.469 | -41.988
59.927 | 84.131 | -24.204
63.928 | 87.188 | -23.260
57.315 | 78.443 | -21.128
65.779 | 84.669 | -18.890
65.456 | 81.128 | -15.672
57.349 | 72.832 | -15.483
63.383 | 77.267 | -13.884
60.248 | 73.359 | -13.111
61.173 | 67.388 | -6.215
67.052 | 69.475 | -2.423
79.368 | 78.874 | 0.494
61.533 | 56.617 | 4.916
108.781 | 92.301 | 16.480
124.661 | 98.540 | 26.121
146.671 | 117.109 | 29.562
112.973 | 79.949 | 33.024
119.745 | 82.465 | 37.280
145.449 | 99.523 | 45.926
239.796 | 166.813 | 72.983
228.056 | 154.956 | 73.100
225.025 | 145.068 | 79.957
261.493 | 173.595 | 87.898
245.301 | 157.054 | 88.247
239.626 | 149.158 | 90.468
243.589 | 147.587 | 96.002
245.668 | 146.322 | 99.346
(27 rows)

In total the improvement is about 20%.

The faster queries are due to swapping the build/probe side,
so that the planner hash the smaller filtered side instead
of the larger unfiltered side.

The slower queries are due to fixing the hash join cost estimate,
which makes the makes hash join look cheaper than nested loop.
But at this data scale, nested loop is still a win for such cases.

I benchmarked just in case I'd missed something.
These results makes me confident we have a bug,
and that this fix is correct.

/Joel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2026-02-25 12:26:21 Passing mac-addresses by value?
Previous Message Jakub Wartak 2026-02-25 11:57:19 Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?