| From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
|---|---|
| To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | "Tender Wang" <tndrwang(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq |
| Date: | 2026-03-03 15:05:29 |
| Message-ID: | 5ddabbff-d0cb-4891-85fc-142f6d39d641@app.fastmail.com |
| Lists: | pgsql-hackers |
On Sun, Mar 1, 2026, at 22:12, Tom Lane wrote:
> Aside: you could argue that failing to consider stanullfrac is wrong,
> and maybe it is. But the more I looked at this code the more
> convinced I got that it was only partially accounting for nulls
> anyway. That seems like perhaps something to look into later.
How about adjusting estfract for the null fraction before clamping?
```diff
@@ -4461,20 +4473,27 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
 	/*
 	 * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
 	 * number of buckets is less than the expected number of distinct values;
 	 * otherwise it is 1/ndistinct.
 	 */
 	if (ndistinct > nbuckets)
 		estfract = 1.0 / nbuckets;
 	else
 		estfract = 1.0 / ndistinct;
 
+	/*
+	 * Adjust for null fraction. NULL keys are not inserted into the hash
+	 * table, but inner_path_rows in final_cost_hashjoin includes them, so we
+	 * must discount estfract to compensate.
+	 */
+	estfract *= (1.0 - stanullfrac);
+
 	/*
 	 * Clamp the bucketsize fraction to be not less than the MCV frequency,
 	 * since whichever bucket the MCV values end up in will have at least that
 	 * size. This has no effect if *mcv_freq is still zero.
 	 */
 	estfract = Max(estfract, *mcv_freq);
 
 	*bucketsize_frac = (Selectivity) estfract;
 
 	ReleaseVariableStats(vardata);
```
Here is an extreme example where 98% of all hj_nullfrac_inner.val are NULL:
```sql
CREATE TABLE hj_nullfrac_inner (id int, val int);
INSERT INTO hj_nullfrac_inner
SELECT g, CASE WHEN g <= 1000 THEN g ELSE NULL END
FROM generate_series(1, 50000) g;
CREATE INDEX ON hj_nullfrac_inner(val);

CREATE TABLE hj_nullfrac_outer (id int, val int);
INSERT INTO hj_nullfrac_outer
SELECT g, ((g-1) % 1000)+1
FROM generate_series(1, 500000) g;

ANALYZE hj_nullfrac_inner;
ANALYZE hj_nullfrac_outer;

EXPLAIN ANALYZE
SELECT * FROM hj_nullfrac_outer o JOIN hj_nullfrac_inner i ON o.val = i.val;
```
Both master (HEAD) and v2 result in the same plan:
```
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..20022.95 rows=499834 width=16) (actual time=0.016..289.314 rows=500000.00 loops=1)
   Buffers: shared hit=5212 read=1
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.449 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Memoize  (cost=0.30..0.32 rows=1 width=8) (actual time=0.000..0.000 rows=1.00 loops=500000)
         Cache Key: o.val
         Cache Mode: logical
         Estimates: capacity=1000 distinct keys=1000 lookups=500000 hit percent=99.80%
         Hits: 499000  Misses: 1000  Evictions: 0  Overflows: 0  Memory Usage: 106kB
         Buffers: shared hit=2999 read=1
         ->  Index Scan using hj_nullfrac_inner_val_idx on hj_nullfrac_inner i  (cost=0.29..0.31 rows=1 width=8) (actual time=0.002..0.002 rows=1.00 loops=1000)
               Index Cond: (val = o.val)
               Index Searches: 1000
               Buffers: shared hit=2999 read=1
 Planning:
   Buffers: shared hit=87 read=7
 Planning Time: 0.419 ms
 Execution Time: 303.107 ms
(18 rows)
```
With the v2+stanullfrac adjustment, we get a Hash Join, which is faster:
```
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1347.00..15436.61 rows=500161 width=16) (actual time=6.054..153.595 rows=500000.00 loops=1)
   Hash Cond: (o.val = i.val)
   Buffers: shared hit=2435
   ->  Seq Scan on hj_nullfrac_outer o  (cost=0.00..7213.00 rows=500000 width=8) (actual time=0.008..31.597 rows=500000.00 loops=1)
         Buffers: shared hit=2213
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=6.041..6.042 rows=1000.00 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 552kB
         Buffers: shared hit=222
         ->  Seq Scan on hj_nullfrac_inner i  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.004..3.016 rows=50000.00 loops=1)
               Buffers: shared hit=222
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.130 ms
 Execution Time: 167.903 ms
(14 rows)
```
Here is elog debug output comparing v2 with v2+stanullfrac adjustment for the example above:

v2:

```
ndistinct=1003.000000
nbuckets=65536.000000
stanullfrac=0.979933
mcv_freq=0.000020
estfract before clamping=0.000997
estfract after clamping=0.000997
```

v2+stanullfrac adjustment:

```
ndistinct=972.000000
nbuckets=65536.000000
stanullfrac=0.980567
mcv_freq=0.000020
estfract before stanullfrac adjustment=0.001029
estfract after stanullfrac adjustment=0.000020
estfract after clamping=0.000020
```
/Joel
| Attachment | Content-Type | Size |
|---|---|---|
| fix-stanullfrac.patch | application/octet-stream | 1.5 KB |