
From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
Date: 2026-02-24 18:21:06
Message-ID: 341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com
Lists: pgsql-hackers

Hi hackers,

I think I might have stumbled upon a planner bug in
src/backend/utils/adt/selfuncs.c while working on something else.

I've traced it to commit bd3e3e9 (Ensure sanity of hash-join costing
when there are no MCV statistics.), which added a fallback computation
of mcv_freq. That change surfaced a pre-existing bug in a branch that
was previously not taken when mcv_freq was zero:

    if (avgfreq > 0.0 && *mcv_freq > avgfreq)
        estfract *= *mcv_freq / avgfreq;

Since this expression is a ratio, it seems important that the
numerator and denominator are measured over the same base.

mcv_freq is [freq of most common values in filtered relation]

However, avgfreq is currently [avg freq of all distinct data values in
raw relation]:

    /* Compute avg freq of all distinct data values in raw relation */
    avgfreq = (1.0 - stanullfrac) / ndistinct;

Only after avgfreq has been assigned is ndistinct adjusted to account
for restriction clauses:

    if (vardata.rel && vardata.rel->tuples > 0)
    {
        ndistinct *= vardata.rel->rows / vardata.rel->tuples;
        ndistinct = clamp_row_est(ndistinct);
    }

The expression further down

    estfract *= *mcv_freq / avgfreq;

therefore divides two frequencies computed over different bases, when
I believe they should share the same base.

If we simply move the computation of avgfreq to after the adjustment
of ndistinct, avgfreq will be of the same base as mcv_freq.

This bug sometimes seems to cause the wrong table to be hashed in a
Hash Join: the larger table is hashed, and the smaller table is used
for probing.

Here is a demo:

CREATE TABLE orders (
    order_id bigint NOT NULL,
    tracking_code bigint,
    region int NOT NULL,
    status text NOT NULL,
    order_data text
);

CREATE TABLE shipments (
    shipment_id bigint NOT NULL,
    tracking_code bigint NOT NULL,
    carrier text NOT NULL,
    ship_data text
);

INSERT INTO orders (order_id, tracking_code, region, status, order_data)
SELECT
    g,
    CASE WHEN g <= 150000 THEN g ELSE NULL END,
    (g % 1000) + 1,
    CASE WHEN g <= 150000 THEN 'shipped' ELSE 'pending' END,
    'order-' || g
FROM generate_series(1, 200000) g;

INSERT INTO shipments (shipment_id, tracking_code, carrier, ship_data)
SELECT
    g,
    g,
    CASE (g % 3) WHEN 0 THEN 'FedEx' WHEN 1 THEN 'UPS' ELSE 'DHL' END,
    repeat('x', 200) || g::text
FROM generate_series(1, 150000) g;

CREATE INDEX ON orders (region);

ANALYZE orders;
ANALYZE shipments;

EXPLAIN ANALYZE
SELECT o.order_id, o.tracking_code, s.carrier, s.ship_data
FROM orders o
JOIN shipments s ON s.tracking_code = o.tracking_code
WHERE o.region = 42;

master (65707ed):
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=9138.25..14237.55 rows=149 width=229) (actual time=70.984..78.977 rows=150.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=6557, temp read=3987 written=4068
   ->  Parallel Hash Join  (cost=8138.25..13222.65 rows=88 width=229) (actual time=68.237..75.806 rows=75.00 loops=2)
         Hash Cond: (o.tracking_code = s.tracking_code)
         Buffers: shared hit=6557, temp read=3987 written=4068
         ->  Parallel Seq Scan on orders o  (cost=0.00..3188.59 rows=117 width=16) (actual time=0.076..12.053 rows=100.00 loops=2)
               Filter: (region = 42)
               Rows Removed by Filter: 99900
               Buffers: shared hit=1718
         ->  Parallel Hash  (cost=5464.00..5464.00 rows=62500 width=221) (actual time=53.216..53.217 rows=75000.00 loops=2)
               Buckets: 32768  Batches: 8  Memory Usage: 5120kB
               Buffers: shared hit=4839, temp written=4004
               ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.009..16.114 rows=75000.00 loops=2)
                     Buffers: shared hit=4839
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.350 ms
 Execution Time: 79.011 ms

0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1578.75..7292.64 rows=149 width=229) (actual time=0.521..15.592 rows=150.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=5447 read=3
   ->  Hash Join  (cost=578.75..6277.74 rows=62 width=229) (actual time=0.667..12.342 rows=50.00 loops=3)
         Hash Cond: (s.tracking_code = o.tracking_code)
         Buffers: shared hit=5447 read=3
         ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.006..5.581 rows=50000.00 loops=3)
               Buffers: shared hit=4839
         ->  Hash  (cost=576.26..576.26 rows=199 width=16) (actual time=0.429..0.429 rows=150.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               Buffers: shared hit=608 read=3
               ->  Bitmap Heap Scan on orders o  (cost=5.84..576.26 rows=199 width=16) (actual time=0.083..0.396 rows=200.00 loops=3)
                     Recheck Cond: (region = 42)
                     Heap Blocks: exact=200
                     Buffers: shared hit=608 read=3
                     ->  Bitmap Index Scan on orders_region_idx  (cost=0.00..5.79 rows=199 width=0) (actual time=0.039..0.039 rows=200.00 loops=3)
                           Index Cond: (region = 42)
                           Index Searches: 3
                           Buffers: shared hit=8 read=3
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.371 ms
 Execution Time: 15.634 ms

The fix changes quite a lot of plans in
src/test/regress/expected/partition_join.out, which makes me a bit
worried that I might have misunderstood something here. I haven't
verified that all the new plans are improvements; I just copied the
results file into the expected directory.

/Joel

Attachment Content-Type Size
0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch application/octet-stream 35.1 KB
