Quick Links

Extended statistics improvement: multi-column MCV missing values

From:	Enrique Sánchez <enriqueesanchz(at)gmail(dot)com>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Extended statistics improvement: multi-column MCV missing values
Date:	2026-05-18 16:09:03
Message-ID:	CAOCkzAkmnAjEWxfP-vBTRta7cHe5CMu8i4SFSeOi2nS8+Kj7Ng@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Postgres only uses multi-column MCVs when the value we are looking for is
in the list. If not, it falls back into individual independent statistics
to estimate selectivity.
However, a miss in a multi-column MCV list still yields valuable
information that it currently throws away: we know that the combination's
frequency is strictly bounded by the frequency of the last (least common)
item in that MCV list.

Test case
=========
```
-- 1. Create a minimal table
DROP TABLE IF EXISTS planner_trap;
CREATE TABLE planner_trap (
col_a integer,
col_b integer,
col_c integer,
sort_col integer
);

-- 2. 1 becomes the most common value (MCV) for all three columns.
INSERT INTO planner_trap (col_a, col_b, col_c, sort_col)
SELECT
(pow(random(), 5) * 100)::integer + 1,
(pow(random(), 5) * 10)::integer + 1,
(pow(random(), 5) * 50)::integer + 1,
i
FROM generate_series(1, 100000) AS i;

-- 3. Create indexes
CREATE INDEX idx_planner_trap_sort ON planner_trap(sort_col);
CREATE INDEX idx_planner_trap_a_b_c ON planner_trap(col_a, col_b, col_c);

-- 4. Make the exact combination (1, 1, 1) yield zero rows. This ensures it
won't appear
-- in the MCV list, even though the value '1' remains the most common for
each individual column.
UPDATE planner_trap
SET col_a = 2
WHERE col_a = 1 AND col_b = 1 AND col_c = 1;

-- 5. Create MCV statistics
CREATE STATISTICS (mcv) on col_a, col_b, col_c FROM planner_trap;
ANALYZE planner_trap;
```

The planner multiplies the individual selectivities and overestimates the
row count (5909; actual=0):
```
postgres=# EXPLAIN ANALYZE SELECT * FROM planner_trap WHERE col_a = 1 AND
col_b = 1 AND col_c = 1 ORDER BY sort_col DESC LIMIT 10;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..7.72 rows=10 width=16) (actual time=14.632..14.633
rows=0.00 loops=1)
Buffers: shared hit=16355 dirtied=1
-> Index Scan Backward using idx_planner_trap_sort on planner_trap
(cost=0.29..4386.99 rows=5909 width=16) (actual time=14.630..14.630
rows=0.00 loops=1)
Filter: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Rows Removed by Filter: 100000
Index Searches: 1
Buffers: shared hit=16355 dirtied=1
Planning:
Buffers: shared hit=33 read=1
Planning Time: 0.478 ms
Execution Time: 14.651 ms
```

1. Cap selectivity with the last MCV item's frequency
=====================================================
Applying the last MCV cap here, Postgres estimates 117 rows: 0.001167 *
100000 = 117
```
postgres=# SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on
(oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat' and index = 99;
index | values | nulls | frequency | base_frequency
-------+----------+---------+-----------------------+----------------
99 | {1,1,31} | {f,f,f} | 0.0011666666666666668 | 0.001049409536
```

making postgres to choose a better plan:
```
Limit (cost=300.72..300.75 rows=10 width=16) (actual time=0.045..0.046
rows=0.00 loops=1)
Buffers: shared hit=3 read=2
-> Sort (cost=300.72..301.02 rows=117 width=16) (actual
time=0.043..0.044 rows=0.00 loops=1)
Sort Key: sort_col DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=3 read=2
-> Bitmap Heap Scan on planner_trap (cost=5.78..298.19 rows=117
width=16) (actual time=0.026..0.027 rows=0.00 loops=1)
Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Buffers: shared read=2
-> Bitmap Index Scan on idx_planner_trap_a_b_c
(cost=0.00..5.76 rows=117 width=0) (actual time=0.019..0.020 rows=0.00
loops=1)
Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
Index Searches: 1
Buffers: shared read=2
Planning:
Buffers: shared hit=51 read=8 dirtied=4
Planning Time: 0.742 ms
Execution Time: 0.094 ms
```

2. Estimate selectivity as Postgres does for single-column values not in
MCVs
=============================================================================
While that significantly improves estimations, we could mirror what
Postgres already does for individual MCVs. Quote from the official
documentation:
> The approach is to use the fact that the value is not in the list,
combined with the knowledge of the frequencies for all of the MCVs:
> That is, add up all the frequencies for the MCVs and subtract them from
one, then divide by the number of other distinct values.

To achieve this, we need to store an ndistinct estimation alongside the
MCVs that can be used for partial or entire column match.

P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
MCV_list_size)

```
postgres=# SELECT sum(m.frequency) FROM pg_statistic_ext join
pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname =
'planner_trap_col_a_col_b_col_c_stat';
sum
---------------------
0.39456666666666645

postgres=# SELECT stxkeys AS k, stxdndistinct AS nd
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
k | nd

-------+---------------------------------------------------
1 2 3 | [...{"attributes": [1, 2, 3], "ndistinct": 8511}]
```

Then (1 - 0.39456666666666645) / (8511 - 100) = 0.000071981; 0.000071981 *
100000 = 7 rows
```
Limit (cost=30.30..30.32 rows=7 width=16) (actual time=0.035..0.036
rows=0.00 loops=1)
Buffers: shared hit=2
-> Sort (cost=30.30..30.32 rows=7 width=16) (actual time=0.033..0.034
rows=0.00 loops=1)
Sort Key: sort_col DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=2
-> Bitmap Heap Scan on planner_trap (cost=4.38..30.20 rows=7
width=16) (actual time=0.028..0.029 rows=0.00 loops=1)
Recheck Cond: ((col_a = 1) AND (col_b = 1) AND (col_c = 1))
Buffers: shared hit=2
-> Bitmap Index Scan on idx_planner_trap_a_b_c
(cost=0.00..4.38 rows=7 width=0) (actual time=0.022..0.022 rows=0.00
loops=1)
Index Cond: ((col_a = 1) AND (col_b = 1) AND (col_c =
1))
Index Searches: 1
Buffers: shared hit=2
Planning Time: 0.192 ms
Execution Time: 0.057 ms
```

I think this is a cheap way to prevent bad estimations. The storage
overhead of adding an ndistinct field is trivial compared to the MCV list
itself, and the O(1) arithmetic during planning adds no measurable
overhead. I look forward to your feedback before drafting a patch.

Best,
Enrique.

Responses

Re: Extended statistics improvement: multi-column MCV missing values at 2026-05-18 22:13:39 from Ilia Evdokimov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Daniel Gustafsson	2026-05-18 16:17:22	Re: Fix typo 586/686 in atomics/arch-x86.h
Previous Message	Daniel Gustafsson	2026-05-18 15:50:07	Re: remove obsolete comment in AtEOXact_Inval