Quick Links

Re: Extended statistics improvement: multi-column MCV missing values

From:	Enrique Sánchez <enriqueesanchz(at)gmail(dot)com>
To:	Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>
Cc:	"pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Subject:	Re: Extended statistics improvement: multi-column MCV missing values
Date:	2026-06-04 01:40:51
Message-ID:	CAOCkzA=w6UJfMB3Td-8fu2L6FMsTnymVFD6suzJAFiT+2NJGfQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi, thank you for having a look.

On Jun 01, 2026 at 08:32, Chengpeng Yan (<chengpeng_yan(at)outlook(dot)com>) wrote:

> I think the important first boundary is the clause shape, not the choice
> between the cap and the ndistinct-based average estimate. The strongest
> case is a top-level AND condition with equality clauses for all columns
> in the multivariate MCV statistic, where the queried value combination
> is absent from the MCV list. In that case the miss is useful negative
> information.
>
> For that same case, the least-MCV-frequency cap is a useful bound and
> fallback: the combination should not be estimated as more frequent than
> the least frequent item kept in the MCV list. If matching ndistinct
> statistics exist for the same column set, the ndistinct-based average
> estimate also seems valid, as Ilia noted earlier in this thread. I do
> not think the first patch needs to make `CREATE STATISTICS ... (mcv)`
> store an ndistinct value. It can use matching ndistinct statistics when
> they exist, while changing what an MCV statistic contains seems like a
> separate decision.

Makes sense, I agree that changing what an MCV statistic contains is a
separate decision.

> I am also not sure IN/ANY should be included in the first version. For
> that case we first have to define the distinct semantic full groups
> represented by the clause, not just look at raw array elements. For
> example, `a = 0 AND b IN (99, 99)` represents one full group, not two,
>
Regarding array deduplication, i.e. `b in (99, 99)` Postgres does not check
if there are repeated elements right now, and uses them as different
elements to estimate; e.g.

```
postgres=# CREATE TABLE psql_hackers(a int, b int);
CREATE TABLE
postgres=# INSERT INTO psql_hackers SELECT a, b FROM generate_series(1,
100) a, generate_series(1, 100) b;
INSERT 0 9900
postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=100 width=8)
Filter: (a = 2)
(2 rows)

postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2, 2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=200 width=8)
Filter: (a = ANY ('{2,2}'::integer[]))
(2 rows)
```
I think we should be consistent with what Postgres already does now. It
seems like a potential future patch.

and NULL or empty arrays need similar care. That issue exists whether
> the result is used as a cap or as an ndistinct-based average estimate.

NULLs and empty arrays can appear in a MCV entry so they should be treated
the same.

So, for the basic functionality we need to check that:
- All MCV dimensions are used in the clause
- All clauses are one of:
- Equality (=)
- IS NULL
- Bool operation: (= TRUE), (= FALSE)

We also need to check that there are no expressions in the clause, as they
could refer to multiple values e.g. `mod(a, 7) = 0`

> Similarly, I would leave OR for a follow-up patch. As discussed, that
> path would apply the cap while estimating the `P(A AND B)` part of an OR
> estimate. That is a different estimation problem from the basic
> top-level AND miss case, and probably needs its own design discussion.

Yes, I've been investigating further and it makes more sense to have a
follow-up discussion later.

So my preference would be to start with the smallest case:
> 1. Full-dimensional top-level AND equality miss, using the
> ndistinct-based average estimate when matching ndistinct statistics
> exist, with the least-MCV frequency as an upper bound; otherwise
> using the cap alone.
> 2. Later patches for IN/ANY and OR handling.
>
> That would make the first patch easier to reason about and review, while
> still leaving room for the stronger cases later.

I've attached the v2 patch, covering only the full-dimensional top-level
AND equality miss.
Right now it only caps to the least-MCV frequency, but I will send a
follow-up patch with the ndistinct part asap.

Best regards,
Enrique.

Attachment	Content-Type	Size
v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patch	text/x-patch	10.4 KB

In response to

Re: Extended statistics improvement: multi-column MCV missing values at 2026-06-01 06:32:20 from Chengpeng Yan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Fujii Masao	2026-06-04 02:05:11	Re: Fix column privileges for pg_subscription.subwalrcvtimeout
Previous Message	jian he	2026-06-04 01:38:30	Re: CREATE TABLE LIKE INCLUDING TRIGGERS