| From: | Enrique Sánchez <enriqueesanchz(at)gmail(dot)com> |
|---|---|
| To: | Chengpeng Yan <chengpeng_yan(at)outlook(dot)com> |
| Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
| Subject: | Re: Extended statistics improvement: multi-column MCV missing values |
| Date: | 2026-06-04 01:40:51 |
| Message-ID: | CAOCkzA=w6UJfMB3Td-8fu2L6FMsTnymVFD6suzJAFiT+2NJGfQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi, thank you for having a look.
On Jun 01, 2026 at 08:32, Chengpeng Yan (<chengpeng_yan(at)outlook(dot)com>) wrote:
> I think the important first boundary is the clause shape, not the choice
> between the cap and the ndistinct-based average estimate. The strongest
> case is a top-level AND condition with equality clauses for all columns
> in the multivariate MCV statistic, where the queried value combination
> is absent from the MCV list. In that case the miss is useful negative
> information.
>
> For that same case, the least-MCV-frequency cap is a useful bound and
> fallback: the combination should not be estimated as more frequent than
> the least frequent item kept in the MCV list. If matching ndistinct
> statistics exist for the same column set, the ndistinct-based average
> estimate also seems valid, as Ilia noted earlier in this thread. I do
> not think the first patch needs to make `CREATE STATISTICS ... (mcv)`
> store an ndistinct value. It can use matching ndistinct statistics when
> they exist, while changing what an MCV statistic contains seems like a
> separate decision.
Makes sense, I agree that changing what an MCV statistic contains is a
separate decision.
> I am also not sure IN/ANY should be included in the first version. For
> that case we first have to define the distinct semantic full groups
> represented by the clause, not just look at raw array elements. For
> example, `a = 0 AND b IN (99, 99)` represents one full group, not two,
>
Regarding array deduplication, i.e. `b in (99, 99)` Postgres does not check
if there are repeated elements right now, and uses them as different
elements to estimate; e.g.
```
postgres=# CREATE TABLE psql_hackers(a int, b int);
CREATE TABLE
postgres=# INSERT INTO psql_hackers SELECT a, b FROM generate_series(1,
100) a, generate_series(1, 100) b;
INSERT 0 9900
postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=100 width=8)
Filter: (a = 2)
(2 rows)
postgres=# EXPLAIN SELECT * FROM psql_hackers WHERE a IN (2, 2);
QUERY PLAN
----------------------------------------------------------------
Seq Scan on psql_hackers (cost=0.00..167.75 rows=200 width=8)
Filter: (a = ANY ('{2,2}'::integer[]))
(2 rows)
```
I think we should be consistent with what Postgres already does now. It
seems like a potential future patch.
and NULL or empty arrays need similar care. That issue exists whether
> the result is used as a cap or as an ndistinct-based average estimate.
NULLs and empty arrays can appear in a MCV entry so they should be treated
the same.
So, for the basic functionality we need to check that:
- All MCV dimensions are used in the clause
- All clauses are one of:
- Equality (=)
- IS NULL
- Bool operation: (= TRUE), (= FALSE)
We also need to check that there are no expressions in the clause, as they
could refer to multiple values e.g. `mod(a, 7) = 0`
> Similarly, I would leave OR for a follow-up patch. As discussed, that
> path would apply the cap while estimating the `P(A AND B)` part of an OR
> estimate. That is a different estimation problem from the basic
> top-level AND miss case, and probably needs its own design discussion.
Yes, I've been investigating further and it makes more sense to have a
follow-up discussion later.
So my preference would be to start with the smallest case:
> 1. Full-dimensional top-level AND equality miss, using the
> ndistinct-based average estimate when matching ndistinct statistics
> exist, with the least-MCV frequency as an upper bound; otherwise
> using the cap alone.
> 2. Later patches for IN/ANY and OR handling.
>
> That would make the first patch easier to reason about and review, while
> still leaving room for the stronger cases later.
I've attached the v2 patch, covering only the full-dimensional top-level
AND equality miss.
Right now it only caps to the least-MCV frequency, but I will send a
follow-up patch with the ndistinct part asap.
Best regards,
Enrique.
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Cap-selectivity-when-values-are-not-in-multi-colu.patch | text/x-patch | 10.4 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Fujii Masao | 2026-06-04 02:05:11 | Re: Fix column privileges for pg_subscription.subwalrcvtimeout |
| Previous Message | jian he | 2026-06-04 01:38:30 | Re: CREATE TABLE LIKE INCLUDING TRIGGERS |