Re: Extended statistics improvement: multi-column MCV missing values

From: Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>
To: Enrique Sánchez <enriqueesanchz(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Subject: Re: Extended statistics improvement: multi-column MCV missing values
Date: 2026-06-01 06:32:20
Message-ID: 37F4735D-23FB-422F-95B4-681C39601A57@outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> On May 19, 2026, at 00:09, Enrique Sánchez <enriqueesanchz(at)gmail(dot)com> wrote:
>
> Hi,
>
> Postgres only uses multi-column MCVs when the value we are looking for is in the list. If not, it falls back into individual independent statistics to estimate selectivity.
> However, a miss in a multi-column MCV list still yields valuable information that it currently throws away: we know that the combination's frequency is strictly bounded by the frequency of the last (least common) item in that MCV list.
>
> 1. Cap selectivity with the last MCV item's frequency
> =====================================================
> Applying the last MCV cap here, Postgres estimates 117 rows: 0.001167 * 100000 = 117
>
>
> 2. Estimate selectivity as Postgres does for single-column values not in MCVs
> =============================================================================
> While that significantly improves estimations, we could mirror what Postgres already does for individual MCVs. Quote from the official documentation:
> > The approach is to use the fact that the value is not in the list, combined with the knowledge of the frequencies for all of the MCVs:
> > That is, add up all the frequencies for the MCVs and subtract them from one, then divide by the number of other distinct values.
>
> To achieve this, we need to store an ndistinct estimation alongside the MCVs that can be used for partial or entire column match.
>
> P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) - MCV_list_size)
>
> I think this is a cheap way to prevent bad estimations. The storage overhead of adding an ndistinct field is trivial compared to the MCV list itself, and the O(1) arithmetic during planning adds no measurable overhead. I look forward to your feedback before drafting a patch.

Thanks for working on this.

I think the general idea is a good one: an MCV miss still carries useful
information, and it seems worth using that information instead of
falling back entirely to independent estimates.

I have not done a full code review of the patch yet, so this is mostly
feedback on the design and scope.

I think the important first boundary is the clause shape, not the choice
between the cap and the ndistinct-based average estimate. The strongest
case is a top-level AND condition with equality clauses for all columns
in the multivariate MCV statistic, where the queried value combination
is absent from the MCV list. In that case the miss is useful negative
information.

For that same case, the least-MCV-frequency cap is a useful bound and
fallback: the combination should not be estimated as more frequent than
the least frequent item kept in the MCV list. If matching ndistinct
statistics exist for the same column set, the ndistinct-based average
estimate also seems valid, as Ilia noted earlier in this thread. I do
not think the first patch needs to make `CREATE STATISTICS ... (mcv)`
store an ndistinct value. It can use matching ndistinct statistics when
they exist, while changing what an MCV statistic contains seems like a
separate decision.

I am also not sure IN/ANY should be included in the first version. For
that case we first have to define the distinct semantic full groups
represented by the clause, not just look at raw array elements. For
example, `a = 0 AND b IN (99, 99)` represents one full group, not two,
and NULL or empty arrays need similar care. That issue exists whether
the result is used as a cap or as an ndistinct-based average estimate.

Similarly, I would leave OR for a follow-up patch. As discussed, that
path would apply the cap while estimating the `P(A AND B)` part of an OR
estimate. That is a different estimation problem from the basic
top-level AND miss case, and probably needs its own design discussion.

So my preference would be to start with the smallest case:
1. Full-dimensional top-level AND equality miss, using the
ndistinct-based average estimate when matching ndistinct statistics
exist, with the least-MCV frequency as an upper bound; otherwise
using the cap alone.
2. Later patches for IN/ANY and OR handling.

That would make the first patch easier to reason about and review, while
still leaving room for the stronger cases later.

Those are just my initial thoughts, so I may be missing some details in
the patch.

--
Best regards,
Chengpeng Yan

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Kyotaro Horiguchi 2026-06-01 06:30:58 Re: pg_rewind does not rewind diverging timelines