Quick Links

Re: Extended statistics improvement: multi-column MCV missing values

From:	Enrique Sánchez <enriqueesanchz(at)gmail(dot)com>
To:	Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Extended statistics improvement: multi-column MCV missing values
Date:	2026-05-24 18:34:52
Message-ID:	CAOCkzAmfFgGgZggjx-y+tc3JDQZhgKf6MGT18e15df61rTLH9w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Ilia,

> 1. Cap selectivity with the last MCV item's frequency

I've attached a draft patch. It's split into 4 commits to make it easier to
review. It adds the MCV cap for AND (equality, IN, ANY) clauses. Looking
forward to your feedback.

On 5/19/26 0:13, Ilia Evdokimov wrote:

> Postgres only uses multi-column MCVs when the value we are looking for is
> in the list. If not, it falls back into individual independent statistics
> to estimate selectivity.
> However, a miss in a multi-column MCV list still yields valuable
> information that it currently throws away: we know that the combination's
> frequency is strictly bounded by the frequency of the last (least common)
> item in that MCV list.
>
> LGTM. If the multicolumn MCV statistics exists and the clause combination
> is absent from the MCV-list, we can use the least frequent MCV item as an
> upper bound. BTW, this only applies to AND-clauses.
>
Given that the inclusion-exclusion principle (P(A OR B) = P(A) + P(B) - P(A
AND B)) is used to calculate OR paths, we could use the same cap to improve
the overlap estimation (P(A AND B)).

2. Estimate selectivity as Postgres does for single-column values not in
> MCVs
>
> =============================================================================
> While that significantly improves estimations, we could mirror what
> Postgres already does for individual MCVs. Quote from the official
> documentation:
> > The approach is to use the fact that the value is not in the list,
> combined with the knowledge of the frequencies for all of the MCVs:
> > That is, add up all the frequencies for the MCVs and subtract them from
> one, then divide by the number of other distinct values.
>
> To achieve this, we need to store an ndistinct estimation alongside the
> MCVs that can be used for partial or entire column match.
>
> P(1, 1, 1) = (1 - sum(MCVs)) / (ndistinct(col_a, col_b, col_c) -
> MCV_list_size)
> ...
> I think this is a cheap way to prevent bad estimations. The storage
> overhead of adding an ndistinct field is trivial compared to the MCV list
> itself, and the O(1) arithmetic during planning adds no measurable
> overhead. I look forward to your feedback before drafting a patch.
>
> For this, the ndistinct extended statistics already exist. If both MCV and
> ndistinct statistics are present on the same column set, the formula is
> correct. There are already places in the code that compute ndistinct for
> columns without extended ndistinct statistics (see estimate_num_groups) -
> but it is worth thinking carefully about whether the added complexity is
> justified before going down that path.
>
I think that the implementation would look similar to the attached patch.
The only added complexity is getting the ndistinct estimation. There are 2
ways:
- Rely on the extended ndistinct statistic if it exists
- Calculate the ndistinct(col_a, col_b, col_c) statistic as part of the
MCV. The storage cost is negligible compared to the MCV list and it keeps
the MCV statistic complete, so it doesn't need to check if the ndistinct
statistic exists. That would be a change on the MCV struct, meaning a
change in the mcvlist->type.

I can start working on a proposal for this second part. Let me know what
you think.

Best regards,
Enrique.

Attachment	Content-Type	Size
v1-0004-Extend-multi-column-MCV-cap-to-AND-clauses-inside-OR.patch	text/x-patch	8.6 KB
v1-0001-Cap-selectivity-when-values-are-not-in-multi-column-.patch	text/x-patch	9.1 KB
v1-0002-Add-support-for-IN-ANY-clauses-in-multi-column-MCV-c.patch	text/x-patch	4.9 KB
v1-0003-Extend-multicolumn-MCV-cap-to-partial-IN-ANY-matches.patch	text/x-patch	3.5 KB

In response to

Re: Extended statistics improvement: multi-column MCV missing values at 2026-05-18 22:13:39 from Ilia Evdokimov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Sehrope Sarkuni	2026-05-24 20:04:07	Tighten SCRAM iteration parsing and bound libpq PBKDF2 work
Previous Message	Mats Kindahl	2026-05-24 18:30:17	Re: pg_rewind does not rewind diverging timelines