Re: Hash-based MCV matching for large IN-lists

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash-based MCV matching for large IN-lists
Date: 2026-01-14 10:19:36
Message-ID: 988e3168-6096-488a-bb42-787e1e8c21a4@tantorlabs.com
Lists: pgsql-hackers

Hi David!

Thanks for the feedback.

On 05.01.2026 11:54, David Geier wrote:
>> This patch introduces a hash-based matching path, analogous to what is
>> already done for MCV matching in join selectivity estimation (057012b
>> commit). Instead of linearly scanning the MCV array for each IN-list
>> element, we build a hash table and probe it to identify matches.
>>
>> The hash table is built over the MCV values, not over the IN-list. The
>> IN-list may contain NULLs, non-Const expressions, and duplicate values,
>> whereas the MCV list is guaranteed to contain distinct, non-NULL values
>> and represents the statistically meaningful domain we are matching
>> against. Hashing the MCVs therefore avoids duplicate work and directly
>> supports selectivity estimation.
> The downside of doing it this way is that we always pay the price of
> building a possibly big hash table if the column has a lot of MCVs, even
> for small IN lists. Why can't we build the hash table always on the
> smaller list, like we do already in the join selectivity estimation?
>
> For NULL we can add a flag to the hash entry, non-Const expressions must
> anyways be evaluated and duplicate values will be discarded during insert.

After thinking more about this, I realized that building the hash table on the
smaller list is actually a better match for how selectivity is currently
modeled. Consider this comment in master:

         * If we were being really tense we would try to confirm that the
         * elements are all distinct, but that would be expensive and it
         * doesn't seem to be worth the cycles; it would amount to penalizing
         * well-written queries in favor of poorly-written ones.  However, we
         * do protect ourselves a little bit by checking whether the
         * disjointness assumption leads to an impossible (out of range)
         * probability; if so, we fall back to the normal calculation.

When the hash table is built on the IN-list, duplicate IN-list values are
automatically eliminated during insertion, so we no longer risk summing the
same MCV frequency multiple times. This makes the disjoint-probability
estimate more robust and, in practice, slightly more accurate.
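For example (with made-up numbers): for x IN (2, 2) where the value 2 has an
MCV frequency of 0.20, summing per IN-list element would give 0.40, while
building the hash on the IN-list drops the duplicate at insertion time and
correctly yields 0.20.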

One thing I initially missed is that there are actually three different
places where ScalarArrayOpExpr is handled - the Const array case, the
ArrayExpr case, and everything else - and the Const and ArrayExpr cases
require different implementations of the same idea. In the Const case we can
hash and probe the Datum values directly, while in the ArrayExpr case we must
work on Node * elements, separating constant from non-constant entries and
hashing only the constants. The current v2 therefore applies the same
MCV-hash optimization in both branches, using two tailored code paths that
preserve the existing semantics of how non-Const elements are handled by
var_eq_non_const().
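
To make the difference concrete, here is a rough illustration of the
ArrayExpr-side classification (not lifted from the patch; the hash-insert
helper and the list variable are placeholder names, only the node-handling
calls are the usual backend ones):

    ListCell   *l;

    foreach(l, arrayexpr->elements)
    {
        Node       *elem = (Node *) lfirst(l);

        if (IsA(elem, Const) && !((Const *) elem)->constisnull)
        {
            /* constant entry: hash only its Datum value */
            inlist_insert_datum(hashtab, ((Const *) elem)->constvalue);
        }
        else
        {
            /* anything else keeps the existing per-element handling */
            other_elems = lappend(other_elems, elem);
        }
    }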

If the MCV list is smaller than the IN-list, the behavior is the same as
in v1 of the patch. If the IN-list is smaller, we instead build a hash
table over the distinct constant elements of the IN-list and then (a rough
standalone sketch of this path follows the list):
- Scan the MCV list and sum the frequencies of those MCVs that appear in
the IN-list;
- Count how many distinct non-NULL constant IN-list elements are not present
in the MCV list;
- Estimate the probability of each such non-MCV value using the
remaining frequency mass;
- Handle non-constant IN-list elements separately using
var_eq_non_const(), exactly as in the existing implementation.
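
To make the arithmetic above easy to follow, here is a standalone toy sketch
of that path. It is not the patch itself: the hash helpers, variable names and
the hard-coded statistics are invented for illustration, and the real code of
course works on Datum values rather than int64:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 64             /* toy fixed-size open-addressing table */

typedef struct
{
    bool        used[NBUCKETS];
    int64_t     keys[NBUCKETS];
} InListHash;

/* Insert a constant; returns false if it was already present (a duplicate). */
static bool
inlist_insert(InListHash *h, int64_t key)
{
    uint64_t    i = ((uint64_t) key * 2654435761u) % NBUCKETS;

    while (h->used[i])
    {
        if (h->keys[i] == key)
            return false;
        i = (i + 1) % NBUCKETS;
    }
    h->used[i] = true;
    h->keys[i] = key;
    return true;
}

/* Probe for an MCV value. */
static bool
inlist_probe(const InListHash *h, int64_t key)
{
    uint64_t    i = ((uint64_t) key * 2654435761u) % NBUCKETS;

    while (h->used[i])
    {
        if (h->keys[i] == key)
            return true;
        i = (i + 1) % NBUCKETS;
    }
    return false;
}

int
main(void)
{
    /* toy statistics: 4 MCVs with their frequencies, plus the IN-list */
    int64_t     mcv_vals[4] = {1, 2, 3, 4};
    double      mcv_freqs[4] = {0.30, 0.20, 0.10, 0.05};
    int64_t     inlist[4] = {2, 2, 7, 9};       /* note the duplicate 2 */
    double      nullfrac = 0.0;
    double      ndistinct = 100.0;

    InListHash  h = {{false}};
    int         nconsts = 0;    /* distinct non-NULL constants in the IN-list */
    int         nmatched = 0;   /* how many of them turn out to be MCVs */
    double      sumcommon = 0.0;
    double      selec = 0.0;

    /* 1. build the hash table over the distinct IN-list constants */
    for (int i = 0; i < 4; i++)
        if (inlist_insert(&h, inlist[i]))
            nconsts++;

    /* 2. one pass over the MCVs: sum frequencies of those found in the hash */
    for (int i = 0; i < 4; i++)
    {
        sumcommon += mcv_freqs[i];
        if (inlist_probe(&h, mcv_vals[i]))
        {
            selec += mcv_freqs[i];
            nmatched++;
        }
    }

    /* 3. spread the leftover frequency mass over the non-MCV constants */
    if (ndistinct > 4)
        selec += (nconsts - nmatched) *
            (1.0 - sumcommon - nullfrac) / (ndistinct - 4);

    printf("estimated selectivity: %g\n", selec);   /* 0.20 + 2 * 0.35 / 96 */
    return 0;
}

The single pass over the MCVs is what replaces the per-element linear scans
that master does today.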

>> For each IN-list element, if a matching MCV is found, we add the
>> corresponding MCV frequency to the selectivity estimate. If no match is
>> found, the remaining selectivity is estimated in the same way as the
>> existing non-MCV path (similar to var_eq_const when the constant is not
>> present in the MCV list).
>>
> The code in master currently calls an operator-specific selectivity
> estimation function. For equality this is typically eqsel() but the
> function can be specified during CREATE OPERATOR.
>
> Can be safely special-case the behavior of eqsel() for all possible
> operators for the ScalarArrayOpExpr case?

Unfortunately there is no safe way to make this optimization generic for
arbitrary restrict functions, because a custom RESTRICT function does
not have to use MCVs at all. IMO, in practice the vast majority of
ScalarArrayOpExpr uses with = or <> rely on the built-in equality
operators whose selectivity is computed by eqsel()/neqsel(), so I
limited this optimization to those cases.
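
The check I have in mind is essentially on the operator's restriction
estimator, along these lines (a sketch, not necessarily the exact shape used
in v2; get_oprrest() and the F_EQSEL / F_NEQSEL function OIDs come from
lsyscache.h and fmgroids.h):

    RegProcedure oprrest = get_oprrest(operator);

    if (oprrest == F_EQSEL || oprrest == F_NEQSEL)
    {
        /* built-in equality/inequality estimator: hash-based MCV matching */
    }
    else
    {
        /* custom RESTRICT function: fall back to the existing estimation */
    }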

I’ve attached v2 of the patch. It currently uses two fairly large helper
functions for the Const and ArrayExpr cases; this is intentional to keep
the logic explicit and reviewable, even though these will likely need
refactoring or consolidation later.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment Content-Type Size
v2-0001-Use-hash-based-matching-for-MCVs-in-ScalarArrayOp.patch text/x-patch 37.2 KB
