Re: Hash-based MCV matching for large IN-lists

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash-based MCV matching for large IN-lists
Date: 2026-01-19 14:01:07
Message-ID: 6c068ed3-0ab1-4743-862a-343134f2bfa4@gmail.com
Lists: pgsql-hackers

On 14.01.2026 11:19, Ilia Evdokimov wrote:
> After thinking more about this, I realized that this is actually a
> better match for how selectivity is currently modeled. Given these
> comments in master:
>
>          * If we were being really tense we would try to confirm that the
>          * elements are all distinct, but that would be expensive and it
>          * doesn't seem to be worth the cycles; it would amount to penalizing
>          * well-written queries in favor of poorly-written ones. However, we
>          * do protect ourselves a little bit by checking whether the
>          * disjointness assumption leads to an impossible (out of range)
>          * probability; if so, we fall back to the normal calculation.
>
> when the hash table is built on the IN-list, duplicate IN-list values
> are automatically eliminated during insertion, so we no longer risk
> summing the same MCV frequency multiple times. This makes the disjoint-
> probability estimate more robust and in practice slightly more accurate.
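
If I understand the dedup argument correctly, it can be modeled like
this (a standalone sketch in Python with made-up names, not the patch's
code): because the IN-list constants go through a hash table, duplicate
values collapse on insertion, so each matching MCV frequency is summed
exactly once.

```python
def mcv_match_freq(mcv, in_list):
    # mcv: list of (value, frequency) pairs from the statistics.
    # in_list: constant elements of the IN-list, possibly with duplicates.
    constants = set(in_list)  # duplicate IN-list values vanish here
    return sum(freq for value, freq in mcv if value in constants)

mcv = [(1, 0.30), (2, 0.20), (3, 0.10)]
# x IN (1, 1, 1, 2): without dedup, 1's frequency would be counted 3 times;
# with the hash table it contributes 0.30 once, giving 0.50 total.
assert abs(mcv_match_freq(mcv, [1, 1, 1, 2]) - 0.50) < 1e-9
```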

Does that mean that we get a different estimation result depending on
whether the IN list is smaller than the MCV list? I think we should
avoid that, because estimation quality might flip unexpectedly for the
user.

> One thing I initially missed is that there are actually three different
> places where ScalarArrayOpExpr is handled - the Const array case, the
> ArrayExpr case, and the rest - and the Const and ArrayExpr cases
> require different implementations of the same idea. In the Const case
> we can directly hash and probe the Datum values, while in the ArrayExpr
> case we must work on Node * elements, separating constant from
> non-constant entries and hashing only the constants. The current v2
> therefore applies the same MCV-hash optimization in both branches, but
> uses two tailored code paths that preserve the existing semantics of
> how non-Const elements are handled by var_eq_non_const().
>
> If the MCV list is smaller than the IN-list, the behavior is the same as
> in v1 of the patch. If the IN-list is smaller, we instead build a hash
> table over the distinct constant elements of the IN-list and then:
> - Scan the MCV list and sum the frequencies of those MCVs that appear in
> the IN-list;
> - Count how many distinct non-null constant IN-list elements are not
> present in the MCV list;

Is this to make sure we keep getting the same estimation result if the
IN list is smaller and contains duplicates?

> - Estimate the probability of each such non-MCV value using the
> remaining frequency mass;
> - Handle non-constant IN-list elements separately using
> var_eq_non_const(), exactly as in the existing implementation.

OK
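
For my own understanding, the steps above read roughly like this as a
standalone sketch (Python for brevity; all names are illustrative, this
is not the patch's code). The non-MCV constants each get a share of the
leftover frequency mass, analogous to what var_eq_const() does for a
single non-MCV value.

```python
def in_list_selectivity(mcv, in_consts, ndistinct, nullfrac=0.0):
    # mcv: list of (value, frequency) pairs; in_consts: constant
    # IN-list elements; ndistinct: estimated distinct values in the column.
    wanted = set(in_consts)                        # dedup the constants
    mcv_freq = dict(mcv)
    # Step 1: sum the frequencies of MCVs that appear in the IN-list.
    matched = sum(mcv_freq[v] for v in wanted if v in mcv_freq)
    # Step 2: count distinct constants not present in the MCV list.
    nonmcv = sum(1 for v in wanted if v not in mcv_freq)
    # Step 3: estimate each non-MCV value from the remaining mass.
    sumcommon = sum(mcv_freq.values())
    otherdistinct = max(ndistinct - len(mcv_freq), 1)
    per_value = (1.0 - sumcommon - nullfrac) / otherdistinct
    selec = matched + nonmcv * per_value
    # Clamp to a sane probability, as CLAMP_PROBABILITY would.
    return min(max(selec, 0.0), 1.0)

# x IN ('a', 'c', 'c') with MCVs a=0.4, b=0.2 and ndistinct=10:
# 0.4 (matched) + 1 * (1 - 0.6) / 8 = 0.45
assert abs(in_list_selectivity([("a", 0.4), ("b", 0.2)],
                               ["a", "c", "c"], 10) - 0.45) < 1e-9
```

Non-constant IN-list elements are outside this sketch; they would still
go through var_eq_non_const() separately, as you describe.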

>> The code in master currently calls an operator-specific selectivity
>> estimation function. For equality this is typically eqsel() but the
>> function can be specified during CREATE OPERATOR.
>>
>> Can we safely special-case the behavior of eqsel() for all possible
>> operators in the ScalarArrayOpExpr case?
>
>
> Unfortunately there is no safe way to make this optimization generic for
> arbitrary restrict functions, because a custom RESTRICT function does
> not have to use MCVs at all. IMO, in practice the vast majority of
> ScalarArrayOpExpr uses with = or <> rely on the built-in equality
> operators whose selectivity is computed by eqsel()/neqsel(), so I
> limited this optimization to those cases.

How did you do that? I cannot find the code that checks for that.
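
What I would expect is a guard of roughly this shape (a standalone mock
with made-up names; in PostgreSQL terms this would presumably amount to
comparing get_oprrest(operator) against F_EQSEL/F_NEQSEL before taking
the MCV-hash path):

```python
# Mock operator catalog mapping operator names to their restriction
# selectivity estimators; stands in for pg_operator.oprrest lookups.
EQSEL, NEQSEL, SCALARLTSEL = "eqsel", "neqsel", "scalarltsel"
OPR_REST = {"=": EQSEL, "<>": NEQSEL, "<": SCALARLTSEL}

def can_use_mcv_hash(operator):
    # Only operators whose restriction estimator is the built-in
    # eqsel/neqsel are known to consume MCVs the way the fast path
    # assumes; custom RESTRICT functions must fall back.
    return OPR_REST.get(operator) in (EQSEL, NEQSEL)

assert can_use_mcv_hash("=") and can_use_mcv_hash("<>")
assert not can_use_mcv_hash("<")
```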

> I’ve attached v2 of the patch. It currently uses two fairly large helper
> functions for the Const and ArrayExpr cases; this is intentional to keep
> the logic explicit and reviewable, even though these will likely need
> refactoring or consolidation later.

Beyond that, it seems like you could also combine or reuse a good deal
of code between building the hash table on the IN-list and on the MCV
list.

For the MCVs, can't we reuse some code from the eqjoinsel() optimization
we did? The entry and context structs look similar enough that a single
definition should suffice.

Making the code more compact would ease reviewing a lot.

--
David Geier
