Re: Hash-based MCV matching for large IN-lists

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash-based MCV matching for large IN-lists
Date: 2026-01-27 15:43:00
Message-ID: cf175ba7-a35b-4ebb-8764-254d5a2ace19@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 19.01.2026 17:01, David Geier wrote:
> Does that mean that we get a different estimation result, depending on
> if the IN list is smaller or not? I think we should avoid that because
> estimation quality might flip for the user unexpectedly.

I think you're right.

To address this, I changed the hash-table entry to track an additional
'count' filed, representing how many times a particular value appears on
the hashed side. When inserting into the hash table, if the value is
already present, I increment 'count', otherwise, I create a new entry
with count = 1

>>> The code in master currently calls an operator-specific selectivity
>>> estimation function. For equality this is typically eqsel() but the
>>> function can be specified during CREATE OPERATOR.
>>>
>>> Can be safely special-case the behavior of eqsel() for all possible
>>> operators for the ScalarArrayOpExpr case?
>>
>> Unfortunately there is no safe way to make this optimization generic for
>> arbitrary restrict functions, because a custom RESTRICT function does
>> not have to use MCVs at all. IMO, in practice the vast majority of
>> ScalarArrayOpExpr uses with = or <> rely on the built-in equality
>> operators whose selectivity is computed by eqsel()/neqsel(), so I
>> limited this optimization to those cases.
> How did you do that? I cannot find the code that checks for that.

In scalararraysel(), before attempting the hash-based path, we determine
whether the operator behaves like equality or inequality based on its
selectivity function:

if (oprsel == F_EQSEL || oprsel == F_EQJOINSEL)
    isEquality = true;
else if (oprsel == F_NEQSEL || oprsel == F_NEQJOINSEL)
    isInequality = true;

Then the hash-based MCV matching is only attempted under:

if ((isEquality || isInequality) && !is_join_clause)

So effectively this restricts the optimization to operators whose
selectivity is computed by eqsel()/neqsel() on restriction clauses. Join
clauses (which would use eqjoinsel/neqjoinsel) are excluded via
!is_join_clause

> For the MCVs, can't we reuse some code from the eqjoinsel() optimization
> we did? The entry and context structs look similar enough to only need one.

I considered reusing pieces from the eqjoinsel() , but in practice it
turned out to be difficult to share code cleanly. Also, when looking at
this file more broadly, we already have multiple places that reimplement
similar pattern.

> Making the code more compact would ease reviewing a lot.

Agreed — I also think making the code more compact would significantly
ease reviewing. I’ve found a way to unify the Const-array and ArrayExpr
cases: in the ArrayExpr path, we can first construct the same arrays as
in the Const-array case (elem_values, elem_nulls), and additionally
build a boolean array elem_const[] indicating whether each element is a
Const. Then the hash-based MCV matching function can:

- Ignore NULL and non-Const elements when building and probing the hash
table.
- Count how many non-Const elements are present.
- After MCV and non-MCV constant handling, account for non-Const
elements separately using var_eq_non_const() and fold their
probabilities into the same ANY/ALL accumulation logic.

I've attached v3 patch with it.

To validate the same estimation results, I temporarily kept both
implementations (hash-based and nested-loop) and compared their
resulting selectivity values. Whenever they differed, I logged it. I ran
regression tests and some local workload testing with this check
enabled, and did not observe any mismatches. I attached patch with this
logging.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment Content-Type Size
hash_mcv_in_logging.patch text/x-patch 1.5 KB
v3-0001-Use-hash-based-MCV-matching-for-ScalarArrayOpExpr.patch text/x-patch 21.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Previous Message Matheus Alcantara 2026-01-27 15:42:10 Re: Enable partitionwise join for partition keys wrapped by RelabelType