Re: Hash-based MCV matching for large IN-lists

From: Chengpeng Yan <chengpeng_yan(at)Outlook(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
Cc: David Geier <geidav(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash-based MCV matching for large IN-lists
Date: 2026-01-28 13:08:36
Message-ID: 21FF6A0B-886F-4132-B6A6-0F4E934B4CEE@Outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jan 14, 2026, at 18:19, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> wrote:
> I’ve attached v2 of the patch. It currently uses two fairly large helper functions for the Const and ArrayExpr cases; this is intentional to keep the logic explicit and reviewable, even though these will likely need refactoring or consolidation later.

Thanks for working on this.

I had previously reviewed the v2 patch and wrote up some comments, but
didn’t get a chance to send them before v3 was posted. I haven’t yet had
time to review v3 in detail, so I’m not sure whether the issues below
have already been addressed there. I’m posting my earlier review notes
first and will follow up with comments on v3 once I’ve had a chance to
look at it.

* Treat NULL array elements as zero selectivity for ALL:

In `scalararray_mcv_hash_match_const()` (and similarly
`scalararray_mcv_hash_match_expr()`), NULL array elements are currently
handled by simply continuing the loop (e.g. `if (elem_nulls[i])
continue;`), effectively ignoring them.

This behavior is only correct for ANY/OR semantics. For ALL/AND (`useOr
= false`), a single NULL array element causes the `ScalarArrayOpExpr` to
never return TRUE for strict operators (as assumed by the surrounding
code and comments). In that case, the correct selectivity estimate
should be 0.0, but the current code path can return a non-zero
selectivity.

* Fix cross-type equality argument order in `mcvs_in_equal`:

`mcvs_in_equal()` always invokes the equality function as `(key0,
key1)`. However, `simplehash` provides `key0` from the hash table and
`key1` as the probe key.

In the branch where the hash table is built over IN-list values and
probed with MCVs (the `sslot.nvalues > num_elems` path), this reverses
the operator’s argument order for cross-type equality operators. This
risks incorrect match decisions and may misinterpret Datums compared to
the operator’s declared signature.

* Include non-MCV IN-list constants in non-disjoint selectivity:

In the `sslot.nvalues > num_elems` path of
`scalararray_mcv_hash_match_const()` and
`scalararray_mcv_hash_match_expr()`, non-MCV constant elements currently
only contribute via `disjoint_sel`.

For cases where disjoint-probability estimation is not used (e.g. ALL,
`<> ANY`, or when `disjoint_sel` is out of range), the code leaves the
selectivity based solely on MCV matches. This effectively treats non-MCV
constants as having probability 1.0, leading to overestimation of
selectivity.

* Avoid double-negating inequality estimates for non-Const elements:

In the `scalararray_mcv_hash_match_expr()` `sslot.nvalues > num_elems`
branch, non-Const elements are handled via

`var_eq_non_const(..., negate = isInequality)`

and then later adjusted again with

`if (isInequality)
s1 = 1.0 - s1 - nullfrac;`

This results in a double negation for inequality cases, effectively
turning the estimate back into an equality selectivity.

--
Best regards,
Chengpeng Yan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2026-01-28 13:19:08 Re: Changing shared_buffers without restart
Previous Message Robert Haas 2026-01-28 13:03:02 Re: pg_waldump: support decoding of WAL inside tarfile