Quick Links

Re: Hash-based MCV matching for large IN-lists

From:	Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To:	Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>
Cc:	Tatsuya Kawata <kawatatatsuya0913(at)gmail(dot)com>, David Geier <geidav(dot)pg(at)gmail(dot)com>, Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Hash-based MCV matching for large IN-lists
Date:	2026-02-25 22:45:44
Message-ID:	9067a807-f130-4631-8df4-076c6d7e43b6@tantorlabs.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I've addressed the review comments mentioned above.

David made a very good observation: for unique columns, where each
iteration effectively returns the same per-element selectivity, there is
no need to iterate at all. In such cases we can reduce the computation
to a closed-form expression, i.e. O(1) instead of running the loop O(N).

I applied this idea to unique columns and cases falling back to
DEFAULT_EQ_SEL. In both cases the loop can be replaced with a
closed-from formula implemented in calculate_combined_selectivity(). The
formula mirros the existing independent/disjoint probability model: ANY
(sel = 1 - (1 - s) ^ length or length * s ), ALL (sel = s ^ length or 1
- length*(1 - s)). It would be good to carefully review that this is
fully equivalent to the current accumulation logic.

I also exprimented with applying the same idea to elements that are not
found in MCV, are not Const, and effectively found in MCV with more than
one count. Those cases can still be accumulated using
accum_scalararray_prob(), but potentially grouped to reduce repeated work.

Overall, the optimization work can be logically split into three parts:

1. Degenerate NULL case O(N) -> O(1) [0]
2. Identical non-NULL per-element selectivity O(N) -> O(1) (can be
split into a separate thread if prederred)
3. MCV matching via hashing O(M*N) -> O(M+N) (current thread)

Feedback on how to best structure or split this work would be appreciated.

About op_is_reserved. It seems we should assign op_is_reserved = true,
because we don't reverse types like eqjoinsel_semi(). If IN-list smaller
than MCV-list we reverse it by fmgr_info(hash_mcv ? hashLeft :
hashRight, &hash_proc). Thanks for this remark.

Thoughts?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment	Content-Type	Size
v6-0003-Use-hash-based-MCV-matching-for-ScalarArrayOpExpr.patch	text/x-patch	17.5 KB
v6-0002-Use-O-1-selectivity-formula-for-eqsel-neqsel-IN-A.patch	text/x-patch	6.4 KB
v6-0001-Reduce-planning-time-for-large-NOT-IN-lists-conta.patch	text/x-patch	4.9 KB

In response to

Re: Hash-based MCV matching for large IN-lists at 2026-02-25 21:39:05 from Zsolt Parragi

Responses

Re: Hash-based MCV matching for large IN-lists at 2026-02-26 08:57:26 from Ilia Evdokimov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Zsolt Parragi	2026-02-25 23:18:04	Re: amcheck: add index-all-keys-match verification for B-Tree
Previous Message	Tom Lane	2026-02-25 22:32:11	Fixing grouping expressions inside subqueries