Quick Links

Hash-based MCV matching for large IN-lists

From:	Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Geier <geidav(dot)pg(at)gmail(dot)com>
Subject:	Hash-based MCV matching for large IN-lists
Date:	2025-12-29 20:35:43
Message-ID:	7db341e0-fbc6-4ec5-922c-11fdafe7be12@tantorlabs.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi hackers,

When estimating selectivity for ScalarArrayOpExpr (IN, ANY, ALL) and MCV
statistics are available for the column, the planner currently matches
IN-list elements against the MCV array using nested loop. For large
IN-list and large MCV arrays this results in O(N*M) behavior, which can
become unnecessarily expensive during planning.

Thanks to David for pointing out this case [0]

This patch introduces a hash-based matching path, analogous to what is
already done for MCV matching in join selectivity estimation (057012b
commit). Instead of linearly scanning the MCV array for each IN-list
element, we build a hash table and probe it to identify matches.

The hash table is built over the MCV values, not over the IN-list. The
IN-list may contain NULLs, non-Const expressions, and duplicate values,
whereas the MCV list is guaranteed to contain distinct, non-NULL values
and represents the statistically meaningful domain we are matching
against. Hashing the MCVs therefore avoids duplicate work and directly
supports selectivity estimation.

For each IN-list element, if a matching MCV is found, we add the
corresponding MCV frequency to the selectivity estimate. If no match is
found, the remaining selectivity is estimated in the same way as the
existing non-MCV path (similar to var_eq_const when the constant is not
present in the MCV list).

The hash-based path is enabled only when both a sufficiently large
IN-list and an MCV list are present, and suitable hash functions exist
for the equality operator. The threshold is currently the same as the
one used for join MCV hashing, since the underlying algorithmic
tradeoffs are similar.

Example:

CREATE TABLE t (x int);
INSERT INTO t SELECT x % 10000 FROM generate_series(1, 3000000) x;
ALTER TABLE t ALTER COLUMN x SET STATISTICS 10000;
ANALYZE t;

Before patch:
EXPLAIN (SUMMARY) SELECT * FROM t WHERE x IN (1,2,...,2000);
Seq Scan on t (cost=5.00..58280.00 rows=600000 width=4)
Filter: (x = ANY ('{1,2,...,2000}'::integer[]))
* Planning Time: 57.137 ms*
(3 rows)

After patch:
EXPLAIN (SUMMARY) SELECT * FROM t WHERE x IN (1,2,...,2000);
Seq Scan on t (cost=5.00..58280.00 rows=600000 width=4)
Filter: (x = ANY ('{1,2,...,2000}'::integer[]))
* Planning Time: 0.558 ms*
(3 rows)

Comments, suggestions, and alternative approaches are welcome!

[0]:
https://www.postgresql.org/message-id/b6316b99-565b-4c89-aa08-6aea51f54526%40gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

Attachment	Content-Type	Size
v1-0001-Use-hash-based-matching-for-MCVs-in-ScalarArrayOp.patch	text/x-patch	15.6 KB

Responses

Re: Hash-based MCV matching for large IN-lists at 2026-01-05 08:54:17 from David Geier

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kirill Reshke	2025-12-29 20:40:27	Re: TAB completion for ALTER TABLE ... ALTER CONSTRAINT ... ENFORCED
Previous Message	Matthias van de Meent	2025-12-29 20:04:53	Re: lsyscache: free IndexAmRoutine objects returned by GetIndexAmRoutineByAmId()