| From: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
|---|---|
| To: | David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Hash-based MCV matching for large IN-lists |
| Date: | 2026-01-14 10:19:36 |
| Message-ID: | 988e3168-6096-488a-bb42-787e1e8c21a4@tantorlabs.com |
| Lists: | pgsql-hackers |
Hi David!
Thanks for the feedback.
On 05.01.2026 11:54, David Geier wrote:
>> This patch introduces a hash-based matching path, analogous to what is
>> already done for MCV matching in join selectivity estimation (057012b
>> commit). Instead of linearly scanning the MCV array for each IN-list
>> element, we build a hash table and probe it to identify matches.
>>
>> The hash table is built over the MCV values, not over the IN-list. The
>> IN-list may contain NULLs, non-Const expressions, and duplicate values,
>> whereas the MCV list is guaranteed to contain distinct, non-NULL values
>> and represents the statistically meaningful domain we are matching
>> against. Hashing the MCVs therefore avoids duplicate work and directly
>> supports selectivity estimation.
> The downside of doing it this way is that we always pay the price of
> building a possibly big hash table if the column has a lot of MCVs, even
> for small IN lists. Why can't we build the hash table always on the
> smaller list, like we do already in the join selectivity estimation?
>
> For NULL we can add a flag to the hash entry, non-Const expressions must
> anyways be evaluated and duplicate values will be discarded during insert.
After thinking more about this, I realized that building the hash table on the smaller side is actually a better match for how selectivity is currently modeled. Note this comment in master:
* If we were being really tense we would try to confirm that the
* elements are all distinct, but that would be expensive and it
* doesn't seem to be worth the cycles; it would amount to penalizing
* well-written queries in favor of poorly-written ones.  However, we
* do protect ourselves a little bit by checking whether the
* disjointness assumption leads to an impossible (out of range)
* probability; if so, we fall back to the normal calculation.
When the hash table is built on the IN-list, duplicate IN-list values are automatically eliminated during insertion, so we no longer risk summing the same MCV frequency multiple times. This makes the disjoint-probability estimate more robust and, in practice, slightly more accurate.
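For example (with made-up numbers): for WHERE x IN (1, 1, 2), where value 1 is an MCV with frequency 0.4 and value 2 is an MCV with frequency 0.1, summing per IN-list element gives 0.4 + 0.4 + 0.1 = 0.9, an overestimate, while the duplicate-free hash table yields the intended 0.4 + 0.1 = 0.5.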
One thing I initially missed is that there are actually three different places where ScalarArrayOpExpr is handled - the Const array case, the ArrayExpr case, and everything else - and the Const and ArrayExpr cases require different implementations of the same idea. In the Const case we can directly hash and probe the Datum values, while in the ArrayExpr case we must work on Node * elements, separating constant and non-constant entries and hashing only the constants. The current v2 therefore applies the same MCV-hash optimization in both branches, using two tailored code paths that preserve the existing semantics of how non-Const elements are handled by var_eq_non_const().
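For illustration, the ArrayExpr side boils down to a split like this (simplified, not the exact code from the patch; the helper name is made up):

```c
#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/*
 * split_array_elements: separate the Const and non-Const elements of an
 * ArrayExpr.  The non-null Const elements are what would go into the hash
 * table; everything else keeps going through var_eq_non_const() as before.
 */
static void
split_array_elements(ArrayExpr *arrayexpr,
                     List **const_elems, List **other_elems)
{
    ListCell   *lc;

    *const_elems = NIL;
    *other_elems = NIL;

    foreach(lc, arrayexpr->elements)
    {
        Node       *elem = (Node *) lfirst(lc);

        if (IsA(elem, Const) && !((Const *) elem)->constisnull)
            *const_elems = lappend(*const_elems, elem);
        else
            *other_elems = lappend(*other_elems, elem);
    }
}
```

In the Const-array branch no such split is needed: deconstruct_array() already yields plain Datums, which can be hashed and probed directly.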
If the MCV list is smaller than the IN-list, the behavior is the same as in v1 of the patch. If the IN-list is smaller, we instead build a hash table over the distinct constant elements of the IN-list and then (see the sketch after this list):
- Scan the MCV list and sum the frequencies of those MCVs that appear in
the IN-list;
- Count how many distinct, non-null constant IN-list elements are not present in the MCV list;
- Estimate the probability of each such non-MCV value using the
remaining frequency mass;
- Handle non-constant IN-list elements separately using
var_eq_non_const(), exactly as in the existing implementation.
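In simplified, standalone form (ignoring the null fraction and non-Const elements, using a toy hash set in place of simplehash.h, and working on plain long values rather than Datums), the arithmetic of this path looks roughly like this:

```c
#include <stdbool.h>
#include <stddef.h>

#define SET_SIZE 256            /* toy open-addressing hash set; assumes
                                 * fewer than SET_SIZE distinct constants */
typedef struct
{
    bool        used[SET_SIZE];
    long        keys[SET_SIZE];
} ConstSet;

/* Insert a key; returns true only if it was not already present. */
static bool
set_insert(ConstSet *set, long key)
{
    size_t      i = (size_t) key % SET_SIZE;

    for (; set->used[i]; i = (i + 1) % SET_SIZE)
        if (set->keys[i] == key)
            return false;       /* duplicate IN-list constant: dropped */
    set->used[i] = true;
    set->keys[i] = key;
    return true;
}

static bool
set_contains(const ConstSet *set, long key)
{
    size_t      i = (size_t) key % SET_SIZE;

    for (; set->used[i]; i = (i + 1) % SET_SIZE)
        if (set->keys[i] == key)
            return true;
    return false;
}

/*
 * Selectivity of "col = ANY (consts)" when the IN-list is smaller than the
 * MCV list: hash the distinct IN-list constants, scan the MCVs once, and
 * spread the leftover frequency mass over the non-MCV matches.
 */
static double
in_list_selectivity(const long *consts, int nconsts,
                    const long *mcv_vals, const double *mcv_freqs, int nmcv,
                    double ndistinct)
{
    ConstSet    set = {0};
    int         ndistinct_consts = 0;   /* distinct IN-list constants */
    int         nmatched = 0;           /* ... of those, found in the MCVs */
    double      sum_mcv_freq = 0.0;     /* total MCV frequency mass */
    double      selec = 0.0;

    /* Build the hash table over the IN-list; duplicates vanish here. */
    for (int i = 0; i < nconsts; i++)
        if (set_insert(&set, consts[i]))
            ndistinct_consts++;

    /* One pass over the MCVs: sum the frequencies of matched entries. */
    for (int i = 0; i < nmcv; i++)
    {
        sum_mcv_freq += mcv_freqs[i];
        if (set_contains(&set, mcv_vals[i]))
        {
            selec += mcv_freqs[i];
            nmatched++;
        }
    }

    /* Non-MCV constants share the remaining mass evenly. */
    if (ndistinct > nmcv)
        selec += (ndistinct_consts - nmatched) *
            (1.0 - sum_mcv_freq) / (ndistinct - nmcv);

    /* Out-of-range protection, as in the comment quoted above. */
    if (selec > 1.0)
        selec = 1.0;
    if (selec < 0.0)
        selec = 0.0;
    return selec;
}
```

The real code of course works on Datums with the type's hash and equality support, honors stanullfrac, and routes non-Const elements through var_eq_non_const(); the sketch only shows the shape of the computation.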
>> For each IN-list element, if a matching MCV is found, we add the
>> corresponding MCV frequency to the selectivity estimate. If no match is
>> found, the remaining selectivity is estimated in the same way as the
>> existing non-MCV path (similar to var_eq_const when the constant is not
>> present in the MCV list).
>>
> The code in master currently calls an operator-specific selectivity
> estimation function. For equality this is typically eqsel() but the
> function can be specified during CREATE OPERATOR.
>
> Can we safely special-case the behavior of eqsel() for all possible
> operators for the ScalarArrayOpExpr case?
Unfortunately there is no safe way to make this optimization generic for
arbitrary restrict functions, because a custom RESTRICT function does
not have to use MCVs at all. IMO, in practice the vast majority of
ScalarArrayOpExpr uses with = or <> rely on the built-in equality
operators whose selectivity is computed by eqsel()/neqsel(), so I
limited this optimization to those cases.
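In essence, the gate boils down to checking the operator's oprrest before taking the fast path, roughly like this (simplified, not the exact patch code; the function name is made up):

```c
#include "postgres.h"
#include "utils/fmgroids.h"     /* F_EQSEL, F_NEQSEL */
#include "utils/lsyscache.h"    /* get_oprrest() */

/*
 * Only use the MCV-hash fast path when the operator's restriction
 * estimator is the built-in eqsel()/neqsel().  A custom RESTRICT function
 * may not look at MCVs at all, so it keeps the existing per-element path.
 */
static bool
can_use_mcv_hash(Oid opno)
{
    RegProcedure oprrest = get_oprrest(opno);

    return oprrest == F_EQSEL || oprrest == F_NEQSEL;
}
```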
I’ve attached v2 of the patch. It currently uses two fairly large helper
functions for the Const and ArrayExpr cases; this is intentional to keep
the logic explicit and reviewable, even though these will likely need
refactoring or consolidation later.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Use-hash-based-matching-for-MCVs-in-ScalarArrayOp.patch | text/x-patch | 37.2 KB |