| From: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
|---|---|
| To: | Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com> |
| Cc: | Tatsuya Kawata <kawatatatsuya0913(at)gmail(dot)com>, David Geier <geidav(dot)pg(at)gmail(dot)com>, Chengpeng Yan <chengpeng_yan(at)outlook(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Hash-based MCV matching for large IN-lists |
| Date: | 2026-02-25 22:45:44 |
| Message-ID: | 9067a807-f130-4631-8df4-076c6d7e43b6@tantorlabs.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
I've addressed the review comments mentioned above.
David made a very good observation: for unique columns, where each
iteration effectively returns the same per-element selectivity, there is
no need to iterate at all. In such cases we can reduce the computation
to a closed-form expression, i.e. O(1) instead of running the loop O(N).
I applied this idea to unique columns and cases falling back to
DEFAULT_EQ_SEL. In both cases the loop can be replaced with a
closed-from formula implemented in calculate_combined_selectivity(). The
formula mirros the existing independent/disjoint probability model: ANY
(sel = 1 - (1 - s) ^ length or length * s ), ALL (sel = s ^ length or 1
- length*(1 - s)). It would be good to carefully review that this is
fully equivalent to the current accumulation logic.
I also exprimented with applying the same idea to elements that are not
found in MCV, are not Const, and effectively found in MCV with more than
one count. Those cases can still be accumulated using
accum_scalararray_prob(), but potentially grouped to reduce repeated work.
Overall, the optimization work can be logically split into three parts:
1. Degenerate NULL case O(N) -> O(1) [0]
2. Identical non-NULL per-element selectivity O(N) -> O(1) (can be
split into a separate thread if prederred)
3. MCV matching via hashing O(M*N) -> O(M+N) (current thread)
Feedback on how to best structure or split this work would be appreciated.
About op_is_reserved. It seems we should assign op_is_reserved = true,
because we don't reverse types like eqjoinsel_semi(). If IN-list smaller
than MCV-list we reverse it by fmgr_info(hash_mcv ? hashLeft :
hashRight, &hash_proc). Thanks for this remark.
Thoughts?
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/
| Attachment | Content-Type | Size |
|---|---|---|
| v6-0003-Use-hash-based-MCV-matching-for-ScalarArrayOpExpr.patch | text/x-patch | 17.5 KB |
| v6-0002-Use-O-1-selectivity-formula-for-eqsel-neqsel-IN-A.patch | text/x-patch | 6.4 KB |
| v6-0001-Reduce-planning-time-for-large-NOT-IN-lists-conta.patch | text/x-patch | 4.9 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Zsolt Parragi | 2026-02-25 23:18:04 | Re: amcheck: add index-all-keys-match verification for B-Tree |
| Previous Message | Tom Lane | 2026-02-25 22:32:11 | Fixing grouping expressions inside subqueries |