| From: | David Geier <geidav(dot)pg(at)gmail(dot)com> |
|---|---|
| To: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, Chengpeng Yan <chengpeng_yan(at)Outlook(dot)com> |
| Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Hash-based MCV matching for large IN-lists |
| Date: | 2026-02-02 09:29:16 |
| Message-ID: | ff7d491d-f084-4935-8417-c0b7285cdd89@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi!
> Attached v4 patch with above fixes.
Good progress!
I did another pass over the code, focusing on structure:
- MCVHasContext and MCVInHashContext are identical. MCVHashEntry and
MCVInHashEntry only differ by the count member. I would, as said before,
merge them and simply not use the count member for the join case.
- hash_mcv_in() and mcvs_in_equal() are identical to hash_mcv() and
mcvs_equal(). Let's remove the new functions and use the existing ones
instead, in the spirit of the previous point.
- The threshold constants are also identical. I would merge them into a
single, e.g. MCV_HASH_THRESHOLD, in the spirit of the previous two points.
- MCVHashTable_hash will then be interchangable with
MCVInHashTable_hash. So let's remove MCVInHashTable_hash, in the spirit
of the previous three points.
- Use palloc_array() instead of palloc() when allocating arrays.
- We can avoid allocating the all-true elem_const array by passing NULL
for elem_const to scalararray_mcv_hash_match(), and considering a NULL
pointer to mean "all elements are constant".
- The following comment got copy&pasted from eqsel_internal() twice. It
reads a little strange now because we're not punting here by immediately
returning like in eqsel_internal() but instead fallback to the original
code path. Maybe say instead "... falling back to default code path to
compute default selectivity" or something like that.
/*
* If expression is not variable = something or something =
* variable, then punt and return a default estimate.
*/
- The call to fmgr_info(opfuncoid, &eqproc) is currently under have_mcvs
but can be moved into the next if.
- elem_nulls and elem_const does have to be 0-initialized via palloc0().
All elements are set in the subsequent for-loop. I believe elem_values
also doesn't have to be 0-initialized via palloc0().
- Have you checked there there's test coverage for the special cases
(nvalues_non_mcv > 0, nvalues_nonconst > 0, IN contains NULL,
isEnequality==true, etc.)? If not let's add tests for these.
I'll do a 2nd iteration, focusing on correctness, once these comments
are addressed and I've got the SQL from you so that I can test the
corner cases manually.
--
David Geier
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Antonin Houska | 2026-02-02 09:35:29 | Re: Adding REPACK [concurrently] |
| Previous Message | Jakub Wartak | 2026-02-02 09:22:37 | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |