Re: Hash-based MCV matching for large IN-lists

From: David Geier <geidav(dot)pg(at)gmail(dot)com>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, Chengpeng Yan <chengpeng_yan(at)Outlook(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hash-based MCV matching for large IN-lists
Date: 2026-02-02 09:29:16
Message-ID: ff7d491d-f084-4935-8417-c0b7285cdd89@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

> Attached v4 patch with above fixes.

Good progress!

I did another pass over the code, focusing on structure:

- MCVHasContext and MCVInHashContext are identical. MCVHashEntry and
MCVInHashEntry only differ by the count member. I would, as said before,
merge them and simply not use the count member for the join case.

- hash_mcv_in() and mcvs_in_equal() are identical to hash_mcv() and
mcvs_equal(). Let's remove the new functions and use the existing ones
instead, in the spirit of the previous point.

- The threshold constants are also identical. I would merge them into a
single, e.g. MCV_HASH_THRESHOLD, in the spirit of the previous two points.

- MCVHashTable_hash will then be interchangable with
MCVInHashTable_hash. So let's remove MCVInHashTable_hash, in the spirit
of the previous three points.

- Use palloc_array() instead of palloc() when allocating arrays.

- We can avoid allocating the all-true elem_const array by passing NULL
for elem_const to scalararray_mcv_hash_match(), and considering a NULL
pointer to mean "all elements are constant".

- The following comment got copy&pasted from eqsel_internal() twice. It
reads a little strange now because we're not punting here by immediately
returning like in eqsel_internal() but instead fallback to the original
code path. Maybe say instead "... falling back to default code path to
compute default selectivity" or something like that.
/*
* If expression is not variable = something or something =
* variable, then punt and return a default estimate.
*/

- The call to fmgr_info(opfuncoid, &eqproc) is currently under have_mcvs
but can be moved into the next if.

- elem_nulls and elem_const does have to be 0-initialized via palloc0().
All elements are set in the subsequent for-loop. I believe elem_values
also doesn't have to be 0-initialized via palloc0().

- Have you checked there there's test coverage for the special cases
(nvalues_non_mcv > 0, nvalues_nonconst > 0, IN contains NULL,
isEnequality==true, etc.)? If not let's add tests for these.

I'll do a 2nd iteration, focusing on correctness, once these comments
are addressed and I've got the SQL from you so that I can test the
corner cases manually.

--
David Geier

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Antonin Houska 2026-02-02 09:35:29 Re: Adding REPACK [concurrently]
Previous Message Jakub Wartak 2026-02-02 09:22:37 Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?