Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Chengpeng Yan <chengpeng_yan(at)Outlook(dot)com>
Cc: Tatsuya Kawata <kawatatatsuya0913(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, John Naylor <johncnaylorls(at)gmail(dot)com>
Subject: Re: [PATCH] ANALYZE: hash-accelerate MCV tracking for equality-only types
Date: 2026-02-16 16:16:26
Message-ID: 04713091-2b05-4546-9f17-cca42119188f@tantorlabs.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 07.02.2026 16:39, Chengpeng Yan wrote:
>
> > On Feb 3, 2026, at 13:39, John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> >
> > To evaluate that, it would be easier if the patch were split into two
> > commits, one with the simple step approach, and one that changes to
> > the other approach.
> >
> > Also, we have some places that switch from e.g. an array to a hash
> > table once a collection gets above a certain threshold. In this case,
> > however, having less than 100 stat target is unheard of, in my
> > experience. It's only a tiny amount of code, but it doesn't seem worth
> > special-casing. Others may have a different opinion, and I'm not sure
> > we don't already have this behavior elsewhere.
>
> Thanks for the suggestion — I’ve applied it in v4 by splitting the
> changes into two commits.

I think it would be beneficial to split it into two logically
independent parts. Right now the patch introduces two different changes
at once:

1. Replacing the original shift-based singleton handling with the cursor
based eviction mechanism.
2. Adding the hash-based lookup.

Splitting the patch would make review easier because I can first focus
purely on the algorithmic transformation (shift -> cursor) and verify
semantic equivalence.

>
> Regarding the hash threshold: my initial approach followed existing
> postgres precedent for hash-based MCV matching. For example, commit
> 057012b205a (“Speed up eqjoinsel() with lots of MCV entries”) introduces
> EQJOINSEL_MCV_HASH_THRESHOLD in selfuncs.c, only enabling hashing once
> the MCV set is large enough to amortize the setup cost. The exact
> threshold differs, but the pattern is similar.
>
> Also, while 100 is the default statistics target, there are internal
> cases that use smaller values — e.g. vacuumdb staged analyze temporarily
> sets default_statistics_target to 1 and 10 in vacuuming.c — which was
> part of the motivation for keeping the thresholded behavior.
>
> That said, I’m happy to adjust or remove the threshold if the consensus
> is to simplify this further.

I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as
a direct reference point here. That threshold is somewhat heuristic as
well, and the cost trade-offs in eqjoinsel() are not identical to what
we have in compute_distinct_stats(). In particular, here the break-even
point will strongly depend on the data type and value width. For
example, with wider datums (e.g. bytea), the comparison cost increases,
and the threshold at which hashing amortizes its setup cost may shift
noticeably. We don't need to compute the exact optimal threshold - just
get a reasonable order.

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-02-16 16:16:57 Re: index prefetching
Previous Message Alexandre Felipe 2026-02-16 16:07:02 Re: Incremental View Maintenance, take 2