Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matt Long <matt(at)mattlong(dot)org>
Cc:	Mark Frost <FROSTMAR(at)uk(dot)ibm(dot)com>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject:	Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Date:	2025-09-08 23:37:01
Message-ID:	987464.1757374621@sss.pgh.pa.us
Lists:	pgsql-performance

Matt Long <matt(at)mattlong(dot)org> writes:
> Not to let perfect be the enemy of better, but we're facing a variant of
> this issue that would not be addressed by the proposed patch.
> ...
> In this case, the effects of the proposed patch are not applied since the
> most_common_elems array is not empty. I'm not a statistician, so maybe this
> wouldn't be valid, but it seems like using the highest frequency of an
> element that did not qualify for the mce list instead of the 0.5% default
> frequency could be an elegant, but more invasive solution.

Yeah, I think you are quite right: we can apply this idea not only
when the MCE list is empty, but whenever we didn't have to truncate
the MCE list. In that case we know there are no additional element
values that exceed the cutoff frequency, and that's what the
selectivity functions want to know.
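
To make the reasoning concrete, here is a minimal sketch, not the patch itself and not PostgreSQL's actual code. The function name, its parameters, and the choice of bound in the truncated case are all assumptions made for illustration. It shows that when the list is not truncated, the cutoff frequency is itself a valid upper bound for every element left out of the list.

```python
from collections import Counter


def build_mce_stats(rows, cutoff_freq, max_entries):
    """Return (mce_list, nonmce_upper_bound) for a list of element arrays.

    Hypothetical helper for illustration only; assumes max_entries >= 1.
    """
    nrows = len(rows)
    # Count each distinct element at most once per row.
    counts = Counter(elem for row in rows for elem in set(row))
    freqs = sorted(((c / nrows, e) for e, c in counts.items()), reverse=True)

    # Only elements at or above the cutoff frequency qualify for the list.
    qualifying = [(f, e) for f, e in freqs if f >= cutoff_freq]
    truncated = len(qualifying) > max_entries
    mce = qualifying[:max_entries]

    if truncated:
        # Assumption: a dropped element may be as frequent as the last one kept.
        bound = mce[-1][0]
    else:
        # Nothing outside the list reached the cutoff, so the cutoff bounds
        # every non-MCE frequency. This also covers an empty list.
        bound = cutoff_freq
    return mce, bound


rows = [["a", "b"], ["a"], ["a", "c"], ["b"]]
not_truncated = build_mce_stats(rows, cutoff_freq=0.4, max_entries=10)
truncated = build_mce_stats(rows, cutoff_freq=0.4, max_entries=1)
empty_list = build_mce_stats(rows, cutoff_freq=0.9, max_entries=10)
```

In this toy data the non-truncated and empty-list cases both report the cutoff as the bound, while the truncated case has to fall back to the frequency of the last element kept.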

Nosing around in the code that uses STATISTIC_KIND_MCELEM entries,
I spotted two additional issues that the attached v2 patch addresses:

* ts_typanalyze/ts_selfuncs have code essentially identical to the
array case, and should receive the same treatment.

* The selectivity functions believe that the upper bound on the
frequency of non-MCEs is minfreq / 2, not the stored minfreq.
This seems like complete brain fade: there could easily be
elements with frequency just less than minfreq, and probably are
if the data distribution follows Zipf's law. I did not dig into
the git history, but I wonder if the divide-by-two business
predates the introduction of the lossy-counting algorithm, and
if so whether it was less insane with the original collection
algorithm. In any case, this patch removes the divisions by 2,
and makes some nearby cosmetic improvements.
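
As a rough illustration of the second point, the sketch below builds an idealized Zipf distribution and compares the most frequent element left out of the list against minfreq and minfreq / 2. The list size of 100, the 1000 distinct elements, and the exponent of 1 are arbitrary assumptions, and real element frequencies (fraction of rows containing an element) need not sum to 1 as they do here.

```python
def zipf_frequencies(n, s=1.0):
    """Normalized Zipf frequencies for ranks 1..n with exponent s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


freqs = zipf_frequencies(1000)

k = 100                    # assumed number of elements kept in the MCE list
minfreq = freqs[k - 1]     # frequency of the least common element kept
next_freq = freqs[k]       # frequency of the most common element left out

# With exponent 1 the ratio is k / (k + 1), so the first non-MCE sits
# just below minfreq and far above minfreq / 2.
ratio = next_freq / minfreq
print(f"minfreq={minfreq:.6f} next={next_freq:.6f} ratio={ratio:.4f}")
```

Under these assumptions the first excluded element has about 99% of minfreq, so halving minfreq would underestimate the true upper bound by nearly a factor of two.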

Many thanks for the suggestion!

regards, tom lane

Attachment	Content-Type	Size
v2-0001-Track-the-maximum-possible-frequency-of-non-MCE-a.patch	text/x-diff	14.7 KB

In response to

Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column at 2025-09-08 19:13:41 from Matt Long
