Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Matt Long <matt(at)mattlong(dot)org>, Mark Frost <FROSTMAR(at)uk(dot)ibm(dot)com>
Cc: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Date: 2025-09-09 19:19:46
Message-ID: 1209259.1757445586@sss.pgh.pa.us
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I wrote:
> * The selectivity functions believe that the upper bound on the
> frequency of non-MCEs is minfreq / 2, not the stored minfreq.
> This seems like complete brain fade: there could easily be
> elements with frequency just less than minfreq, and probably are
> if the data distribution follows Zipf's law. I did not dig into
> the git history, but I wonder if the divide-by-two business
> predates the introduction of the lossy-counting algorithm, and
> if so whether it was less insane with the original collection
> algorithm. In any case, this patch removes the divisions by 2,
> and makes some nearby cosmetic improvements.

In the light of morning, I started to have second thoughts about
this aspect of the patch. I checked the git history this time,
and found that the oldest instance of "minfreq / 2" dates to
4e57668da which originated from this discussion:

https://www.postgresql.org/message-id/flat/488DAEB8.3000402%40students.mimuw.edu.pl

It's already coded like that in Jan's initial patch, and there
was no discussion in the thread, so not a lot to be gleaned:

+ /*
+ * The element is not in MCELEM. Punt, but assure that the
+ * selectivity cannot be more than minfreq / 2.
+ */
+ return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2);

But looking at this, I think the problem is really that the comment
fails to describe his thought process. We know that the frequency
of this not-in-the-MCE-list value cannot be more than minfreq, but
we have no idea how much less it is. I think the idea of the
division might have been to assume that an "average" non-MCE value
would have a frequency about half that of the lowest MCE. It does
seem reasonable to use some number lower than the cutoff, although
I dunno if 0.5 is exactly the right factor.

So now I'm wondering if we should leave the divisions alone and
instead fix the comments to explain why they are there. Bolstering
this is that on the two test cases you guys submitted, the patch
with the divisions gets a spot-on estimate (1 row) while removing
the divisions yields an estimate of 2 rows. I don't put a lot of
weight on that, since these are toy examples. But I wonder if you
guys could test the patch on some of your real-world cases and
see if it looks like we should keep the divisions.

regards, tom lane

In response to

Browse pgsql-performance by date

  From Date Subject
Previous Message Tom Lane 2025-09-08 23:37:01 Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column