Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Frost <FROSTMAR(at)uk(dot)ibm(dot)com>
Cc: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Date: 2025-06-07 01:29:54
Message-ID: 1751511.1749259794@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Mark Frost <FROSTMAR(at)uk(dot)ibm(dot)com> writes:
> Actually *any* most_common_elems stats would be fine, because the reasoning is:
> * If the searched element is in most_common_elems we know it's frequency
> * If it's not, it's less frequent than the least most_common_elems
> So in our case when every row is unique, we'd only actually need stats to record a single most_common_elems (if only it would record one)

Well, we don't have a most common element in this scenario --- the
whole point is that the occurrence counts resulting from the lossy
counting algorithm are too low to be trustworthy. However, what we
do have is the cutoff frequency, and it seems to me that we could use
that as the estimate of the maximum frequency of the non-MCEs.

Attached is an extremely crude prototype patch for that. What it
does is to create an MCELEM stats entry with zero MCEs, but containing
min and max frequencies equal to the cutoff frequency (plus the nulls
frequency, which we know accurately in any case). Mark, this fixes
your example case, but I wonder if it fixes your original problem ---
are you in a position to test it?

Assuming we like this direction, the main thing that makes this a hack
not a polished patch is that I had to strongarm the code into storing
a zero-length values array. What update_attstats would normally do
is leave the values column of the MCELEM stats slot NULL, which then
causes get_attstatsslot to throw a that-shouldn't-be-null error.
An alternative answer is to change get_attstatsslot to allow a null,
but I'm not sure that that's any cleaner. Either way it seems like
there's a hazard of breaking some code that isn't expecting the case.

An alternative that feels cleaner but a good bit more invasive
is to get rid of the convention of storing the min/max/nulls
frequencies as extra entries in the MCELEM numbers entry ---
which surely is a hack too --- and put them into some new slot
type. I'm not sure if that's enough nicer to be worth the
conversion pain.

Thoughts?

regards, tom lane

Attachment Content-Type Size
wip-always-record-minimum-MCE-frequency.patch text/x-diff 2.1 KB

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Dimitrios Apostolou 2025-06-10 00:51:01 [PATCH] ALTER TABLE ADD FOREIGN KEY to partitioned table, is not parallelized
Previous Message Mark Frost 2025-06-06 09:21:34 RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column