Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Frost <FROSTMAR(at)uk(dot)ibm(dot)com>
Cc: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Date: 2025-07-18 21:54:05
Message-ID: 2194366.1752875645@sss.pgh.pa.us
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I wrote:
> Well, we don't have a most common element in this scenario --- the
> whole point is that the occurrence counts resulting from the lossy
> counting algorithm are too low to be trustworthy. However, what we
> do have is the cutoff frequency, and it seems to me that we could use
> that as the estimate of the maximum frequency of the non-MCEs.

Here's a less crude patch for that. The array_typanalyze.c changes
are the same as before, but I reconsidered what to do about this
stumbling block:

> Assuming we like this direction, the main thing that makes this a hack
> not a polished patch is that I had to strongarm the code into storing
> a zero-length values array. What update_attstats would normally do
> is leave the values column of the MCELEM stats slot NULL, which then
> causes get_attstatsslot to throw a that-shouldn't-be-null error.
> An alternative answer is to change get_attstatsslot to allow a null,
> but I'm not sure that that's any cleaner. Either way it seems like
> there's a hazard of breaking some code that isn't expecting the case.

I concluded that letting get_attstatsslot accept nulls is too risky;
there is probably code that assumes that get_attstatsslot would've
rejected that. Instead, this version changes update_attstats' rule
for when to store an array rather than a null. Now it will do so
if the passed-in stavalues pointer isn't null, even if numvalues
is zero. I think that this doesn't change the outcome for any
existing analyze routines because they wouldn't pass a data pointer
if they have no values; and even if it does, storing an empty array
not a null seems like it should be pretty harmless.

> An alternative that feels cleaner but a good bit more invasive
> is to get rid of the convention of storing the min/max/nulls
> frequencies as extra entries in the MCELEM numbers entry ---
> which surely is a hack too --- and put them into some new slot
> type. I'm not sure if that's enough nicer to be worth the
> conversion pain.

A year ago I would have seriously considered doing it that way.
But now that we have code to dump-n-restore stats, that code would
have to be adjusted to convert the old representation. It's not
worth it for this case.

Hence, v1 attached, now with a commit message.

regards, tom lane

Attachment Content-Type Size
v1-0001-Track-the-maximum-possible-frequency-of-non-MCE-a.patch text/x-diff 4.3 KB

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Xuan Chen 2025-07-19 23:03:14 Question: Is it valid for a parent node's total cost to be lower than a child's total cost in EXPLAIN?
Previous Message Jerry Brenner 2025-07-18 13:20:03 Re: Is there a way to identify a plan generated by GECO?