Re: Add expressions to pg_restore_extended_stats()

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(at)vondra(dot)me>
Subject: Re: Add expressions to pg_restore_extended_stats()
Date: 2026-02-04 04:05:58
Message-ID: CADkLM=cPB_V+oV5d+5bfZGA_1Loa7iMmqXWj+SM-YGzxoUYrcQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
>
> The code paths that catch my attention is how the data is inserted,
> though, as in serialize_expr_stats(), and we have the following thing
> that skips the insertion of a pg_statistic tuple for an expression
> with invalid data (I thought this should be the case, now I am sure it
> is the case):
> if (!stats->stats_valid)
> {
> astate = accumArrayResult(astate,
> (Datum) 0,
> true,
> typOid,
> CurrentMemoryContext);
> continue;
> }
>
> So depending on what ANALYZE thinks in terms of what data is valid or
> not, it is entirely possible that we don't insert any data all all for
> some of the expressions.
>

+1

"valid" in this case could also mean "totally inconclusive and therefore
not worth storing".

> That pretty much answers the previous question I had that I could not
> answer to yet a few mails ago: the restore function needs to be
> flexible with the restore of expressions, and it looks like we should
> not expect all the inputs to be set. It is OK to not restore any data
> at all for some of the expressions, or skip the all set of expressions
> if nothing is given in input because we found no expression stats data.
>

I'm all for tolerating an undercount in exported statistics. But this got
me wondering how the system view handles lining up stats to its expression
counterpart when there's an undercount.

And, uh, well, it doesn't:

FROM pg_statistic_ext s JOIN pg_class c ON (c.oid = s.stxrelid)
LEFT JOIN pg_statistic_ext_data sd ON (s.oid = sd.stxoid)
LEFT JOIN pg_namespace cn ON (cn.oid = c.relnamespace)
LEFT JOIN pg_namespace sn ON (sn.oid = s.stxnamespace)
JOIN LATERAL (
SELECT unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS
expr,
unnest(sd.stxdexpr)::pg_statistic AS a
) stat ON (stat.expr IS NOT NULL)

Two unnests like that will just null-pad the shorter list, so if there is a
missing pg_statistic, then the system view could be giving us stats from a
different expression. Now, we would probably detect that if the data types
don't line up...but the data types could be compatible enough...ick.

>
> As far as I can see, the proposed patch is incorrect and inconsistent
> with the backend regarding all that. For example:
> CREATE TABLE test (name text);
> CREATE STATISTICS stats_exprs (dependencies)
> ON lower(name), upper(name)
> FROM test;
>
> So based on my read of the code, we could expect zero, one or two
> pg_statistic tuples based on what ANALYZE thinks. The restore
> functions expects always two elements in its inner array. IMO, we
> should have the flexibility to pass down 0, 1 or 2 elements in this
> 1-D array, without failing as long as the elements of the input data
> have valid data.
>
> The patch allows an object pattern like this one, for example, with a
> second object being empty:
> 'exprs', '[{"avg_width": "7", [all valid fields] .. }, {}]
>
> Unfortunately, on such input we insert TWO pg_statistic tuple. We
> could rely on the order of the items in the 1-D array, but it seems to
> me that we will have a much easier life if we shape the input data
> based on the following rules, matching with the policy that ANALYZE
> can allow:
> - An input array can have as many elements as the number of
> expressions.
>

+1

> - The expression a single expression refers ought to be tracked in
> each element, individually. This means at least the addition of a
> negative attribute number in EACH element (I am entirely rejecting my
> wilder idea of the expression text at this stage). If the "exprs"
> data has no element for a given expression, that's fine, this has the
> same meaning as no tuples found in pg_statistic for an expression.
>

I agree in principle, but I also fear that the system view may be wrong, so
even if we lined up the expression text from pg_statistic_ext with the
expression text from pg_stats_ext, the view itself would have already
mis-identified which stats go where, and we have no direct access to
pg_statistic_ext_data because of the security barrier.

And before you ask, there is *nothing* in pg_statistic_ext_data that
indicate which expression it belongs to in the pg_statistic_ext beyond its
position within the array, and we now know that isn't guaranteed.

- An element in the input array should have all its key/value pairs
> set, if set for an expression.
>

That won't fly, because there will be pg_dumps from prior versions that
won't know about future keys which would then be required.

> Does this analysis make sense to you?
>

It does, but I'm not seeing how we line up the right stats to the right
element in the event of an undercount. I've been digging around in the
planner, and it seems like it just takes the
Anum_pg_statistic_ext_data_stxdexpr, expands it, and then
statext_expressions_load just assumes that the array will have something at
that index. I hope I'm misreading this, because we may have uncovered
something broken.

Then again it could be that expressions always have stats built for them,
so the undercount never actually happens, thus the arrays all line up.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2026-02-04 04:46:22 Re: [Proposal] Adding Log File Capability to pg_createsubscriber
Previous Message wenhui qiu 2026-02-04 03:53:08 Re: Convert NOT IN sublinks to anti-joins when safe