| From: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz> |
| Cc: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Add expressions to pg_restore_extended_stats() |
| Date: | 2026-02-02 08:21:35 |
| Message-ID: | CADkLM=dHfy2YT1J57TNt4KViqxHGHiEU-72hKA_TX1gb2u==Ow@mail.gmail.com |
| Lists: | pgsql-hackers |
>
> This one would not actually exist if using a 1-D text[] with elements
> made of key/value/key/value as the array deconstruction would take
> care of that, wouldn't it?
>
If it were 1-D, we'd have to know where one expression ended and the next
one started. If you're asking what I think you're asking, an array like:
array[
  array['null_frac','0.4','avg_width', '...', ...],
  array['null_frac','0.5','avg_width', '...', ...]
]
this would force us either to require a fixed order for the keys (in which
case, why bother?) or to loop through each 1-D array, comparing each
keyword against the list of known keywords so we can put the values in
order (and resolve duplicates). It's not the cleanest thing, but I have
done it before, and it eventually resulted in
stats_fill_fcinfo_from_arg_pairs(), which could be adapted to handle this
as well.
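(To illustrate the pairing problem with a flat key/value layout, a toy
sketch, not the patch's actual format:

SELECT s.kv[i] AS key, s.kv[i + 1] AS value
FROM (VALUES (array['null_frac','0.4','avg_width','8'])) AS s(kv),
     generate_series(1, array_length(s.kv, 1) - 1, 2) AS i;

Every consumer has to redo that key-to-slot matching, which is the loop
described above.)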
>
> Hmm. We'd still need to deal with buggy inputs even if this
> conversion is enforced in pg_dump.
>
True, but we're already doing that in both the 2-D text array version and
the json version; the only difference is the number of hoops we have to
jump through to get an input to a *Safe() input function.
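(Tangentially, the soft-error path those *Safe() functions use has been
visible from plain SQL since v16, which makes for easy sanity checks:

SELECT pg_input_is_valid('not-a-number', 'float4');  -- false, no ERROR
SELECT * FROM pg_input_error_info('not-a-number', 'float4');

That's just an illustration of the behavior, not the patch's code path.)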
>
> > - utility function for converting jbvString values to TextDatum.
> > - utility function to simplify json iteration in a simple key/value
> > object
> > - deciding that unknown parameters aren't worth a warning
>
> Do you actually need check_valid_expr_argnames() at all?
No, not if we're OK with silently dropping the "ok, with warning: extra
exprs param" test. It seems like something we should report, but also
something we shouldn't fail on.
> At the end
> of import_pg_statistic() we know that all the parameters have been
> found.
We don't, actually. Any parameters that aren't found would be left at
their default values without error. It would be easy enough to set a flag
to true for each one found and report on that, but I haven't implemented
that yet.
> So we could just check the number of values fetched and
> complain based on NUM_ATTRIBUTE_STATS_ELEMS? That would remove some
> code without impacting safety. key_lookup_notnull() could be changed
> to return three possible states rather than a boolean: okay with a
> NULL value, okay with a not-NULL value, or error if the key is not
> found in the JSON blob or the input value fails to parse (infinity,
> input function error, etc). Then increment the counter if one of the
> first two states is found, as NULL can be valid for some of the
> key/value pairs.
>
In the situation you're describing, the *lack* of an expected key is
itself grounds for a WARNING and for failing the whole
import_expressions(), assuming the version parameter indicated that
servers of that vintage knew about that parameter. I'd be OK with that.
>
> jbv_get_float4_datum() and jbv_get_int4_datum() are alike; they could
> be refactored as a single function that takes a type name (for the
> error message) and an input function pointer. That would cut some
> code.
>
Yes, a bit. That's also true of their predecessors text_to_float4 and
text_to_int4, so it's a savings, but not unique to this implementation.
>
> > So...does this look better? Is it worth it?
>
> Well, it does to me. The pg_dump bits based on jsonb_build_object()
> are rather attractive in shape. This addresses the concerns I had
> about item ordering and uniqueness of the keys in a natural way
> through the data type filtering.
>
I had actually figured that the pg_dump part would turn you OFF of this
approach.
The JSONB type is a bit of a saving grace in that if you had duplicate
input keys {"x": "y", "x": "z"} it'll rewrite that as {"x": ["z","y"]} and
that *is* easy for us to catch in the existing tooling. That's one of the
few areas where I liked what the jsonb api did for us.
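(For reference, the pg_dump-side query can be roughly this shape; the
column names are from pg_stats_ext_exprs, and the stats name is a
stand-in:

SELECT e.expr,
       jsonb_build_object(
           'null_frac',   e.null_frac,
           'avg_width',   e.avg_width,
           'n_distinct',  e.n_distinct,
           'correlation', e.correlation) AS expr_stats
FROM pg_stats_ext_exprs AS e
WHERE e.statistics_name = 'some_stat';

A jsonb object can't carry duplicate keys at all, which is the data-type
filtering mentioned above.)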
> I don't see a need for key_lookup(); its sole caller is
> key_lookup_notnull().
>
A lot of this code was exhumed from the v1 patch of the relation/attribute
stats functions [1], and those functions can be taken further to verify
the JSON type of the value (in our case we'd only want numeric/string test
variants, or perhaps only the string-test variant).
>
> + 'exprs',
> '{{0,4,-0.75,"{1}","{0.5}","{-1,0}",-0.6,NULL,NULL,NULL},{0.25,4,-0.5,"{2}","{0.5}",NULL,1,NULL,NULL,NULL}}'::text[]);
>
> This is in the documentation, and would need fixing.
>
Lots of things need fixing if we're going this route. Part of this patch
was to see if the results horrified you, and since they clearly haven't,
the fixing is now worth it.
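(For example, assuming the positional values in the text[] above map to
the pg_stats column order, the jsonb equivalent would be something like:

'exprs', '[{"null_frac": 0, "avg_width": 4, "n_distinct": -0.75,
            "most_common_vals": "{1}", "most_common_freqs": "{0.5}",
            "histogram_bounds": "{-1,0}", "correlation": -0.6},
           {"null_frac": 0.25, "avg_width": 4, "n_distinct": -0.5,
            "most_common_vals": "{2}", "most_common_freqs": "{0.5}",
            "correlation": 1}]'::jsonb

with the exact key names and nesting still to be settled.)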
> +WARNING: invalid expression elements most_common_vals and
> most_common_freqs: conflicting NULL/NOT NULL
>
> For this warning in the regression tests, you should only need one
> element, reducing the number of input lines?
>
That's true IF we decide that missing expression keys are something we
allow, and per the conversation above it's not clear yet that we do.
> The patch has a query checking the contents of pg_stats_ext_exprs for
> test_mr_stat, no imports. This links to the multirange case mentioned
> in import_pg_statistic().
>
There *is* an input; we just have to check the output from pg_stats_ext()
first, and then pg_stats_ext_exprs(). I suppose I could combine the
queries but chose not to, for simplicity.
As those comments mention, there are attribute-level stat types specific
to ranges that are not reflected in extended stats via ANALYZE, so
importing expressions on test_mr isn't all that interesting, but it's
there for completeness.
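(I.e. the test checks roughly this, in two steps, with illustrative
column picks:

SELECT statistics_name, kinds
FROM pg_stats_ext WHERE tablename = 'test_mr';

SELECT statistics_name, expr, null_frac
FROM pg_stats_ext_exprs WHERE tablename = 'test_mr';

Combining them into one query is possible, just less readable.)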
>
> Something that looks missing here is that we do not attach to which
> negative attribute an item of "exprs" links to. It seems to me that
> this needs to include an extra key called "attribute" with a negative
> associated with it, so as we can rely on the attnum instead of the
> order of the items restored.
>
Oof. There is no such attribute number exposed in pg_stats_ext_exprs();
we have to rely on the deterministic order in which they are fetched from
the view. The "expr" text definition is essentially a node tree, which we
can't guarantee stays stable across major versions, so that's no help.
jsonb_agg() doesn't guarantee order on its own, but it can be enforced
via an ORDINALITY-like trick. I had to do this before, so I can excavate
that code again. Thanks for the reminder.
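(Roughly this shape; the unnest() input is a stand-in for the real
per-expression query:

SELECT jsonb_agg(elem ORDER BY ord)
FROM unnest(array['{"null_frac": 0.5}', '{"avg_width": 1}']::jsonb[])
     WITH ORDINALITY AS t(elem, ord);

The ordinality column gives the aggregate a stable sort key, so the
output array order matches the fetch order.)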
> (Many other corrections redacted).
+1 to all cases.
Since you're clearly on board with the jsonb idea, I'll make the next
version all about cleanups and tightening up this method.
The outstanding questions from the above are, given a jsonb tree of [ {
"null_frac": 0.5, ...}, {"avg_width": "1", ...}], which for notation I'll
call an array of expr1 and expr2:
1. If an expected key is missing from either expr1 or expr2, is that worth
failing the whole import_expressions()?
2. Is an unexpected key in either expr1 or expr2 worth failing the whole
import_expressions()?
3. If the answer to #2 is no, do we need to warn about the existence of the
weird parameter at all, or just silently skip it?
4. Does casting the numeric scalar values (null_frac, correlation,
avg_width, n_distinct) to text make sense, since we have to put them
through type-specific input functions anyway?
5. Do any of these jsonb-twiddling operations look generically useful
enough to put in jsonb_utils.c?
(my answers would be: 1. no; 2. no; 3. we can't skip it, we have to issue
a warning on the first weird param, but it doesn't fail anything; 4. yes;
and 5. maybe jbvString to cstring and/or jbvString to TextDatum)