Re: Add expressions to pg_restore_extended_stats()

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add expressions to pg_restore_extended_stats()
Date: 2026-02-02 00:39:04
Message-ID: CADkLM=du+OcctrsTk+hZryUGy=0OnPep-3LdGzut1nqF391+Eg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 30, 2026 at 3:07 AM Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
wrote:

> On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael(at)paquier(dot)xyz>
> wrote:
>
>> On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
>> > 3. Keeping the 2-D text array in #1, but each "row" is a list of
>> > kwargs-like pairs like the arguments used in pg_restore_attribute_stats
>> > (i.e.
>> ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...]
>> >
>> > 4. JSON. The outer structure would be an array of objects, each object
>> > would be a key-value.
>>
>> I'd still favor 4 on the grounds that it's easier to edit and read,
>> which would be more in line with the MCV, dependencies, ndistinct and
>> att/rel statistics. The format proposed in the attached patch is hard
>> to work with, anyway. Now, I do take your point about composite
>> record values cast into a single text value being potentially
>> confusing (double-quote issues, I guess?), so perhaps a text[] as in
>> 3 would be better adapted for readability.
>
>
> Hmm, maybe it isn't so bad:
>
> SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
> text
> ---------------------------------------------------
> {"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
> (1 row)
>
> SELECT
> array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
> array_to_json
> ---------------------------------------------------
> ["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
> (1 row)
>
> Mind you, this is an ANYARRAY first cast to text. If we cast
> pg_stats_ext_exprs.most_common_values directly to JSON then it drills
> down into the innards of the composite values, because it can see the
> local datatypes, and that breaks our ability to use regular input
> functions. I learned that the hard way when using JSON to serialize
> attribute stats when this effort first began.
>
> Before seeing that, I wanted to try option 3 first, as it brings
> clarity with no real increase in tooling beyond looking for repeated
> keywords, but if you're set on JSON then I'll try that first.
>
>
>> We could also force some checks based
>> the order of the arguments in the input array, so as duplicates are
>> not an issue, I guess?
>>
>
> If we're doing a kwargs-thing then I may as well just track which keywords
> were already used. We already bail out on the whole expressions array at
> the first sign of inconsistency, so it's not like we have to decide which
> of the duplicates to keep.
>
> Structurally, I feel that import_expressions() is overcomplicated, and
>> with the correct structure tracking the state of each field, I would
>> try to reduce a bit the duplication that the patch presents, aiming at
>> less callers of statatt_build_stavalues() and statatt_set_slot(),
>> perhaps.
>>
>
> I don't think we can get around those. It's a limitation of how the
> sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works. We want
> to fill in each stakind as we find it, and we don't know how many of them
> we've already filled out. An array of records would have been better, but
> we've got 5 parallel arrays of scalars and we have to live with it.
>

Ok, so I refactored the exprs input to be a jsonb array, and the good
news is that it results in legitimately more readable test cases. See
below:

-- ok, with warning: extra exprs param
SELECT pg_catalog.pg_restore_extended_stats(
'schemaname', 'stats_import',
'relname', 'test',
'statistics_schemaname', 'stats_import',
'statistics_name', 'test_stat_mcelem',
'inherited', false,
'exprs', '[
{
"avg_width": 33,
"null_frac": 0,
"n_distinct": -1,
"correlation": 1,
"histogram_bounds":
"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}",
"most_common_vals": null,
"most_common_elems": "{-1,0,1,2,3}",
"most_common_freqs": null,
"elem_count_histogram":
"{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.5}",
"most_common_elem_freqs":
"{0.25,0.25,0.5,0.25,0.25,0.25,0.5,0.25}",
"bad_param": "text no one will ever parse"
}
]'::jsonb);
WARNING: malformed expr expression: "bad_param": is not an expression key
name, value ignored

The downside is that it results in 200 more lines of code in
extended_stats_funcs.c and 500 more lines of test output in
stats_import.sql.

There are a few reasons for this:

- It's up to us to validate the [{expr},{expr},...] structure ourselves,
something a 2-D text array did rather succinctly.
- null is a valid jsonb value, and we have to check for it. This quirk
alone results in 4 more test cases, though some other test cases did go
away.
- JSON strings are not NUL-terminated C strings, so we have to convert
them first to C strings and then to text datums in order to leverage the
existing attribute stats functions.

I briefly thought about making the exprs input parameter jsonb[], thus
saving some structure checks, but what little that buys us is taken
away by making the pg_dump command that much more complex.

There are probably some simplifications that can happen:

- casting all stat values to text, to avoid having to deal with the
neither-fish-nor-fowl JSON numeric values.
- utility function for converting jbvString values to TextDatum.
- utility function to simplify json iteration in a simple key/value object
- deciding that unknown parameters aren't worth a warning

So...does this look better? Is it worth it?

Attachment Content-Type Size
v2jsonb-0001-Add-support-for-exprs-in-pg_restore_extended.patch text/x-patch 94.8 KB
