Add expressions to pg_restore_extended_stats()

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Add expressions to pg_restore_extended_stats()
Date: 2026-01-30 05:08:49
Message-ID: CADkLM=fPcci6oPyuyEZ0F4bWqAA7HzaWO+ZPptufuX5_uWt6kw@mail.gmail.com
Lists: pgsql-hackers

This is a new thread that continues the work [1] of the Extended Statistics
set/restore/clear functions thread [2] which itself was a continuation of
the work [3] of the Statistics Export and Import thread [4], all of which
is too much for anyone to review, so I'll give a recap:

A longstanding complaint about postgres has been the amount of time
required to rebuild stats after a major version upgrade. During that
window the database is online but under heavy I/O, and queries of any
real complexity get _terrible_ query plans, because they lack the
optimizer statistics that the previous major version had just moments
ago but that were not carried over to the new version.

Version 18 introduced the ability to import stats at the relation level
and attribute level [3], and those stats are now dumped and restored by
default via pg_upgrade. This meant that most customers could reduce the
time in which the database was online but in a state of degraded
performance. It was, however, not a complete solution, because it still
lacked statistics for extended statistics objects (i.e. CREATE STATISTICS
xyz ...) and custom statistic kinds like those found in extensions such
as PostGIS. Still, this made things better for 99% of installations, and
while it was not trivial to determine whether a given instance was in the
unlucky 1%, enhancements were made to vacuumdb [5] to detect which tables
were missing statistics and analyze just those tables, thus reducing the
scope of the I/O-intensive rebuild period for those instances.

Work in this 19-dev cycle has sought to close that 1% gap by importing
statistics for extended statistics objects. These stats are quite a bit
more complicated than their relation and attribute equivalents, but good
progress has been made [1], resulting in the carryover of many statistics
types: ndistinct, dependencies, and MCV. All of them, that is, except the
statistics associated with expressions in the definition of the statistics
object (e.g. CREATE STATISTICS xyz ON (upper(name)), ...). This last type
of statistics has proved to be a tough nut to crack, for reasons I will
describe in detail. We could stop here, but if we did we would actually
create work for vacuumdb, specifically in the code that processes the
--missing-stats-only option. That code currently looks for extended
statistics data rows (pg_statistic_ext_data) that match object definition
rows (pg_statistic_ext), and considers any match sufficient for "not
missing stats". That assumption would no longer hold for stats objects
that have expressions, because they would still be missing their stxdexpr
stats. While we could teach vacuumdb that distinction, we could instead
solve the expressions problem and close the statistics gap even
further [6].
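
To make that concrete, here is a hedged sketch of the check described
above; this is not vacuumdb's literal query, just the shape of the
assumption that would break:

    -- an object is considered "not missing stats" if any data row
    -- matches its definition row
    SELECT e.stxname
      FROM pg_catalog.pg_statistic_ext e
     WHERE NOT EXISTS (SELECT 1
                         FROM pg_catalog.pg_statistic_ext_data d
                        WHERE d.stxoid = e.oid);
    -- once ndistinct/dependencies/MCV are restored, a data row exists
    -- even though stxdexpr is still NULL, so "row exists" would no
    -- longer imply "nothing missing"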

We have a working but not thoroughly tested implementation (attached).
There remains one sticky problem: the serialization format of the
statistics stored in pg_statistic_ext_data.stxdexpr. The type of that
attribute is pg_catalog.pg_statistic[], which is to say an array of
records whose length equals the number of expressions in the extended
statistics object. pg_statistic is where attribute statistics are
imported, so the structure has all the complexity of attribute stats
itself, slightly reduced by the fact that the fields describing the
attribute are left as InvalidOid values, but still quite complicated.
Several of the attributes in pg_statistic are of type ANYARRAY, because
they are most-common-values/histogram/most-common-elements arrays, each of
which has a composite datatype determined by the datatype of the
expression(s) and other columns in the object definition. This presents a
problem for utilities like deconstruct_array(): the datatype to be
deconstructed varies by column and by the datatype of the expression
definition, and that datatype could itself be an array, which
deconstruct_array() would then try to deconstruct further...there is no
way to tell deconstruct_array() to stop two levels deep.
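
To see why, consider a hypothetical statistics object whose expression is
itself of an array type (table and object names invented):

    CREATE TABLE events (tags text[], payload jsonb);
    CREATE STATISTICS events_stats
        ON (array_remove(tags, 'noise')), tags
        FROM events;
    -- after ANALYZE, each pg_statistic record in stxdexpr carries
    -- ANYARRAY fields (stavalues1..stavalues5) whose element type can
    -- itself be an array type, leaving deconstruct_array() no marker
    -- for where the statistics structure ends and the data begins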

This problem was solved for pg_restore_attribute_stats() by having pg_dump
export the ANYARRAY values CAST-ed to type "text" rather than "text[]",
which allowed each type of statistics to be decomposed according to its
own rules. That worked fine when each statistic type became a parameter in
pg_restore_attribute_stats(). But now we've got all of those types, and
we're getting them multiple times (once per expression), so that method
doesn't quite scale.
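
For reference, this is roughly the call shape pg_dump emits for attribute
stats in v18 (argument list abbreviated, table name invented); each
ANYARRAY-backed field travels as plain text and is decoded against the
column's real type inside the function:

    SELECT pg_catalog.pg_restore_attribute_stats(
        'relation', 'public.t'::regclass,
        'attname', 'a'::name,
        'inherited', false,
        'null_frac', 0.0::real,
        'avg_width', 4,
        'n_distinct', -1.0::real,
        'histogram_bounds', '{1,25,50,75,100}'::text
    );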

I've considered several ways around this issue:

1. Defining a strict order for the statistics types, following the order
in which they appear in pg_stats_ext (null_frac, avg_width, n_distinct,
most_common_elems, ...), with the exprs from pg_stats_ext_exprs in last
place. Each value is CAST-ed to "text", which means we can deconstruct
the values in a fashion very similar to how we did for attribute stats.
Those text values are put into an array in the strict order, and those
arrays are aggregated into a 2-D array (see the sketch after the pro/con
list).

Pros:
- This method is implemented and it works, and the code for it reuses
tools and coding patterns we've already incorporated (deconstruct_array,
safe input functions, arglist arrays). Patch attached.

Cons:
- The ordering is completely opaque. Documenting it might help a little,
but there's nothing intuitive about it, and checking it has been an
irritant to author and reviewer alike.
- This opaque ordering must be kept in sync between
pg_restore_extended_stats and pg_dump, or else statistical garbage will
result.
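
A hedged sketch of what an option-1 payload might look like for an object
with two expressions (values and exact column order invented for
illustration):

    SELECT ARRAY[
        -- expr 1: null_frac, avg_width, n_distinct, most_common_vals, ...
        ARRAY['0',    '4', '-0.5', '{abc,def}', '{0.6,0.3}'],
        -- expr 2: same positions, same meanings, nothing labeled
        ARRAY['0.25', '8', '12',   NULL,        NULL]
    ]::text[];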

2. Defining a record type specifically for this purpose (a sketch follows
the pro/con list).

Pros:
- It could be decomposed via the standard composite input function, and
then each field deconstructed on its own terms.

Cons:
- It's type-clutter, and a type that is likely only used during upgrades.
- When new stats types are introduced, the structure would also have to
change, breaking typecasts of composite values from older versions. This
alone makes the option unpalatable to most reviewers, and I can't advocate
for it.
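
For illustration only, such a type might look like this (name and field
list invented; a real one would presumably mirror the columns of
pg_stats_ext_exprs):

    CREATE TYPE pg_expression_stats AS (
        null_frac         real,
        avg_width         integer,
        n_distinct        real,
        most_common_vals  text,   -- ANYARRAY flattened to text, as in #1
        most_common_freqs real[]
    );
    -- any stats kind added in a later version means a new field here,
    -- breaking casts of composite literals emitted by an older pg_dump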

3. Keeping the 2-D text array from #1, but each "row" is a list of
kwargs-like pairs modeled on the arguments used in
pg_restore_attribute_stats (i.e.
ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...]);
see the sketch after the pro/con list.

Pros:
- Flexibility in ordering
- Clarity at a glance, provided that the reader has seen the kwargs
convention of the pg_restore_*_stats functions.
- Still uses existing tooling like #1, and not much more overhead.
- The name-value pairing problem has the same solution as the arg-pairing
that the function already does.

Cons:
- Adds the overhead of storing the stat type names and the key-value
pairing.
- The 2-D nature of the array requires that the number of elements per
row be fixed, so we couldn't leave a stat type out of one row unless we
left it out of every other row as well.
- Adds the possibility of duplicate names.
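
A hedged sketch of an option-3 payload for the same two-expression object
as in #1 (values invented):

    SELECT ARRAY[
        ARRAY['null_frac','0',    'avg_width','4', 'n_distinct','-0.5'],
        ARRAY['null_frac','0.25', 'avg_width','8', 'n_distinct','12']
    ]::text[];
    -- rows must stay rectangular, hence the fixed-element-count con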

4. JSON. The outer structure would be an array of objects, each object a
set of key-value pairs (see the sketch after the pro/con list).

Pros:
- Flexibility in ordering
- Clarity at a glance in a format well understood even without prior
knowledge of our kwargs convention
- We have already implemented similar things for the new formats of
pg_ndistinct and pg_dependencies.
- This method currently has the interest of Michael Paquier, the committer
of all the v19 work to date.

Cons:
- Requires implementing a state engine to parse the JSON, check for
missing values, and resolve duplicates. We do that for pg_dependencies,
and it takes 800-ish lines of code to handle 3 key names; this one would
have 10.
- The escaping of values in a composite record CAST-ed to text and then
further encoded as a JSON value would be extremely unreadable, and likely
quite bloated (see the sketch below).
- Using JSON for stats serialization met with immediate strong opposition
from several reviewers. That resistance may not materialize for this
vastly reduced scope, especially in light of the new JSON-compatible
formats for pg_ndistinct and pg_dependencies, but it does give me pause.
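
A hedged sketch of an option-4 payload (values invented); note how a
text-flattened ANYARRAY picks up a second layer of escaping once it
becomes a JSON string:

    [
      {"null_frac": "0", "avg_width": "4", "n_distinct": "-0.5",
       "most_common_vals": "{\"O''Brien\",\"some \\\"quoted\\\" value\"}"},
      {"null_frac": "0.25", "avg_width": "8", "n_distinct": "12"}
    ]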

And...that's the major decision point. If we solve that, the rest is far
less controversial. My apologies that this summary itself needs a summary.

Thanks for reading. I'm eager to hear perspectives on the serialization
methods proposed, or suggestions of other methods.

[1] https://commitfest.postgresql.org/patch/5517/
[2] https://www.postgresql.org/message-id/flat/aTE4AL7U0dp1Jjrx%40paquier.xyz#72116daf9d37828a47ce477a852a78d3
[3] https://commitfest.postgresql.org/patch/4538/
[4] https://www.postgresql.org/message-id/flat/CADkLM%3DcB0rF3p_FuWRTMSV0983ihTRpsH%2BOCpNyiqE7Wk0vUWA%40mail.gmail.com
[5] https://commitfest.postgresql.org/patch/5523/
[6] The issue of custom statistic kinds like those found in PostGIS would
still remain.

Attachment Content-Type Size
v1-0001-Add-support-for-exprs-in-pg_restore_extended_stat.patch text/x-patch 51.8 KB
