Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Statistics Import and Export
Date: 2024-03-23 01:51:01
Message-ID: CADkLM=fFwkd7VbO4SPCyX-2f-fyGraka5J+5fcp1QMrCefu68Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

v12 attached.

0001 -

The functions pg_set_relation_stats() and pg_set_attribute_stats() now
return void. There just weren't enough conditions where a condition was
considered recoverable to justify having it. This may mean that combining
multiple pg_set_attribute_stats calls into one compound statement may no
longer be desirable, but that's just one of the places where I'd like
feedback on how pg_dump/pg_restore use these functions.

The function pg_set_attribute_stats() now has NULL defaults for all
stakind-based statistics types. Thus, you can set statistics on a more
terse basis, like so:

SELECT pg_catalog.pg_set_attribute_stats(
relation => 'stats_export_import.test'::regclass,
attname => 'id'::name,
inherited => false::boolean,
null_frac => 0.5::real,
avg_width => 2::integer,
n_distinct => -0.1::real,
most_common_vals => '{2,1,3}'::text,
most_common_freqs => '{0.3,0.25,0.05}'::real[]
);

This would generate a pg_statistic row with exactly one stakind in it, and
replaces whatever statistics previously existed for that attribute.

It now checks for many types of data inconsistencies, and most (35) of
those have test coverage in the regression. There's a few areas still
uncovered, mostly surrounding histograms where the datatype is dependent on
the attribute.

The functions both require that the caller be the owner of the table/index.

The function pg_set_relation_stats is largely unchanged from previous
versions.

Key areas where I'm seeking feedback:

- What additional checks can be made to ensure validity of statistics?
- What additional regression tests would be desirable?
- What extra information can we add to the error messages to give the user
an idea of how to fix the error?
- What are some edge cases we should test concerning putting bad stats in a
table to get an undesirable outcome?

0002 -

This patch concerns invoking the functions in 0001 via
pg_restore/pg_upgrade. Little has changed here. Dumping statistics is
currently the default for pg_dump/pg_restore/pg_upgrade, and can be
switched off with the switch --no-statistics. Some have expressed concern
about whether stats dumping should be the default. I have a slight
preference for making it the default, for the following reasons:

- The existing commandline switches are all --no-something based, and this
follows the pattern.
- Complaints about poor performance post-upgrade are often the result of
the user not knowing about vacuumdb --analyze-in-stages or the need to
manually ANALYZE. If they don't know about that, how can we expect them to
know about about new switches in pg_upgrade?
- The failure condition means that the user has a table with no stats in it
(or possibly partial stats, if we change how we make the calls), which is
exactly where they were before they made the call.
- Any performance regressions will be remedied with the next autovacuum or
manual ANALYZE.
- If we had a positive flag (e.g. --with-statistics or just --statistics),
and we then changed the default, that could be considered a POLA violation.

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should
be done to the error messages to indicate that the data itself is fine, but
a manual operation to update stats on that particular table is now
warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action
automatically?
- Should the individual attribute/class set function calls be grouped by
relation, so that they all succeed/fail together, or should they be called
separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.

Attachment Content-Type Size
v12-0001-Create-pg_set_relation_stats-pg_set_attribute_st.patch text/x-patch 105.8 KB
v12-0002-Enable-dumping-of-table-index-stats-in-pg_dump.patch text/x-patch 21.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-03-23 02:17:26 Re: sublink [exists (select xxx group by grouping sets ())] causes an assertion error
Previous Message Jeff Davis 2024-03-23 01:22:21 Re: sublink [exists (select xxx group by grouping sets ())] causes an assertion error