Re: Statistics Import and Export

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Statistics Import and Export
Date: 2024-03-25 10:08:32
Message-ID: CAExHW5sCAGbeD66vqR1kgVvEbOnmGiLGjSYYyqgAetposViJtw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Corey,

On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
wrote:

> v12 attached.
>
> 0001 -
>
>
Some random comments

+SELECT
+ format('SELECT pg_catalog.pg_set_attribute_stats( '
+ || 'relation => %L::regclass::oid, attname => %L::name, '
+ || 'inherited => %L::boolean, null_frac => %L::real, '
+ || 'avg_width => %L::integer, n_distinct => %L::real, '
+ || 'most_common_vals => %L::text, '
+ || 'most_common_freqs => %L::real[], '
+ || 'histogram_bounds => %L::text, '
+ || 'correlation => %L::real, '
+ || 'most_common_elems => %L::text, '
+ || 'most_common_elem_freqs => %L::real[], '
+ || 'elem_count_histogram => %L::real[], '
+ || 'range_length_histogram => %L::text, '
+ || 'range_empty_frac => %L::real, '
+ || 'range_bounds_histogram => %L::text) ',
+ 'stats_export_import.' || s.tablename || '_clone', s.attname,
+ s.inherited, s.null_frac,
+ s.avg_width, s.n_distinct,
+ s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+ s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+ s.elem_count_histogram, s.range_length_histogram,
+ s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute
the function directly? That would also avoid ECHO magic.

+ <table id="functions-admin-statsimport">
+ <title>Database Object Statistics Import Functions</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ Function
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>

COMMENT: The functions throw many validation errors. Do we want to list the
acceptable/unacceptable input values in the documentation corresponding to
those? I don't expect one line per argument validation. Something like
"these, these and these arguments can not be NULL" or "both arguments in
each of the pairs x and y, a and b, and c and d should be non-NULL or NULL
respectively".

> The functions pg_set_relation_stats() and pg_set_attribute_stats() now
> return void. There just weren't enough conditions where a condition was
> considered recoverable to justify having it. This may mean that combining
> multiple pg_set_attribute_stats calls into one compound statement may no
> longer be desirable, but that's just one of the places where I'd like
> feedback on how pg_dump/pg_restore use these functions.
>
>
> 0002 -
>
> This patch concerns invoking the functions in 0001 via
> pg_restore/pg_upgrade. Little has changed here. Dumping statistics is
> currently the default for pg_dump/pg_restore/pg_upgrade, and can be
> switched off with the switch --no-statistics. Some have expressed concern
> about whether stats dumping should be the default. I have a slight
> preference for making it the default, for the following reasons:
>
>
+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+ tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are uselss. We should avoide
dumping
statistics in such a case. E.g. when only schema is dumped what good is
statistics? Similarly the statistics on a partitioned table may not be
useful
if some its partitions are not dumped. Said that dumping statistics on
foreign
table makes sense since they do not contain data but the statistics still
makes sense.

>
> Key areas where I'm seeking feedback:
>
> - What level of errors in a restore will a user tolerate, and what should
> be done to the error messages to indicate that the data itself is fine, but
> a manual operation to update stats on that particular table is now
> warranted?
> - To what degree could pg_restore/pg_upgrade take that recovery action
> automatically?
> - Should the individual attribute/class set function calls be grouped by
> relation, so that they all succeed/fail together, or should they be called
> separately, each able to succeed or fail on their own?
> - Any other concerns about how to best use these new functions.
>
>
>
Whether or not I pass --no-statistics, there is no difference in the dump
output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump
related tests will have statistics output by default. That's good since the
functionality will always be tested. 1. We need additional tests to ensure
that the statistics is installed after restore. 2. Some of those tests
compare dumps before and after restore. In case the statistics is changed
because of auto-analyze happening post-restore, these tests will fail.

I believe, in order to import statistics through IMPORT FOREIGN SCHEMA,
postgresImportForeignSchema() will need to add SELECT commands invoking
pg_set_relation_stats() on each imported table and pg_set_attribute_stats()
on each of its attribute. Am I right? Do we want to make that happen in the
first cut of the feature? How do you expect these functions to be used to
update statistics of foreign tables?

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2024-03-25 10:10:00 Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
Previous Message Bharath Rupireddy 2024-03-25 10:01:15 Re: Introduce XID age and inactive timeout based replication slot invalidation