Re: Statistics Import and Export

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Statistics Import and Export
Date: 2024-03-25 23:16:48
Message-ID: bf724b21-914a-4497-84e3-49944f9776f6@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/25/24 09:27, Corey Huinker wrote:
> On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
> wrote:
>
>> v12 attached.
>>
>>
> v13 attached. All the same features as v12, but with a lot more type
> checking, bounds checking, value inspection, etc. Perhaps the most notable
> feature is that we're now ensuring that histogram values are in ascending
> order. This could come in handy for detecting when we are applying stats to
> a column of the wrong type, or the right type but with a different
> collation. It's not a guarantee of validity, of course, but it would detect
> egregious changes in sort order.
>

Hi,

I did take a closer look at v13 today. I have a bunch of comments and
some minor whitespace fixes in the attached review patches.

0001
----

1) The docs say this:

<para>
The purpose of this function is to apply statistics values in an
upgrade situation that are "good enough" for system operation until
they are replaced by the next <command>ANALYZE</command>, usually via
<command>autovacuum</command> This function is used by
<command>pg_upgrade</command> and <command>pg_restore</command> to
convey the statistics from the old system version into the new one.
</para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

2) Also, I'm not sure about this:

<parameter>relation</parameter>, the parameters in this are all
derived from <structname>pg_stats</structname>, and the values
given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

5) The comment for param_names in pg_set_attribute_stats says this:

/* names of columns that cannot be null */
const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

6) There's a couple minor whitespace fixes or comments etc.

0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

2) I think this condition in dumpTableSchema() is actually incorrect:

if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
(tbinfo->relkind == RELKIND_VIEW))
dumpRelationStats(fout, &tbinfo->dobj, reltypename,

Aren't indexes pretty much exactly the thing for which we don't want to
dump statistics? In fact this skips dumping statistics for table - if
you dump a database with a single table (-Fc), pg_restore -l will tell
you this:

217; 1259 16385 TABLE public t user
3403; 0 16385 TABLE DATA public t user

Which is not surprising, because table is not a view. With an expression
index you get this:

217; 1259 16385 TABLE public t user
3404; 0 16385 TABLE DATA public t user
3258; 1259 16418 INDEX public t_expr_idx user
3411; 0 0 STATS IMPORT public INDEX t_expr_idx

Unfortunately, fixing the condition does not work:

$ pg_dump -Fc test > test.dump
pg_dump: warning: archive items not in correct section order

This happens for a very simple reason - the statistics are marked as
SECTION_POST_DATA, which for the index works, because indexes are in
post-data section. But the table stats are dumped right after data,
still in the "data" section.

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".

regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Create-pg_set_relation_stats-pg_set_attribute_stats.patch text/x-patch 119.9 KB
0002-review.patch text/x-patch 4.1 KB
0003-Enable-dumping-of-table-index-stats-in-pg_dump.patch text/x-patch 21.3 KB
0004-review.patch text/x-patch 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2024-03-25 23:21:40 Re: WIP Incremental JSON Parser
Previous Message Tom Lane 2024-03-25 23:15:52 Re: Regression tests fail with musl libc because libpq.so can't be loaded