Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Statistics Import and Export
Date: 2024-02-13 05:07:26
Message-ID: CADkLM=doQW3TE_LyBwG10NppKZLz0CjAgu10O79uMQc2frYCmA@mail.gmail.com

>
> Also, it says "statistics are replaced" but it's not quite clear whether
> that applies only to matching statistics or if all stats are deleted
> first and then the new stuff is inserted. (FWIW remove_pg_statistics
> clearly deletes all pre-existing stats).
>

All are now deleted first, both in the pg_statistic and
pg_statistic_ext_data tables. The previous version took a more
"replace it if we find a new value" approach, but that was overly
complicated, so following the example set by extended statistics seemed
best.
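
In SQL terms, the new behavior amounts to roughly this (a conceptual
sketch only - the patch does this in C, and the object names are
placeholders):

-- clear any existing stats for the target objects before inserting
-- the imported rows
DELETE FROM pg_statistic
WHERE starelid = 'public.foo'::regclass;

DELETE FROM pg_statistic_ext_data
WHERE stxoid = (SELECT oid
                FROM pg_statistic_ext
                WHERE stxname = 'foo_stx'
                  AND stxnamespace = 'public'::regnamespace);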

> - import_pg_statistics: I somewhat dislike that we're passing arguments
> as datum[] array - it's hard to say what the elements are expected to
> be, etc. Maybe we should expand this, to make it clear. How do we even
> know the array is large enough?
>

Completely fair. Initially that was done with the expectation that the
array would be the same for both regular stats and extended stats, but that
turned out not to be the case.

> - I don't quite understand why we need examine_rel_attribute. It sets a
> lot of fields in the VacAttrStats struct, but then we only use attrtypid
> and attrtypmod from it - so why bother and not simply load just these
> two fields? Or maybe I miss something.
>

I think you're right, we don't need it anymore for regular statistics. We
still need it for extended stats because statext_store() takes a subset of
the VacAttrStats rows as input.

Which leads to a side issue. We currently have 3 functions:
examine_rel_attribute and the two varieties of examine_attribute (one in
analyze.c and the other in extended stats). These are highly similar
but just different enough that I didn't feel comfortable refactoring them
into a one-size-fits-all function, and I was particularly reluctant to
modify existing code for the ANALYZE path.

>
> - examine_rel_attribute can return NULL, but get_attrinfo does not check
> for NULL and just dereferences the pointer. Surely that can lead to
> segfaults?
>

Good catch, and it highlights how little we need VacAttrStats for regular
statistics.

>
> - validate_no_duplicates and the other validate functions would deserve
> a better docs, explaining what exactly is checked (it took me a while to
> realize we check just for duplicates), what the parameters do etc.
>

Those functions are in a fairly formative phase - I expect a conversation
about what sort of validations we want to do to ensure that the statistics
being imported make sense, and under what circumstances we would forgo
some of those checks.

>
> - Do we want to make the validate_ functions part of the public API? I
> realize we want to use them from multiple places (regular and extended
> stats), but maybe it'd be better to have an "internal" header file, just
> like we have extended_stats_internal?
>

I see no need to have them be a part of the public API. Will move.

>
> - I'm not sure we do "\set debug f" elsewhere. It took me a while to
> realize why the query outputs are empty ...
>

That was an experiment that arose out of the difficulty of determining
_where_ a difference was when the set-difference checks failed. So far I
like it, and I'm hoping it catches on.
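
For anyone who hasn't seen the pattern, it looks roughly like this (a
sketch with hypothetical table names):

\set debug f

-- the check itself: zero rows means the round-tripped stats match
SELECT * FROM exported_stats
EXCEPT
SELECT * FROM imported_stats;

-- flip debug to t to dump both sides when the check fails
\if :debug
    SELECT * FROM exported_stats;
    SELECT * FROM imported_stats;
\endif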

>
>
> 0002
>
> - I'd rename create_stat_ext_entry to statext_create_entry.
>
> - Do we even want to include OIDs from the source server? Why not to
> just have object names and resolve those? Seems safer - if the target
> server has the OID allocated to a different object, that could lead to
> confusing / hard to detect issues.
>

The import functions would obviously never use the imported oids to look up
objects on the destination system. Rather, they're there to verify that the
local object oid matches the exported object oid, which will be true in the
case of a binary upgrade.

The export format is an attempt to export the pg_statistic[_ext_data] for
that object as-is, and, as Tom suggested, let the import function do the
transformations. We can of course remove them if they truly have no purpose
for validation.
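
Conceptually the verification is nothing more than this (a sketch - the
'oid' key is an assumption about the export format):

-- does the oid recorded in the export match the local object?
SELECT ('<json blob>'::json->>'oid')::oid = 'public.foo'::regclass::oid;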

>
> - What happens if we import statistics which includes data for extended
> statistics object which does not exist on the target machine?
>

The import function takes the oid of the target object (relation or extstat
object), and the json payload is supposed to contain the stats for ONE
corresponding object. Multiple objects' worth of data really don't fit into
the json format, and statistics exported for an object that does not exist
on the destination system could never be meaningfully invoked. I envision
the dump file looking like this:

CREATE TABLE public.foo (....);

SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>,
    option flag, option flag);

So a call against a nonexistent object would fail on the regclass cast.
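
For example (everything past the regclass argument is a placeholder):

SELECT pg_import_rel_stats('public.no_such_table'::regclass, <json blob>,
    option flag, option flag);
ERROR:  relation "public.no_such_table" does not exist

The error fires while the cast is resolved, before the function is ever
called.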

>
> - pg_import_ext_stats seems to not use require_match_oids - bug?
>

I haven't yet seen a good way to make use of matching oids in extended
stats. Checking matching operator/collation oids would make sense, but
little else.

>
>
> 0003
>
> - no SGML docs for the new tools?
>

Correct. I foresee the export tool being folded into pg_dump, and the
import tool going away entirely, as psql could handle it.

>
> - The help() seems to be wrong / copied from "clusterdb" or something
> like that, right?
>

Correct, for the reason above.

> >
> > pg_import_rel_stats matches up local columns with the exported stats by
> > column name, not attnum. This allows for stats to be imported when
> columns
> > have been dropped, added, or reordered.
> >
>
> Makes sense. What will happen if we try to import data for extended
> statistics (or index) that does not exist on the target server?
>

One of the parameters to the function is the oid of the object that is the
target of the stats. The importer will not seek out objects with matching
names, and each JSON payload is limited to holding one object, though
clearly someone could wrap the existing format in an outer format that
carries a manifest of objects to import.
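
E.g. something like this hypothetical wrapper, assuming a manifest array
whose elements each carry a 'relname' and a 'stats' payload:

-- import stats for every object listed in a hypothetical manifest
SELECT pg_import_rel_stats((e.value->>'relname')::regclass,
                           e.value->'stats', option flag, option flag)
FROM json_array_elements('<manifest json>'::json) AS e;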

>
> > pg_import_ext_stats can also handle column reordering, though it
> currently
> > would get confused by changes in expressions that maintain the same
> result
> > data type. I'm not yet brave enough to handle importing nodetrees, nor
> do I
> > think it's wise to try. I think we'd be better off validating that the
> > destination extended stats object is identical in structure, and to fail
> > the import of that one object if it isn't perfect.
> >
>
> Yeah, column reordering is something we probably need to handle. The
> stats order them by attnum, so if we want to allow import on a system
> where the attributes were dropped/created in a different way, this is
> necessary. I haven't tested this - is there a regression test for this?
>

The overlong transformation SQL starts with the object to be imported
(whose local oid was specified) and then:

1. grabs all the attributes (or exprs, for extended stats) of that object,
2. looks in the exported json for a column/expr with a matching name
(sketched below),
3. takes the exported attnum of that attribute for use in things like
stdexprs, and
4. looks up the type, collation, and operators for the exported attribute.

So we can get a situation where there are no importable stats for an
attribute of the destination table, in which case we import nothing for
that column. Stats for exported columns with no matching local column would
never be referenced.

Yes, there should be a test of this.
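
Step 2 is essentially a name join against pg_attribute. A simplified
sketch (the 'columns' and 'attname' keys are assumptions about the export
format):

-- match local attributes to exported column stats by name
SELECT a.attnum AS local_attnum,
       (c.value->>'attnum')::int AS exported_attnum,
       a.atttypid, a.atttypmod
FROM pg_attribute a
JOIN json_array_elements('<json blob>'::json->'columns') AS c
  ON c.value->>'attname' = a.attname
WHERE a.attrelid = 'public.foo'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped;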

> I agree expressions are hard. I don't think it's feasible to import
> nodetree from other server versions, but why don't we simply deparse the
> expression on the source, and either parse it on the target (and then
> compare the two nodetrees), or deparse the target too and compare the
> two deparsed expressions? I suspect the deparsing may produce slightly
> different results on the two versions (causing false mismatches), but
> perhaps the deparse on source + parse on target + compare nodetrees
> would work? Haven't tried, though.
>
> > Export formats go back to v10.
> >
>
> Do we even want/need to go beyond 12? All earlier versions are EOL.
>

True, but we had pg_dump and pg_restore support back to 7.x until fairly
recently, and a major friction point in getting customers to upgrade their
instances off of unsupported versions is the downtime caused by an upgrade.
Why wouldn't we make it easier for them?
