Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Statistics Import and Export
Date: 2024-03-21 07:27:47
Message-ID: CADkLM=eJXPQD=pWwwm+4Y48haiMmznB8dCECNMrwr59niY8ESw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 21, 2024 at 2:29 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> > v11 attached.
>
> Thank you.
>
> Comments on 0001:
>
> This test:
>
> +SELECT
> + format('SELECT pg_catalog.pg_set_attribute_stats( '
> ...
>
> seems misplaced. It's generating SQL that can be used to restore or
> copy the stats -- that seems like the job of pg_dump, and shouldn't be
> tested within the plain SQL regression tests.
>

Fair enough.

>
> And can the other tests use pg_stats rather than pg_statistic?
>

They can, but part of what I wanted to show was that the values that aren't
directly passed in as parameters (staopN, stacollN) get set to the correct
values, and those values aren't guaranteed to match across databases, hence
testing them in the regression test rather than in a TAP test. I'd still
like to be able to test that.

>
> The function signature for pg_set_attribute_stats could be more
> friendly -- how about there are a few required parameters, and then it
> only sets the stats that are provided and the other ones are either
> left to the existing value or get some reasonable default?
>

That would be problematic.

1. We'd have to compare the stats provided against the stats that are
already there, make that list in-memory, and then re-order what remains
2. There would be no way to un-set statistics of a given stakind, unless we
added an "actually set it null" boolean for each parameter that can be
null.
3. I tried that with the JSON formats, it made the code even messier than
it already was.

Make sure all error paths ReleaseSysCache().
>

+1

>
> Why are you calling checkCanModifyRelation() twice?
>

Once for the relation itself, and once for pg_statistic.

> I'm confused about when the function should return false and when it
> should throw an error. I'm inclined to think the return type should be
> void and all failures should be reported as ERROR.
>

I go back and forth on that. I can see making it void and returning an
error for everything that we currently return false for, but if we do that,
then a statement with one pg_set_relation_stats, and N
pg_set_attribute_stats (which we lump together in one command for the
locking benefits and atomic transaction) would fail entirely if one of the
set_attributes named a column that we had dropped. It's up for debate
whether that's the right behavior or not.

replaces[] is initialized to {true}, which means only the first element
> is initialized to true. Try following the pattern in AlterDatabase (or
> similar) which reads the catalog tuple first, then updates a few fields
> selectively, setting the corresponding element of replaces[] along the
> way.
>

+1.

>
> The test also sets the most_common_freqs in an ascending order, which
> is weird.
>

I pulled most of the hardcoded values from pg_stats itself. The sample set
is trivially small, and the values inserted were in-order-ish. So maybe
that's why.

> Relatedly, I got worried recently about the idea of plain users
> updating statistics. In theory, that should be fine, and the planner
> should be robust to whatever pg_statistic contains; but in practice
> there's some risk of mischief there until everyone understands that the
> contents of pg_stats should not be trusted. Fortunately I didn't find
> any planner crashes or even errors after a brief test.
>

Maybe we could have the functions restricted to a role or roles:

1. pg_write_all_stats (can modify stats on ANY table)
2. pg_write_own_stats (can modify stats on tables owned by user)

I'm iffy on the need for the first one, I list it first purely to show how
I derived the name for the second.

> One thing we can do is some extra validation for consistency, like
> checking that the arrays are properly sorted, check for negative
> numbers in the wrong place, or fractions larger than 1.0, etc.
>

+1. All suggestions of validation checks welcome.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-03-21 07:35:08 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Bertrand Drouvot 2024-03-21 07:10:16 Re: Introduce XID age and inactive timeout based replication slot invalidation