Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Extended Statistics set/restore/clear functions.
Date: 2025-11-18 02:32:37
Message-ID: CADkLM=d7fu9k03CD60eHciF3bUbd1-ANSD8VjVxsjMyAL1HVGQ@mail.gmail.com
Lists: pgsql-hackers

>
>
> This feels like a different, prettier but still compressed output for the json.
> I don't think we should change the output functions to do that, but if
> you want to add a function that filters these contents a bit in the
> tests for the input functions, sure, why not.
>

+1. I'll probably just use the replaces rather than define a function, which
could give someone the false impression that the function exists elsewhere.

>
> Yes, this one is to reduce the translation work, and because the
> messages are quite the same across the board and deal with the same
> requirements:
> - Single integer expected after a key (attnum or actual value).
> - Array of attributes expected after a key.
> - For the degree key, a float value.
>
> > I had a feeling that was going to be requested. My question would be
> > whether we want to stick to modeling the other combinations after the
> > first-longest combination, the last-longest, or if we want to defer those
> > checks altogether until we have to validate against an actual stats
> > object?
>
> I would tend to think that performing one round of validation once the
> whole set of objects has been parsed is going to be cheaper than
> periodic checks.
>
> One other thing would be to force a sort of the elements in the array
> to match with the order these are generated when creating the stats.
> We cannot do that in the input functions because we have no idea about
> the order of the attributes in the statistics object yet. Applying a
> sort also sounds important to me, to make sure that we order the stats
> based on what the group generation functions (aka
> generate_combinations(), etc.) think on the matter, which would
> enforce stronger binary compatibility once we are sure that the input
> function has been given a full set of attributes in an array, of
> course. I have briefly looked at the planner code where extended
> stats are used, like selfuncs.c, and the ordering does not completely
> matter, it seems, but it's cheap enough to enforce a stricter ordering
> based on the K groups of N elements generated in the import function.
>
> >> Except for this argument, the input of pg_ndistinct feels OK in terms
> >> of the guarantees that we'd want to enforce on an import. The same
> >> argument applies in terms of attribute number guarantees for
> >> pg_dependencies, based on DependencyGenerator_init() & friends in
> >> dependencies.c. Could you look at that?
> >
> > Yes. I had already looked at it to verify that _all_ combinations were
> > always generated (they are), because I had some vague memory of the
> > generator dropping combinations that were statistically insignificant. In
> > retrospect, I have no idea where I got that idea.
>
> Hmm. I would need to double-check the code to be sure, but I don't
> think that we drop combinations, because the code prevents duplicates
> to begin with, even for expressions:
> create table aa (a int, b int);
> create statistics stats (ndistinct) ON a, a, b, b from aa;
> ERROR: 42701: duplicate column name in statistics definition
> create statistics stats (ndistinct) ON (a + a), ((a + a)) from aa;
> ERROR: 42701: duplicate expression in statistics definition
>

So I looked at the generator functions, hoping they'd have enough in common
that they could be made generic. And they're just different enough that I
think it's not worth it to try.

But, if we don't care about the order of the combinations, I also don't
think we need to expose the functions at all. We know exactly how many
combinations there should be for any N attributes, since each attribute
must be unique. So if we have the right number of unique combinations, and
they're all subsets of the first-longest, then we must have a complete set.
Thoughts on that?
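
To make that concrete, here is a rough standalone sketch of the check I
have in mind (plain C, with attnum groups as bitmasks instead of the
Bitmapsets the real code would use, so the names and representation are
only illustrative). For N attributes the ndistinct items are every group
of 2..N attributes, i.e. 2^N - N - 1 of them, so counting distinct subsets
of the full group is enough:

#include <stdbool.h>
#include <stdint.h>

/*
 * Rough sketch only, not the patch's actual code: check that the parsed
 * pg_ndistinct items form a complete set of attribute groups.  Each item
 * is a bitmask of attnums and "full_mask" is the group containing all N
 * attributes.  If there are 2^N - N - 1 distinct groups, each a subset
 * of full_mask with at least two members, the set must be complete.
 */
static int
popcount32(uint32_t x)
{
    int n = 0;

    while (x)
    {
        n += x & 1;
        x >>= 1;
    }
    return n;
}

static bool
ndistinct_groups_complete(const uint32_t *items, int nitems,
                          uint32_t full_mask, int nattrs)
{
    int expected = (1 << nattrs) - nattrs - 1;

    if (nitems != expected)
        return false;

    for (int i = 0; i < nitems; i++)
    {
        /* must be a subset of the full group, with at least two members */
        if ((items[i] & ~full_mask) != 0 || popcount32(items[i]) < 2)
            return false;

        /* and must not duplicate an earlier group */
        for (int j = 0; j < i; j++)
        {
            if (items[j] == items[i])
                return false;
        }
    }
    return true;
}

The nice part is that no particular ordering has to be assumed; uniqueness
plus the expected count does all the work.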

Getting _too_ tight with the ordering and contents makes me concerned for
the day when the format might change. We don't want to _fail_ an upgrade
because some of the combinations were in the wrong order.

> These don't make sense anyway because they have a predictable and
> perfectly matching correlation relationship.
>

They do, for now, but are we willing to lock ourselves into that forever?

>
> > This is fairly simple to do. The dependency attnum is just appended to
> > the list of attnums, and the combinations are generated the same as
> > ndistinct, though obviously there are no single elements.
>
> Yeah. That should not be bad, I hope.
>
> > There's probably some common code between the lists to be shared,
> > differing only in how they report missing combinations.
>
> I would like to agree on that, but it did not look that obvious to me
> yesterday. If you think that something could be refactored, I'd
> suggest a refactoring patch that applies on top of the rest of the
> patch set, with new generic facilities in stat_util.c, or even a
> new separate file, if that leads to a cleaner result (okay, a
> definition of "clean" is up to one's taste).
>

Looking over those functions, they both could have used the same generator,
but the dependencies side decided that dependency order doesn't matter.
That puts doubt in my head that the order is perfectly the same for both,
so we'd better follow each one individually IF we want to enforce order.
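
For reference, this is roughly the ordering contract we would be pinning
down if we did enforce it. Below is a minimal standalone sketch of a
lexicographic k-combination generator; it is not the actual
generate_combinations() or DependencyGenerator_init() code, just an
illustration of the kind of sequence each stats kind would have to keep
matching:

/*
 * Illustrative only: emit every k-element combination of the attnums
 * 0..n-1 in lexicographic order.  Enforcing import order would mean
 * baking in a sequence like this one, separately for each stats kind,
 * since each generator has its own notion of order.
 */
static void
emit_combinations_recurse(int n, int k, int depth, int start, int *combo,
                          void (*emit) (const int *combo, int k))
{
    if (depth == k)
    {
        emit(combo, k);
        return;
    }

    for (int i = start; i <= n - (k - depth); i++)
    {
        combo[depth] = i;
        emit_combinations_recurse(n, k, depth + 1, i + 1, combo, emit);
    }
}

static void
emit_combinations(int n, int k, void (*emit) (const int *combo, int k))
{
    int combo[8];               /* extended stats allow at most 8 columns */

    emit_combinations_recurse(n, k, 0, 0, combo, emit);

    /* e.g. n = 3, k = 2 emits {0,1}, {0,2}, {1,2}, in that order */
}

Even at this size, any change to the recursion order silently changes the
"canonical" sequence, which is exactly the upgrade hazard I mentioned above.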
