Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Extended Statistics set/restore/clear functions.
Date: 2025-11-07 22:28:50
Message-ID: CADkLM=dWQ3r48eAP8NggLqe90_16JKbit9iu9AtuUrZ8+A=qBA@mail.gmail.com
Lists: pgsql-hackers

>
>
> Patch 0001 for ndistinct was missing a documentation update, we have
> one query in perform.sgml that looks at stxdndistinct. Patch 0003 is
> looking OK here as well.
>

Well spotted.

> For dependencies, the format switches from a single json object
> with key/vals like that:
> "3 => 4": 1.000000
> To a JSON array made of elements like that:
> {"degree": 1.000000, "attributes": [3],"dependency": 4},
>
> For ndistincts, we move from a JSON blob with key/vals like that:
> "3, 4": 11
> To a JSON array made of the following elements:
> {"ndistinct": 11, "attributes": [3,4]}
>
> Using a keyword within each element would force a stronger validation
> when these get imported back, which is a good thing. I like that.
>
> Before going in-depth into the input functions to cross-check the
> amount of validation we should do, do folks have any comments about the
> proposed format? That's the key point this patch set depends on, and
> I'd rather not spend more time on the whole thing if somebody would like
> a different format. This is the format that Tomas has mentioned at
> the top of the thread. Note: as noted upthread, pg_dump would be in
> charge of transferring the data of the old format to the new format at
> the end.
>
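
For reference, the format change described above can be sketched roughly as follows. This is only an illustration in Python, not the pg_dump code or anything from the patch set; the function names convert_ndistinct() and convert_dependencies() are made up for the example, and it assumes the old output parses as ordinary JSON with keys shaped like "3, 4" and "3 => 4".

```python
import json


def convert_ndistinct(old: str) -> list[dict]:
    """Old object '{"3, 4": 11}' -> new array of keyed elements.

    Hypothetical helper: assumes every key is a comma-separated attnum list.
    """
    return [
        {"ndistinct": value, "attributes": [int(a) for a in key.split(",")]}
        for key, value in json.loads(old).items()
    ]


def convert_dependencies(old: str) -> list[dict]:
    """Old object '{"3 => 4": 1.000000}' -> new array of keyed elements.

    Hypothetical helper: assumes every key is "<attnums> => <attnum>".
    """
    result = []
    for key, degree in json.loads(old).items():
        lhs, rhs = key.split("=>")
        result.append(
            {
                "degree": degree,
                "attributes": [int(a) for a in lhs.split(",")],
                "dependency": int(rhs),
            }
        )
    return result


if __name__ == "__main__":
    print(json.dumps(convert_ndistinct('{"3, 4": 11}')))
    print(json.dumps(convert_dependencies('{"3 => 4": 1.000000}')))
```

Because each new element names its fields, an importer can reject an element with a missing or unexpected key, which the old key/value form cannot express as directly.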

I'm open to other formats, but aside from renaming the json keys (maybe
"attnums" or "keys" instead of "attributes"?), I'm not sure what really
could be done and still be JSON. I suppose we could go with a tuple format
like this:

'{({3,4},11),...}' for pg_ndistinct and
'{({3},4,1.00000),...}' for pg_dependencies.

Those would certainly be more compact, but they make for a hard read by
humans, and while the JSON code is big, it's also proven in other parts of
the codebase, hence less risky.
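
To make the readability trade-off concrete, here is a rough Python sketch that renders elements of the proposed JSON array into the compact tuple form. It is purely illustrative: the function names are invented for the example, nothing like this exists in the patch set, and the six-digit degree formatting is an assumption borrowed from the existing "1.000000" output.

```python
def _attrs(attributes):
    """Render an attnum list as a brace-delimited group, e.g. {3,4}."""
    return "{" + ",".join(str(a) for a in attributes) + "}"


def ndistinct_to_tuples(items):
    """[{"ndistinct": 11, "attributes": [3, 4]}] -> '{({3,4},11)}'."""
    parts = [f"({_attrs(i['attributes'])},{i['ndistinct']})" for i in items]
    return "{" + ",".join(parts) + "}"


def dependencies_to_tuples(items):
    """[{"degree": 1.0, "attributes": [3], "dependency": 4}] -> '{({3},4,1.000000)}'."""
    parts = [
        f"({_attrs(i['attributes'])},{i['dependency']},{i['degree']:.6f})"
        for i in items
    ]
    return "{" + ",".join(parts) + "}"


if __name__ == "__main__":
    print(ndistinct_to_tuples([{"ndistinct": 11, "attributes": [3, 4]}]))
    print(dependencies_to_tuples([{"degree": 1.0, "attributes": [3], "dependency": 4}]))
```

The tuple form drops the field names entirely, so a reader has to remember which position holds the dependency and which holds the degree.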

>
> While looking at 0002 and 0004 (which have a couple of issues
> actually), I have been wondering about moving into a new file the four
> data-type functions (in, out, send and receive) and the new input
> functions that rely on new JSON lexer and parser logic for both
> ndistinct and dependencies. The new set of headers added at the top
> of mvdistinct.c and dependencies.c for the new code points that a
> separation may be better in the long-term, because the new code relies
> on parts of the backend that the existing code does not care about,
> and these files become larger than the relation and attribute stats
> files. I would be tempted to name these new files pg_dependencies.c
> and pg_ndistinct.c, mapping with their catalog types. With this
> separation, it looks like the "core" parts in charge of the
> calculations with ndistinct and dependencies can be kept on their own.
> What do you think?
>

A part of me thinks that everything that remains after removing
in/out/send/recv is just taking a table sample data structure and crunching
numbers to come up with the deserialized data structure...that's in/out
with different starting/ending points.

There's no denying that JSON parsing is a very different code style than
statistical number crunching, and mixing the two is incongruous, so it's
worth a shot, and I'll try that for v9.

>
> A second comment is for 0005. The routines of attributes.c are
> applied to the new clear and restore functions. Shouldn't these be in
> stats_utils.c at the end? That's where the "common" functions used by
> the stats manipulation logic are.
>

I assume you're referring to attribute_stats.c. I think that would cause
stats_utils.c to have to pull in a lot of things from attribute_stats.c,
and that would create the exact sort of include-pollution that you're
trying to avoid in the mvdistinct.c/dependencies.c situation mentioned
above.

The lone exception to this is text_to_stavalues(), which is a fancy
wrapper around array_in() and could perhaps be put to even more generic
use outside of stats in general.

The functions in question are needed because the exprs value is itself an
array of partly-filled-out pg_attribute tuples, so it's common to those two
needs, but specific to stats about attributes. Maybe we need an
attr_stats_utils.h?
