Re: Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: WangYu <wangyu_runtime(at)163(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Re: Extended Statistics set/restore/clear functions.
Date: 2025-12-04 23:12:16
Message-ID: CADkLM=csMd52i39Ye8-PUUHyzBb3546eSCUTh-FBQ7bzT2uZ4Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
>
> Indeed. These exact typos existed in some of the previous iterations
> of the patch posted on this thread.
>

+1

> 1) statatt_get_type(), that retrieves a set of information from a
> relation and one of its attributes. This is lightly documented in
> extended_statistics_update() from 0002, with an argument claiming that
> we do the same as attribute statistics. If one then looks at
> attribute_stats.c, we have a not-really-helpful "derive information
> from attribute". The trick is that we have an implied dependency
> here: statatt_get_type() needs to be called *before*
> statatt_get_elem_type() for attribute *and* extended stats,
> statatt_get_elem_type() being fed data from statatt_get_type().
>

I've added comments to that effect.

>
> 2) statatt_get_elem_type(), interesting piece of the puzzle that acts
> as a wrapper for the type cache. An important part here, it seems to
> me, would be to at least tell that this relies on data retrieved from
> statatt_get_type(), initially. Once I've noticed this dependency the
> logic felt a bit cleaner to understand, but we have no docs explaining
> any of that..
>

I've added comments that will hopefully connect those dots.

>
> 3) statatt_init_empty_tuple(), that initializes a set of data to be
> used for updates and/or insertion. My first thought here is about
> ...
> not apply to extended stats. But one would have to guess this fact
> after knowing that this relates to the catalog pg_statistic..
>

The comments didn't seem quite so illuminating, but I added them just the
same.

>
> 4) statatt_set_slot is slightly simpler. Still here, it is really easy
> to miss that this routine should be fed data retrieved by
> statatt_get_type(). staop can also be an eq or an lt operator.
> stakind is one of the STATISTIC_KIND_* values, etc, etc.
>

Some here too.

> As far as I can see, there is nothing in these functions that require
> them to be located in attribute_stats.c anymore. Let's move that to
> stats_utils.c, common place for all the shared facilities used by
> these stats modules. Or it could also be a new, separate patch, for
> the manipulation and extraction of the tuple data related to
> the pg_statistic tuples.
>

They've been moved to stats_utils.c, and impact was fairly minimal. I was
worried about an explosion of #includes, but that ended up not being the
case.

> Perhaps you know what these imply because you have written this code
> originally in ce207d2a7901, but exposing them also means that it is
> very important to document at least some concepts and assumptions
> around them so as it is possible to guide somebody that may want to
> use this code:
> - What are the input arguments from?
> - And what are the results?
> - In which circumstances should these be used?
>

An attempt was made.

> Then about 0002.
>
> There are a bunch of assumptions embedded inside import_expressions()
> that are interdependent with the calls of these attribute routines
> ...
> each STATISTIC_KIND_*. Too much is let to the reader to guess, making
> the code complicated to maintain as written, at least IMHO.
>

I tried to expand on these things, but I haven't yet moved them to another
file for reasons I'll explain later.

> Is there any need to locate these new functions and code in
> extended_stats.c, actually? A separation into a new file seems like
> it would be a cleaner result, leaving extended_stats.c to deal with
> the import of the new data, including the fetch and deletion of any
> existing data in pg_stat_ext that would be updated or inserted.
> Perhaps name that extended_stats_funcs.c, as these are about the
> direct SQL functions, import and deletion.
>

We can move them to another file, but I'm not sure if that will actually
make things clearer. What is extended_stats.c if not functions about
extended stats? Perhaps it makes more sense to try harder to find the
commonality in the building of the pg_statistic tuples, move those
functions to stats_utils, and get things decluttered that way?

For the tests, it looks like it would be better to have everything in
> a new file, like a stats_ext_import.sql for the clear and restore
> bits.
>

Those tests very much build upon the tables and data already created for
regular relation/attribute imports, so we save quite a bit of boilerplate
in keeping them there.

>
> A lot of the tests are copy-pastes of surrounding queries. Perhaps it
> would be better to use a SQL function wrapper or an IN clause with
> multiple relations?

I'd say "no", because we're doing a lot of things that *could* generate
ERRORs, and we're verifying that they generate WARNINGs instead. If we did
batch up those commands, and any one of them generated an error, we'd be at
loss for which one was the culprit, and we'd also lose any insight into
what else got busted in the other rows of the batch.

> The patch has a lot of bloat with these repeated
> queries.. Perhaps we should use a split for the elements in the
> extended stats array instead of jsonb_pretty(). The results get quite
> long here.
>

I replaced '}, ' with '},\n' and removed the jsonb_pretty entirely, which
is something I had proposed doing earlier, as it's actually more pretty
than jsonb_pretty(). It's still in the same file, though.

I like the net result of this, and it might make sense to search other
regression tests for places where we can employ this mini-pretty pattern.

> As of 0002, there are actually two independent pieces dealt with: the
> deletion of extended stats with pg_clear_extended_stats(), and the
> insert/update of extended stats with pg_restore_extended_stats(). I'd
> suggest to split that into two parts, with the clear being first in
> rank. The deletion is simpler, and getting that in first simplifies
> the review of the import part with less input to deal with.
>

The pg_clear function is quite trivial, as it's basically just marshalling
and locking before a call to delete_pg_statistic_ext_data(). So while I can
split them out, I'd also have to split the pg_proc.dat changes into two
separate ones, which makes it hard to see how similar they are. I have a
similar feeling about the changes to func-admin.sgml.

All of that, plus a rebase.

Attachment Content-Type Size
v20-0001-Expose-attribute-statistics-functions-for-use-in.patch text/x-patch 30.9 KB
v20-0002-Add-extended-statistics-support-functions.patch text/x-patch 103.3 KB
v20-0003-Include-Extended-Statistics-in-pg_dump.patch text/x-patch 13.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-12-04 23:14:58 Re: [Proposal] Adding callback support for custom statistics kinds
Previous Message Paul A Jungwirth 2025-12-04 23:11:36 Clarify temporal FK comments