Quick Links

Re: Improve docs for n_distinct_inherited

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Improve docs for n_distinct_inherited
Date:	2025-10-13 03:43:46
Message-ID:	CAApHDvq+4ru2AJRMjvQH7kn=8OWURbgcXTDz8JeSvp8o8PAf8g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, 13 Oct 2025 at 14:28, David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> How about:
>
> "n_disinct is used for normal tables while n_distinct_inherited is used for partitioned tables. Both are usable (selected via the ONLY modifier) for an inheritance parent table since it has both storage and children."

I specifically was trying my best to avoid using the word "normal"
here as it's very indescriptive. Rhetorical question: would you
expect a setting which applies to "normal" tables to be applied to a
materialized view? Instead, I described the cases where
n_distinct_inherited applies, and left it to the reader to assume that
n_distinct applies in all other cases, which IMO is correct. I don't
see the need to talk about ONLY. I'm only discussing stored statistics
in this part, i.e, what's in pg_statistic.

> The use of both "Ordinarily" and "overwritten" is bothering me here. And it implies that n_distinct doesn't work for inheritance parent tables or, conversely, that n_distinct does work for partitioned tables.

I don't know why you think that implies that n_distinct works for
partitioned tables. I've specifically said that "n_distinct_inherited
allows the n_distinct estimate to be overwritten for partitioned
tables". Why would the reader then assume n_distinct is for doing
that?

As for "overwritten", this is exactly what changing this setting does
to the pg_statistic.stadistinct when ANALYZE runs for the table, so as
far as I'm concerned, it's a good and accurate way to describe what
the settings do.

I'm open to finding something better for "Ordinarily
<literal>n_distinct</literal> is used."

>> I also fixed what I thought was some misleading text about ANALYZE
>> using this value to calculate things.
>
>
> > values in the column is linear with the estimated number
> > of rows in the table; the exact count is to be computed by multiplying the estimated
> > rows in the table by the absolute value of the given number.
>
> "...is proportional to the estimated number of rows in the table at planning time."
>
> (The final "at planning time" substitutes for the sentence you pondered removing.)

I've adjusted the "proportional" part, but the "add planning time" at
the end is surplus because we've already mentioned that this is
happening in "the query planner". I doubt anyone needs the additional
reassurance that things that the query planner does happen at planning
time.

> (The rest, including the examples, seem a bit self-explanatory given the definition, though I do get reader inexperience with the terminology. But proportional implies a linear relationship, and the positive/negative bifurcation seems straight-forward here.)

I don't feel the need to remove the examples. I think they're good to
reinforce how the negative values are utilised during planning.

> I'm thinking everything else below is better incorporated into 14.2 which should be linking back to this section. That way the crux of the usage is defined in syntax while the details about setting a specific value are located in the section covering the overall topic.
>
> the exact count is to be computed by multiplying the estimated
> + rows in the table by the absolute value of the given number. For example,
> a value of -1 implies that all values in the column are distinct, while
> - a value of -0.5 implies that each value appears twice on the average.
> + a value of -0.5 implies that each value appears twice on average.
> This can be useful when the size of the table changes over time, since
> the multiplication by the number of rows in the table is not performed
> until query planning time.
>
> (Leave: Specify a value of 0 to revert to estimating...)

Do you mean remove that part? I did consider that as I didn't think
the reader needed to know that as they could use RESET (n_distinct);

> That said, this rework would be OK as-is.
>
> Also, looking at stadistinct, the multiplier stored there accounts for the presence of null. This attribute-option does not. Is that difference worth noting?

It's already becoming more painful to get this changed than I would
have anticipated. Feel free to suggest something on a new thread if
you think it's worthwhile to mention it.

David

In response to

Re: Improve docs for n_distinct_inherited at 2025-10-13 01:27:48 from David G. Johnston

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	David Rowley	2025-10-13 03:48:42	Re: Improve docs for n_distinct_inherited
Previous Message	Chao Li	2025-10-13 03:41:15	Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward