Re: Statistics Import and Export

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Statistics Import and Export
Date: 2024-03-08 19:17:31
Message-ID: CADkLM=eM7rqJpArtRrxsVbOn=3GBRUyKtQbC_TxtFe010QLR8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> Perhaps it's just me ... but it doesn't seem like it's all that many
>
parameters.
>

It's more than I can memorize, so that feels like a lot to me. Clearly it's
not insurmountable.

> > I am a bit concerned about the number of locks on pg_statistic and the
> > relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> > attribute rather than once per relation. But I also see that this will
> > mostly get used at a time when no other traffic is on the machine, and
> > whatever it costs, it's still faster than the smallest table sample
> (insert
> > joke about "don't have to be faster than the bear" here).
>
> I continue to not be too concerned about this until and unless it's
> actually shown to be an issue. Keeping things simple and
> straight-forward has its own value.
>

Ok, I'm sold on that plan.

>
> > +/*
> > + * Set statistics for a given pg_class entry.
> > + *
> > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > + *
> > + * This does an in-place (i.e. non-transactional) update of pg_class,
> just as
> > + * is done in ANALYZE.
> > + *
> > + */
> > +Datum
> > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > +{
> > + const char *param_names[] = {
> > + "relation",
> > + "reltuples",
> > + "relpages",
> > + };
> > +
> > + Oid relid;
> > + Relation rel;
> > + HeapTuple ctup;
> > + Form_pg_class pgcform;
> > +
> > + for (int i = 0; i <= 2; i++)
> > + if (PG_ARGISNULL(i))
> > + ereport(ERROR,
> > +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("%s cannot be NULL",
> param_names[i])));
>
> Why not just mark this function as strict..? Or perhaps we should allow
> NULLs to be passed in and just not update the current value in that
> case?

Strict could definitely apply here, and I'm inclined to make it so.

> Also, in some cases we allow the function to be called with a
> NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> OID ends up being NULL).

Thoughts on it emitting a WARN or NOTICE before returning false?

> Not sure if that makes sense here or not
> offhand but figured I'd mention it as something to consider.
>
> > + pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > + pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > + pgcform->relpages = PG_GETARG_INT32(2);
>
> Shouldn't we include relallvisible?
>

Yes. No idea why I didn't have that in there from the start.

> Also, perhaps we should use the approach that we have in ANALYZE, and
> only actually do something if the values are different rather than just
> always doing an update.
>

That was how it worked back in v1, more for the possibility that there was
no matching JSON to set values.

Looking again at analyze.c (currently lines 1751-1780), we just check if
there is a row in the way, and if so we replace it. I don't see where we
compare existing values to new values.

>
> > +/*
> > + * Import statistics for a given relation attribute
> > + *
> > + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> > + * stanullfrac float4, stawidth int, stadistinct
> float4,
> > + * stakind1 int2, stakind2 int2, stakind3 int3,
> > + * stakind4 int2, stakind5 int2, stanumbers1
> float4[],
> > + * stanumbers2 float4[], stanumbers3 float4[],
> > + * stanumbers4 float4[], stanumbers5 float4[],
> > + * stanumbers1 float4[], stanumbers2 float4[],
> > + * stanumbers3 float4[], stanumbers4 float4[],
> > + * stanumbers5 float4[], stavalues1 text,
> > + * stavalues2 text, stavalues3 text,
> > + * stavalues4 text, stavalues5 text);
> > + *
> > + *
> > + */
>
> Don't know that it makes sense to just repeat the function declaration
> inside a comment like this- it'll just end up out of date.
>

Historical artifact - previous versions had a long explanation of the JSON
format.

>
> > +Datum
> > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
>
> > + /* names of columns that cannot be null */
> > + const char *required_param_names[] = {
> > + "relation",
> > + "attname",
> > + "stainherit",
> > + "stanullfrac",
> > + "stawidth",
> > + "stadistinct",
> > + "stakind1",
> > + "stakind2",
> > + "stakind3",
> > + "stakind4",
> > + "stakind5",
> > + };
>
> Same comment here as above wrt NULL being passed in.
>

In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
and are NULL more often than not.

>
> > + for (int k = 0; k < 5; k++)
>
> Shouldn't we use STATISTIC_NUM_SLOTS here?
>

Yes, I had in the past. Not sure why I didn't again.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-03-08 19:26:44 Re: speed up a logical replica setup
Previous Message Peter Geoghegan 2024-03-08 19:14:04 Re: btree: downlink right separator/HIKEY optimization