Re: PoC/WIP: Extended statistics on expressions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-01-18 15:48:28
Message-ID: CAEZATCVnQccV_Z7AuvbvquaOYxUj+cVFk48pPGASYHFjXWgW5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.

In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.

In evaluate_expressions():

+ /* set the pointers */
+ result = (ExprInfo *) ptr;
+ ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().

A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.

Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
s | foo | |
(1 row)

but after populating and analysing the table, this becomes:

statistics_name | tablename | expr | n_distinct
-----------------+-----------+---------+------------
s | foo | (a + b) | 11
s | foo | (a * b) | 11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
s | foo | |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+ LEFT JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON true;

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-01-18 15:50:37 Re: Key management with tests
Previous Message Bruce Momjian 2021-01-18 15:43:58 Re: Key management with tests