Re: PoC/WIP: Extended statistics on expressions

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-01-17 02:55:20
Message-ID: CALNJ-vRz-hsCFVNQLm7Amxse+J=dLwSuwbJNeYOk-9mwvh_AyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,
+ * Check that only the base rel is mentioned. (This should be dead code
+ * now that add_missing_from is history.)
+ */
+ if (list_length(pstate->p_rtable) != 1)

If it is dead code, it can be removed, right ?

For statext_mcv_clauselist_selectivity:

+ // bms_free(list_attnums[listidx]);

The commented line can be removed.

+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool
*expronleftp)

Better add some comment for examine_clause_args2 since there
is examine_clause_args() already.

+ if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)

When would stats->minrows have negative value ?

For serialize_expr_stats():

+ sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+ /* lookup OID of composite type for pg_statistic */
+ typOid = get_rel_type_id(StatisticRelationId);
+ if (!OidIsValid(typOid))
+ ereport(ERROR,

Looks like the table_open() call can be made after the typOid check.

+ Datum values[Natts_pg_statistic];
+ bool nulls[Natts_pg_statistic];
+ HeapTuple stup;
+
+ if (!stats->stats_valid)

It seems the local arrays can be declared after the validity check.

+ if (enabled[i] == STATS_EXT_NDISTINCT)
+ ndistinct_enabled = true;
+ if (enabled[i] == STATS_EXT_DEPENDENCIES)
+ dependencies_enabled = true;
+ if (enabled[i] == STATS_EXT_MCV)

the second and third if should be preceded with 'else'

+ReleaseDummy(HeapTuple tuple)
+{
+ pfree(tuple);

Since ReleaseDummy() is just a single pfree call, maybe you don't need this
method - call pfree in its place.

Cheers

On Sat, Jan 16, 2021 at 4:24 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> On 1/17/21 12:22 AM, Justin Pryzby wrote:
> > On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote:
> >> + <entry role="catalog_table_entry"><para role="column_definition">
> >> + <structfield>expr</structfield> <type>text</type>
> >> + </para>
> >> + <para>
> >> + Expression the extended statistics is defined on
> >> + </para></entry>
> >
> > Expression the extended statistics ARE defined on
> > Or maybe say "on which the extended statistics are defined"
> >
>
> I'm pretty sure "is" is correct because "expression" is singular.
>
> >> + <para>
> >> + The <command>CREATE STATISTICS</command> command has two basic
> forms. The
> >> + simple variant allows to build statistics for a single expression,
> does
> >
> > .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT
> does)
> >
> >> + Expression statistics are per-expression and are similar to
> creating an
> >> + index on the expression, except that they avoid the overhead of the
> index.
> >
> > Maybe say "overhead of index maintenance"
> >
>
> Yeah, that sounds better.
>
> >> + All functions and operators used in a statistics definition must be
> >> + <quote>immutable</quote>, that is, their results must depend only on
> >> + their arguments and never on any outside influence (such as
> >> + the contents of another table or the current time). This
> restriction
> >
> > say "outside factor" or "external factor"
> >
>
> In fact, we've removed the immutability restriction, so this paragraph
> should have been removed.
>
> >> + results of those expression, and uses default estimates as
> illustrated
> >> + by the first query. The planner also does not realize the value of
> the
> >
> > realize THAT
> >
>
> OK, changed.
>
> >> + second column fully defines the value of the other column, because
> date
> >> + truncated to day still identifies the month. Then expression and
> >> + ndistinct statistics are built on those two columns:
> >
> > I got an error doing this:
> >
> > CREATE TABLE t AS SELECT generate_series(1,9) AS i;
> > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t;
> > ANALYZE t;
> > SELECT i+1 FROM t GROUP BY 1;
> > ERROR: corrupt MVNDistinct entry
> >
>
> Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting
> in mismatching the ndistinct coefficient items. The attached patch fixes
> that, but I've realized the way we pick the "best" statistics may need
> some improvements (I added an XXX comment about that).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2021-01-17 05:16:09 RE: Disable WAL logging to speed up data loading
Previous Message Tom Lane 2021-01-17 02:12:03 Calculation of relids (pull_varnos result) for PlaceHolderVars