Re: PoC/WIP: Extended statistics on expressions

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-03-24 16:38:32
Message-ID: ccd0e47a-2326-ffdd-657b-95b79f41db50@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
>

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me. For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".
>

OK "statistics object" seems better and more consistent.

>> + Name of schema containing table
>
> I don't know about the nearby descriptions, but this one sounds too much like a
> "schema-containing" table. Say "Name of the schema which contains the table" ?
>

I think the current spelling is OK / consistent with the other catalogs.

>> + Name of table
>
> Say "name of table on which the extended statistics are defined"
>

I've used "Name of table the statistics object is defined on".

>> + Name of extended statistics
>
> "Name of the extended statistic object"
>
>> + Owner of the extended statistics
>
> ..object
>

OK

>> + Expression the extended statistics is defined on
>
> I think it should say "the extended statistic", or "the extended statistics
> object". Maybe "..on which the extended statistic is defined"
>

OK

>> + of random access to the disk. (This expression is null if the expression
>> + data type does not have a <literal>&lt;</literal> operator.)
>
> expression's data type
>

OK

>> + much-too-small row count estimate in the first two queries. Moreover, the
>
> maybe say "dramatically underestimates the rowcount"
>

I've changed this to "... results in a significant underestimate of row
count".

>> + planner has no information about relationship between the expressions, so it
>
> the relationship
>

OK

>> + assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> + conditions are independent, and multiplies their selectivities together to
>> + arrive at a much-too-high group count estimate in the aggregate query.
>
> severe overestimate ?
>

OK

>> + This is further exacerbated by the lack of accurate statistics for the
>> + expressions, forcing the planner to use default ndistinct estimate for the
>
> use *a default
>

OK

>> + expression derived from ndistinct for the column. With such statistics, the
>> + planner recognizes that the conditions are correlated and arrives at much
>> + more accurate estimates.
>
> are correlated comma
>

OK

>> + if (type->lt_opr == InvalidOid)
>
> These could be !OidIsValid
>

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> + * expressions. It's either expensive or very easy to defeat for
>> + * determined used, and there's no risk if we allow such statistics (the
>> + * statistics is useless, but harmless).
>
> I think it's meant to say "for a determined user" ?
>

Right.

>> + * If there are no simply-referenced columns, give the statistics an auto
>> + * dependency on the whole table. In most cases, this will be redundant,
>> + * but it might not be if the statistics expressions contain no Vars
>> + * (which might seem strange but possible).
>> + */
>> + if (!nattnums)
>> + {
>> + ObjectAddressSet(parentobject, RelationRelationId, relid);
>> + recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> + }
>
> Can this be unconditional ?
>

What would be the benefit? This behavior copied from index_create, so
I'd prefer keeping it the same for consistency reason. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> + * Translate the array of indexs to regular attnums for the dependency (we
>
> sp: indexes
>

OK

>> + * Not found a matching expression, so we can simply skip
>
> Found no matching expr
>

OK

>> + /* if found a matching, */
>
> matching ..
>

Matching dependency.

>> +examine_attribute(Node *expr)
>
> Maybe you should rename this to something distinct ? So it's easy to add a
> breakpoint there, for example.
>

What would be a better name? It's not difficult to add a breakpoint
using line number, for example.

>> + stats->anl_context = CurrentMemoryContext; /* XXX should be using
>> + * something else? */
>
>> + bool nulls[Natts_pg_statistic];
> ...
>> + * Construct a new pg_statistic tuple
>> + */
>> + for (i = 0; i < Natts_pg_statistic; ++i)
>> + {
>> + nulls[i] = false;
>> + }
>
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
>

Maybe, but it's a copy of what update_attstats() does, so I prefer
keeping it the same.

>> + * We don't store collations used to build the statistics, but
>> + * we can use the collation for the attribute itself, as
>> + * stored in varcollid. We do reset the statistics after a
>> + * type change (including collation change), so this is OK. We
>> + * may need to relax this after allowing extended statistics
>> + * on expressions.
>
> This text should be updated or removed ?
>

Yeah, the last sentence is obsolete. Updated.

>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
>> }
>>
>> /* print any extended statistics */
>> - if (pset.sversion >= 100000)
>> + if (pset.sversion >= 140000)
>> + {
>> + printfPQExpBuffer(&buf,
>> + "SELECT oid, "
>> + "stxrelid::pg_catalog.regclass, "
>> + "stxnamespace::pg_catalog.regnamespace AS nsp, "
>> + "stxname,\n"
>> + "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
>> + " 'd' = any(stxkind) AS ndist_enabled,\n"
>> + " 'f' = any(stxkind) AS deps_enabled,\n"
>> + " 'm' = any(stxkind) AS mcv_enabled,\n");
>> +
>> + if (pset.sversion >= 130000)
>> + appendPQExpBufferStr(&buf, " stxstattarget\n");
>> + else
>> + appendPQExpBufferStr(&buf, " -1 AS stxstattarget\n");
>
> >= 130000 is fully determined by >= 14000 :)
>

Ah, right.

>> + * type of the opclass, which is not interesting for our purposes. (Note:
>> + * if we did anything with non-expression index columns, we'd need to
>
> index is wrong ?
>

Fixed

> I mentioned a bunch of other references to "index" and "predicate" which are
> still around:
>

Whooops, sorry. Fixed.

I'll post a cleaned-up version of the patch addressing Dean's review
comments too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2021-03-24 16:42:26 Re: PoC/WIP: Extended statistics on expressions
Previous Message Bernd Helmle 2021-03-24 16:32:26 Re: postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)