Re: PoC/WIP: Extended statistics on expressions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-03-05 14:32:29
Message-ID: CAEZATCUo1xA4d5e=kXw5E0z7xKEi1oL_aLx=wBGT1A4vdc_gYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 4 Mar 2021 at 22:16, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> Attached is a slightly improved version of the patch series, addressing
> most of the issues raised in the previous message.

Cool. Sorry for the delay replying.

> 0003-Extended-statistics-on-expressions-20210304.patch
>
> Mostly unchanged, The one improvement is removing some duplicate code in
> in mvc.c.
>
> 0004-WIP-rework-tracking-of-expressions-20210304.patch
>
> This is mostly unchanged of the patch reworking how we assign artificial
> attnums to expressions (negative instead of (MaxHeapAttributeNumber+i)).

Looks good.

I see you undid the change to get_relation_statistics() in plancat.c,
which offset the attnums of plain attributes in the StatisticExtInfo
struct. I was going to suggest that as a simplification to the
previous 0004 patch. Related to that, is this comment in
dependencies_clauselist_selectivity():

/*
* Count matching attributes - we have to undo two attnum offsets.
* First, the dependency is offset using the number of expressions
* for that statistics, and then (if it's a plain attribute) we
* need to apply the same offset as above, by unique_exprs_cnt.
*/

which needs updating, since there is now just one attnum offset, not
two. Only the unique_exprs_cnt offset is relevant now.

Also, related to that change, I don't think that
stat_covers_attributes() is needed anymore. I think that the code that
calls it can now just be reverted back to using bms_is_subset(), since
that bitmapset holds plain attributes that aren't offset.

> 0005-WIP-unify-handling-of-attributes-and-expres-20210304.patch
>
> This reworks how we build statistics on attributes and expressions.
> Instead of treating attributes and expressions separately, this allows
> handling them uniformly.
>
> Until now, the various "build" functions (for different statistics
> kinds) extracted attribute values from sampled tuples, but expressions
> were pre-calculated in a separate array. Firstly to save CPU time (not
> having to evaluate expensive expressions repeatedly) and to keep the
> different stats consistent (there might be volatile functions etc.).
>
> So the build functions had to look at the attnum, determine if it's
> attribute or expression, and in some cases it was tricky / easy to get
> wrong.
>
> This patch replaces this "split" view with a simple "consistent"
> representation merging values from attributes and expressions, and just
> passes that to the build functions. There's no need to check the attnum,
> and handle expressions in some special way, so the build functions are
> much simpler / easier to understand (at least I think so).
>
> The build data is represented by "StatsBuildData" struct - not sure if
> there's a better name.
>
> I'm mostly happy with how this turned out. I'm sure there's a bit more
> cleanup needed (e.g. the merging/remapping of dependencies needs some
> refactoring, I think) but overall this seems reasonable.

Agreed. That's a nice improvement.

I wonder if dependency_is_compatible_expression() can be merged with
dependency_is_compatible_clause() to reduce code duplication. It
probably also ought to be possible to support "Expr IN Array" there,
in a similar way to the other code in statext_is_compatible_clause().
Also, should this check rinfo->clause_relids against the passed-in
relid to rule out clauses referencing other relations, in the same way
that statext_is_compatible_clause() does?

> I did some performance testing, I don't think there's any measurable
> performance degradation. I'm actually wondering if we need to transform
> the AttrNumber arrays into bitmaps in various places - maybe we should
> just do a plain linear search. We don't really expect many elements, as
> each statistics has 8 attnums at most. So maybe building the bitmapsets
> is a net loss? The one exception might be functional dependencies, where
> we can "merge" multiple statistics together. But even then it'd require
> many statistics objects to make a difference.

Possibly. There's a danger in trying to change too much at once
though. As it stands, I think it's fairly close to being committable,
with just a little more tidying up.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-03-05 14:48:43 Re: Which PG version does CVE-2021-20229 affected?
Previous Message David Steele 2021-03-05 14:27:05 Re: DROP INDEX CONCURRENTLY on partitioned index