Re: multivariate statistics (v25)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: david(at)fetter(dot)org, dean(dot)a(dot)rasheed(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics (v25)
Date: 2017-03-02 14:52:51
Message-ID: 7dedc58b-f9cf-fe34-ef7f-382f776ce953@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/02/2017 07:42 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> At Thu, 2 Mar 2017 04:05:34 +0100, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <a78ffb17-70e8-a55a-c10c-66ab575e88ed(at)2ndquadrant(dot)com>
>> OK,
>>
>> attached is v24 of the patch series, addressing most of the reported
>> issues and comments (at least I believe so). The main changes are:
>
> Unfortunately, 0002 conflicts with the current master
> (4461a9b). Could you rebase them or tell us the commit where this
> patches stand on?
>

Attached is a rebased patch series, otherwise it's the same as v24.

FWIW it was based on 016c990834 from Feb 28, but apparently some recent
patch caused a minor conflict.

> I only saw the patch files but have some comments.
>
>> 1) I've mostly abandoned the "multivariate" name in favor of
>> "extended", particularly in places referring to stats stored in the
>> pg_statistic_ext in general. "Multivariate" is now used only in places
>> talking about particular types (e.g. multivariate histograms).
>>
>> The "extended" name is more widely used for this type of statistics,
>> and the assumption is that we'll also add other (non-multivariate)
>> types of statistics - e.g. statistics on custom expressions, or some
>> for of join statistics.
>
> In 0005, and
>
> @@ -184,14 +208,43 @@ clauselist_selectivity(PlannerInfo *root,
> * If there are no such stats or not enough attributes, don't waste time
> * simply skip to estimation using the plain per-column stats.
> */
> + if (has_stats(stats, STATS_TYPE_MCV) &&
> ...
> + /* compute the multivariate stats */
> + s1 *= clauselist_ext_selectivity(root, mvclauses, stat);
> ====
> @@ -1080,10 +1136,71 @@ clauselist_ext_selectivity_deps(PlannerInfo *root, Index relid,
> }
>
> /*
> + * estimate selectivity of clauses using multivariate statistic
>
> These comment is left unchanged? or on purpose? 0007 adds very
> similar texts.
>

Hmm, those comments should be probably changed to "extended".

>> 2) Catalog pg_mv_statistic was renamed to pg_statistic_ext (and
>> pg_mv_stats view renamed to pg_stats_ext).
>
> FWIW, "extended statistic" would be abbreviated as
> "ext_statistic" or "extended_stats". Why have you exchanged the
> words?
>

Because this way it's clear it's a version of pg_statistic, and it will
be sorted right next to it.

>> 3) The structure of pg_statistic_ext was changed as proposed by
>> Alvaro, i.e. the boolean flags were removed and instead we have just a
>> single "char[]" column with list of enabled statistics.
>>
>> 4) I also got rid of the "mv" part in most variable/function/constant
>> names, replacing it by "ext" or something similar. Also mvstats.h got
>> renamed to stats.h.
>>
>> 5) Moved the files from src/backend/utils/mvstats to
>> backend/statistics.
>>
>> 6) Fixed the n_choose_k() overflow issues by using the algorithm
>> proposed by Dean. Also, use the simple formula for num_combinations().
>>
>> 7) I've tweaked data types for a few struct members (in stats.h). I've
>> kept most of the uint32 fields at the top level though, because int16
>> might not be large enough for large statistics and the overhead is
>> minimal (compared to the space needed e.g. for histogram buckets).
>
> Some formulated proof or boundary value test cases might be
> needed (to prevent future trouble). Or any defined behavior on
> overflow of them might be enough. I belive all (or most) of
> overflow-able data has such behavior.
>

That is probably a good idea and I plan to do that.

>> The renames/changes were quite widespread, but I've done my best to
>> fix all the comments and various other places.
>>
>> regards
>
> regards,
>

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-03-02 14:53:45 Re: multivariate statistics (v25)
Previous Message David Steele 2017-03-02 14:43:06 Re: tuplesort_gettuple_common() and *should_free argument