Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-01-03 21:55:01
Message-ID: 96973208-28b2-0444-ddfd-45a1f63911b6@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/30/2016 02:12 PM, Petr Jelinek wrote:
> On 12/12/16 22:50, Tomas Vondra wrote:
>> On 12/12/2016 12:26 PM, Amit Langote wrote:
>>>
>>> Hi Tomas,
>>>
>>> On 2016/10/30 4:23, Tomas Vondra wrote:
>>>> Hi,
>>>>
>>>> Attached is v20 of the multivariate statistics patch series, doing
>>>> mostly
>>>> the changes outlined in the preceding e-mail from October 11.
>>>>
>>>> The patch series currently has these parts:
>>>>
>>>> * 0001 : (FIX) teach pull_varno about RestrictInfo
>>>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients
>
> Hi,
>
> I went over these two (IMHO those could easily be considered as minimal
> committable set even if the user visible functionality they provide is
> rather limited).
>

Yes, although I still have my doubts 0001 is the right way to make
pull_varnos work. It's probably related to the bigger design question,
because moving the statistics selection to an earlier phase could make
it unnecessary I guess.

>> dropping statistics
>> -------------------
>>
>> The statistics may be dropped automatically using DROP STATISTICS.
>>
>> After ALTER TABLE ... DROP COLUMN, statistics referencing are:
>>
>> (a) dropped, if the statistics would reference only one column
>>
>> (b) retained, but modified on the next ANALYZE
>
> This should be documented in user visible form if you plan to keep it
> (it does make sense to me).
>

Yes, I plan to keep it. I agree it should be documented, probably on the
ALTER TABLE page (and linked from CREATE/DROP statistics pages).

>> + therefore perfectly correlated. Providing additional information about
>> + correlation between columns is the purpose of multivariate statistics,
>> + and the rest of this section thoroughly explains how the planner
>> + leverages them to improve estimates.
>> + </para>
>> +
>> + <para>
>> + For additional details about multivariate statistics, see
>> + <filename>src/backend/utils/mvstats/README.stats</>. There are additional
>> + <literal>READMEs</> for each type of statistics, mentioned in the following
>> + sections.
>> + </para>
>> +
>> + </sect1>
>
> I don't think this qualifies as "thoroughly explains" ;)
>

OK, I'll drop the "thoroughly" ;-)

>> +
>> +Oid
>> +get_statistics_oid(List *names, bool missing_ok)
>
> No comment?
>
>> + case OBJECT_STATISTICS:
>> + msg = gettext_noop("statistics \"%s\" does not exist, skipping");
>> + name = NameListToString(objname);
>> + break;
>
> This sounds somewhat weird (plural vs singular).
>

Ah, right - it should be either "statistic ... does not" or "statistics
... do not". I think "statistics" is the right choice here, because (a)
we have CREATE STATISTICS and (b) it may be a combination of statistics,
e.g. histogram + MCV.

>> + * XXX Maybe this should check for duplicate stats. Although it's not clear
>> + * what "duplicate" would mean here (wheter to compare only keys or also
>> + * options). Moreover, we don't do such checks for indexes, although those
>> + * store tuples and recreating a new index may be a way to fix bloat (which
>> + * is a problem statistics don't have).
>> + */
>> +ObjectAddress
>> +CreateStatistics(CreateStatsStmt *stmt)
>
> I don't think we should check duplicates TBH so I would remove the XXX
> (also "wheter" is typo but if you remove that paragraph it does not matter).
>

Yes, I came to the same conclusion - we can only really check for exact
matches (same set of columns, same choice of statistic types), but
that's fairly useless. I'll remove the XXX.

>> + if (true)
>> + {
>
> Huh?
>

Yeah, that's a bit weird pattern. It's a remainder of copy-pasting the
preceding block, which looks like this

if (hasindex)
{
...
}

But we've decided to not add similar flag for the statistics. I'll move
the block to a separate function (instead of merging it directly into
the function, which is already a bit largeish).

>> +
>> +List *
>> +RelationGetMVStatList(Relation relation)
>> +{
> ...
>> +
>> +void
>> +update_mv_stats(Oid mvoid, MVNDistinct ndistinct,
>> + int2vector *attrs, VacAttrStats **stats)
> ...
>> +static double
>> +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows,
>> + int2vector *attrs, VacAttrStats **stats,
>> + int k, int *combination)
>> +{
>
>
> Again, these deserve comment.
>

OK, will add.

> I'll try to look at other patches in the series as time permits.

thanks

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2017-01-03 22:02:18 Re: Shrink volume of default make output
Previous Message Peter Eisentraut 2017-01-03 21:51:04 Re: Logical Replication WIP