Re: extended statistics: n-distinct

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: alvherre(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, tomas(dot)vondra(at)2ndquadrant(dot)com, dean(dot)a(dot)rasheed(at)gmail(dot)com
Subject: Re: extended statistics: n-distinct
Date: 2017-03-21 09:00:24
Message-ID: 20170321.180024.50058401.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for finishing this.

At Mon, 20 Mar 2017 16:02:20 -0300, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20170320190220(dot)ixlaueanxegqd5gr(at)alvherre(dot)pgsql>
> Here is a closer to final version of the multivariate statistics series,
> last posted at
> https://www.postgresql.org/message-id/20170316222033.ncdi7nidah2gdzjx%40alvherre.pgsql

I'm sorry but this seems conflicting the current master(17fa3e8)
and a3eac988c267, which is the base of v28.

> If you've always wanted to review multivariate stats, but never found a
> good reason to, now is a terrific time to do so! (In other words: I
> plan to get this pushed in the not too distant future.)

Great! But sorry for having contributed not so much.

> This is a new thread to present a version of the n-distinct patch that
> IMO is close enough to commit. There are some work items still.
> There's some discussion on the topic of cross-column statistics:
> https://wiki.postgresql.org/wiki/Cross_Columns_Stats
>
> This problem is important enough that Kyotaro Horiguchi submitted
> another patch that does the same thing:
> https://www.postgresql.org/message-id/flat/20150828.173334.114731693.horiguchi.kyotaro%40lab.ntt.co.jp
> This patch aims to provide the same functionality, keeping the design
> general enough that other kinds of statistics can be added later (such
> as functional dependencies, histograms and MCVs, all of which have been
> previously submitted as patches by Tomas).

I may be stupid but I don't get the picture here, specifically
about the relation to Tomas's patch. Does this work as
infrastructure for Tomas's mv patch? Or in some other
relationsip?

> To recap, what this patch provides is a new command of the form
> CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table
>
> Note that we put the table name in a separate FROM clause instead of
> together with the column name, so that this is more readily extensible
> to things that are not just columns, for example expressions that might
> involve more than one table (per review from Dean Rasheed). Currently,
> only one table is supported.
>
> In this patch, the "opts" can only be "ndistinct", which creates a
> pg_statistic_ext row with the number of distinct groups found in all
> possible combination across that set of columns. This can be used when
> a GROUP BY or a DISTINCT clause need to estimate the number of distinct
> groups in an aggregation.
>
>
>
> Some things left to change:
>
> * Currently, we use the ndistinct value only if the grouping uses
> exactly the set of columns covered by a statistics. For example, if we
> have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
> the old method, which may result in worse results than if we used the
> number we know about (a,b,c) then applied a fixup to consider the
> distinctness of (d).

Do you planning to realize correcting esitimation of joins
perplexed by strong correlations?

> * Also, estimate_num_groups() looks a bit patchy. With slightly more
> invasive changes we can make it look more natural.
>
> * I'm not terribly happy with the header organization. I think
> VacAttrStats should be in its own (new) src/include/statistics/analyze.h
> for example (which cleans up a bunch of existing stuff a bit), and the
> new files could do with some slight makeover.
>
> * The current code uses AttrNumber * and int2vector, in places where it
> would be more convenient to use Bitmapsets.
>
> * We currently try to keep a stats object even if a column in it is
> dropped -- for example, if we have stats on (a,b,c) and drop (b), then
> we still have stats on (a,c). While this is nice, it creates a bunch of
> weird corner cases, so I'm going to rip that out and just drop the
> statistics instead. If the user wants stats on (a,c) to remain, they
> can create it after (or before) dropping the column.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-03-21 09:05:42 Re: Partitioned tables and relfilenode
Previous Message Ideriha, Takeshi 2017-03-21 08:59:25 Re: Other formats in pset like markdown, rst, mediawiki