extended statistics: n-distinct

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: extended statistics: n-distinct
Date: 2017-03-20 19:02:20
Message-ID: 20170320190220.ixlaueanxegqd5gr@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Here is a closer to final version of the multivariate statistics series,
last posted at
If you've always wanted to review multivariate stats, but never found a
good reason to, now is a terrific time to do so! (In other words: I
plan to get this pushed in the not too distant future.)

This is a new thread to present a version of the n-distinct patch that
IMO is close enough to commit. There are some work items still.
There's some discussion on the topic of cross-column statistics:

This problem is important enough that Kyotaro Horiguchi submitted
another patch that does the same thing:
This patch aims to provide the same functionality, keeping the design
general enough that other kinds of statistics can be added later (such
as functional dependencies, histograms and MCVs, all of which have been
previously submitted as patches by Tomas).

To recap, what this patch provides is a new command of the form
CREATE STATISTICS statname [WITH (opts)] ON (columns) FROM table

Note that we put the table name in a separate FROM clause instead of
together with the column name, so that this is more readily extensible
to things that are not just columns, for example expressions that might
involve more than one table (per review from Dean Rasheed). Currently,
only one table is supported.

In this patch, the "opts" can only be "ndistinct", which creates a
pg_statistic_ext row with the number of distinct groups found in all
possible combination across that set of columns. This can be used when
a GROUP BY or a DISTINCT clause need to estimate the number of distinct
groups in an aggregation.

Some things left to change:

* Currently, we use the ndistinct value only if the grouping uses
exactly the set of columns covered by a statistics. For example, if we
have stats on (a,b,c) and the grouping is on (a,b,c,d), we fall back to
the old method, which may result in worse results than if we used the
number we know about (a,b,c) then applied a fixup to consider the
distinctness of (d).

* Also, estimate_num_groups() looks a bit patchy. With slightly more
invasive changes we can make it look more natural.

* I'm not terribly happy with the header organization. I think
VacAttrStats should be in its own (new) src/include/statistics/analyze.h
for example (which cleans up a bunch of existing stuff a bit), and the
new files could do with some slight makeover.

* The current code uses AttrNumber * and int2vector, in places where it
would be more convenient to use Bitmapsets.

* We currently try to keep a stats object even if a column in it is
dropped -- for example, if we have stats on (a,b,c) and drop (b), then
we still have stats on (a,c). While this is nice, it creates a bunch of
weird corner cases, so I'm going to rip that out and just drop the
statistics instead. If the user wants stats on (a,c) to remain, they
can create it after (or before) dropping the column.

Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
extstat-ndistinct-29.patch text/plain 168.1 KB


Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2017-03-20 19:10:25 Re: logical replication access control patches
Previous Message Beena Emerson 2017-03-20 18:07:39 Re: increasing the default WAL segment size