Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-10-29 19:23:34
Message-ID: 61e71067-9461-d785-b4a6-6e8a08996d5f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is v20 of the multivariate statistics patch series, doing
mostly the changes outlined in the preceding e-mail from October 11.

The patch series currently has these parts:

* 0001 : (FIX) teach pull_varno about RestrictInfo
* 0002 : (PATCH) shared infrastructure and ndistinct coefficients
* 0003 : (PATCH) functional dependencies (only the ANALYZE part)
* 0004 : (PATCH) selectivity estimation using functional dependencies
* 0005 : (PATCH) multivariate MCV lists
* 0006 : (PATCH) multivariate histograms
* 0007 : (WIP) selectivity estimation using ndistinct coefficients
* 0008 : (WIP) use multiple statistics for estimation
* 0009 : (WIP) psql tab completion basics

Let me elaborate about the main changes in this version:

1) rework CREATE STATISTICS to what Dean Rasheed proposed in [1]:
-----------------------------------------------------------------------

CREATE STATISTICS name WITH (options) ON (columns) FROM table

This allows adding support for statistics on joins, expressions
referencing multiple tables, and partial statistics (with WHERE
predicates, similar to indexes). Although those things are not
implemented (and I don't know if/when that happens), it's good the
syntax supports it.

I've been thinking about using "CREATE STATISTIC" instead, but I decided
to stick with "STATISTICS" for two reasons. Firstly it's possible to
create multiple statistics in a single command, for example by using
WITH (mcv,histogram). And secondly, we already hava "ALTER TABLE ... SET
STATISTICS n" (although that tweaks the statistics target for a column,
not the statistics on the column).

2) no changes to catalog names
-----------------------------------------------------------------------

Clearly, naming things is one of the hardest things in computer science.
I don't have a good idea what names would be better than the current
ones. In any case, this is fairly trivial to do.

3) special data types for statistics
-----------------------------------------------------------------------

Heikki proposed to invent a new data type, similar to pg_node_tree. I do
agree that storing the stats in plain bytea (i.e. catalog having bytea
columns) was not particularly convenient, but I'm not sure how much of
pg_node_tree Heikki wanted to copy.

In particular, I'm not sure whether Heikki's idea was store all the
statistics together in a single Datum, serialized into a text string
(similar to pg_node_tree).

I don't think that would be a good idea, as the statistics may be quite
large and complex, and deserializing them from text format would be
quite expensive. For pg_node_tree that's not a major issue because the
values are usually fairly small. Similarly, packing everything into a
single datum would force the planner to parse/unpack everything, even if
it needs just a small piece (e.g. the ndistinct coefficients, but not
histograms).

So I've decided to invent new data types, one for each statistic type:

* pg_ndistinct
* pg_dependencies
* pg_mcv_list
* pg_histogram

Similarly to pg_node_tree those data types only support output, i.e.
both 'recv' and 'in' functions do elog(ERROR). But while pg_node_tree is
stored as text, those new data types are still bytea.

I do believe this is a good solution, and it allows casting the data
types to text easily, as it simply calls the out function.

The statistics however do not store attnums in the bytea, just indexes
into pg_mv_statistic.stakeys. That means the out functions can't print
column names in the output, or values (because without the attnum we
don't know the type, and thus can't lookup the proper out function).

I don't think there's a good solution for that (I was thinking about
storing the attnums/typeoid in the statistics itself, but that seems
fairly ugly). And I'm quite happy with those new data types.

4) replace functional dependencies with ndistinct (in the first patch)
-----------------------------------------------------------------------

As the ndistinct coeffients are simpler than functional dependencies,
I've decided to use them in the fist patch in the series, which
implements the shared infrastructure. This does not mean throwing away
functional dependencies entirely, just moving them to a later patch.

5) rework of ndistinct coefficients
-----------------------------------------------------------------------

The ndistinct coefficients were also significantly reworked. Instead of
computing and storing the value for the exact combination of attributes,
the new version computes ndistinct for all combinations of attributes.

So for example with CREATE STATISTICS x ON (a,b,c) the old patch only
computed ndistinct on (a,b,c), while the new patch computes ndistinct on
{(a,b,c), (a,b), (a,c), (b,c)}. This makes it way more powerful.

The first patch (0002) only uses this in estimate_num_groups to improve
GROUP BY estimates. A later patch (0007) shows how it might be used for
selectivity estimation, but it's a very early WIP at this point.

Also, I'm not sure we should use ndistinct coefficients this way,
because of the "homogenity" assumption, similarly to functional
dependencies. Functional dependencies are used only for selectivity
estimation, so it's quite easy not to use them if they don't work for
that purpose. But ndistinct coefficients are also used for GROUP BY
estimation, where the homogenity assumption is not such a big deal. So I
expect people to add ndistinct, get better GROUP BY estimates but
sometimes worse selectivity estimates - not great, I guess.

But the selectivity estimation using ndistinct coefficients is very
simple right now - in particular it does not use the per-clause
selectivities at all, it simply assumes the whole selectivity is
1/ndistinct for the combination of columns.

Functional dependencies use this formula to combine the selectivities:

P(a,b) = P(a) * [f + (1-f)*P(b)]

so maybe there's something similar for ndistinct coefficients? I mean,
let's say we know ndistinc(a), ndistinct(b), ndistinct(a,b) and P(a)
and P(b). How do we compute P(a,b)?

5) rework functional dependencies
-----------------------------------------------------------------------

Based on Dean's feedback, I've reworked functional dependencies to use
continuous "degree" of validity (instead of true/false behavior,
resulting in sudden changes in behavior).

This significantly reduced the amount of code, because the old patch
tried to identify transitive dependencies (to minimize time and storage
requirements). Switching to continuous degree makes this impossible (or
at least far more complicated), so I've simply ripped all of this out.

This means the statistics will be larger and ANALYZE will take more
time, the differences are fairly small in practice, and the estimation
actually seems to work better.

6) MCV and histogram changes
-----------------------------------------------------------------------

Those statistics types are mostly unchanged, except for a few minor bug
fixes and removal of remove max_mcv_items and max_buckets options.

Those options were meant to allow users to limit the size of the
statistics, but the implementation was ignoring them so far. So I've
ripped them out, and if needed we may reintroduce them later.

7) no more (elaborate) combinations of statistics
-----------------------------------------------------------------------

I've ripped out the patch that combined multiple statistics in very
elaborate way - it was overly complex, possibly wrong, but most
importantly it distracted people from the preceding patches. So I've
ripped this out, and instead replaced that with a very simple approach
that allows using multiple statistics on different subsets if the clause
list. So for example

WHERE (a=1) AND (b=1) AND (c=1) AND (d=1)

may benefit from two statistics, one on (a,b) and second on (c,d). It's
very simple approach, but it does the trick for many cases and is better
than "single statistics" limitation.

The 0008 patch is actually very simple, essentially adding just a loop
into the code blocks, so I think it's quite likely this will get merged
into the preceding patches.

8) reduce table sizes used in regression tests
-----------------------------------------------------------------------

Some of the regression tests used quite large tables (with up to 1M
rows), which had two issues - long runtimes and unstability (because the
ANALYZE sample is only 30k rows, so there were sometimes small changes
due to picking a different sample). I've limited the table sizes to 30k
rows.

8) open / unsolved questions
-----------------------------------------------------------------------

The main open question is still whether clausesel.c is the best place to
do all the heavy lifting (particularly matching clauses and statistics,
and deciding which statistics to use). I suspect some of that should be
done elsewhere (earlier in the planning), enriching the query tree
somehow. Then clausesel.c would "only" compute the estimates, and it
would also allow showing the info in EXPLAIN.

I'm not particularly happy with the changes in claselist_selectivity
look right now - there are three almost identical blocks, so this would
deserve some refactoring. But I'd like to get some feedback first.

regards

[1]
https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=VbJ78VQFA@mail.gmail.com

[2]
https://www.postgresql.org/message-id/1c7e4e63-769b-f8ce-f245-85ef4f59fcba%40iki.fi

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
multivariate-stats-v20.tgz application/x-compressed-tar 140.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-10-29 19:23:45 Re: Speed of user-defined aggregates using array_append as transfn
Previous Message Michael Paquier 2016-10-29 13:12:27 Re: Streaming basebackups vs pg_stat_tmp