WIP: multivariate statistics / proof of concept

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: WIP: multivariate statistics / proof of concept
Date: 2014-10-12 22:00:53
Message-ID: 543AFA15.4080608@fuzzy.cz
Lists: pgsql-hackers

Hi,

attached is a WIP patch implementing multivariate statistics. The code
certainly is not "ready" - parts of it look as if written by a rogue
chimp who got bored of attempts to type the complete works of William
Shakespeare, and decided to try something different.

I also cut some corners to make it work, and those limitations need to
be fixed before the eventual commit (those are not difficult problems,
but were not necessary for a proof-of-concept patch).

However, it seems to work well enough at this point to gather some
useful feedback. So here we go.

I expect to be busy over the next two weeks because of travel, so I
apologize in advance for somewhat delayed responses. If you happen to
attend pgconf.eu next week (Oct 20-24), we can of course discuss this
patch in person.

Goals and basics
----------------

The goal of this patch is to allow users to define multivariate
statistics (i.e. statistics on multiple columns), and to improve
estimation for queries where the columns are correlated.

Take for example a table like this:

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT i/10000, i/10000, i/10000
FROM generate_series(1,1000000) s(i);
ANALYZE test;

and do a query like this:

SELECT * FROM test WHERE (a = 10) AND (b = 10) AND (c = 10);

which is estimated like this:

QUERY PLAN
---------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=1 width=12)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Planning time: 0.142 ms
(3 rows)

The query of course returns 10,000 rows, but the planner assumes the
columns are independent and thus multiplies the per-column
selectivities. At 1/100 per column that gives 1/1,000,000 in total,
which is 1 row.
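For illustration, the arithmetic behind that estimate can be sketched in
a few lines of Python (this is just the back-of-the-envelope math, not
the planner code):

```python
# Selectivity arithmetic behind the rows=1 estimate above.
total_rows = 1_000_000
distinct_per_column = 100      # each column holds values 0..99

# Equality clause selectivity per column is 1/100; the independence
# assumption multiplies the three selectivities together:
estimated_rows = total_rows / distinct_per_column ** 3
print(estimated_rows)          # -> 1.0, i.e. the "rows=1" estimate

# But the columns are perfectly correlated, so the true selectivity
# of the whole filter is just 1/100:
actual_rows = total_rows / distinct_per_column
print(actual_rows)             # -> 10000.0
```

The estimate is off by four orders of magnitude, which is exactly the
kind of error that flips the planner into a bad plan.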

This example is of course somewhat artificial, but the problem is far
from uncommon, especially in denormalized datasets (e.g. star schemas).
If you ever got an index scan instead of a sequential scan due to a poor
estimate, resulting in a query running for hours instead of seconds, you
know the pain.

The patch allows you to do this:

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

which then results in this estimate:

QUERY PLAN
------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=9667 width=12)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Planning time: 0.110 ms
(3 rows)

This however is not free - both building such statistics (during
ANALYZE) and using it (during planning) costs some cycles. Even if we
optimize the hell out of it, it won't be entirely free.

One of the design goals of this patch is not to make ANALYZE or
planning more expensive unless you actually add such statistics.

Those who add such statistics have presumably decided that the price is
worth the improved estimates and the lower risk of inefficient plans. If
planning takes a few more milliseconds, that's probably worth it when
the alternative is queries running for minutes or hours because of
misestimates.

It also does not guarantee the estimates will always be better. There
will be misestimates, although generally in the other direction (the
independence assumption usually leads to underestimates, while this may
lead to overestimates). However, based on my experience from writing the
patch, I believe it's possible to reasonably limit the extent of such
errors (just as with single-column histograms, it's related to the
bucket size).

Of course, there will be cases when the old approach is lucky by
accident - there's not much we can do to beat luck. And we can't rely on
it either.

Design overview
---------------

The patch adds a new system catalog, called pg_mv_statistic, which is
used to keep track of requested statistics. There's also a pg_mv_stats
view, showing some basic info about the stats (not all the data).

There are three kinds of statistics:

- list of most common combinations of values (MCV list)
- multi-dimensional histogram
- associative rules

The first two are extensions of the single-column stats we already have.
The MCV list is a trivial extension to multiple dimensions, simply
tracking combinations of values and their frequencies. The histogram is
more complex - the structure itself is quite simple (multi-dimensional
rectangles), but there are many ways to build it. Even the current naive
implementation seems to work quite well, though.
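To make the MCV idea concrete, here is a minimal Python sketch of how a
multivariate MCV list can answer an equality query (the names and
structure are illustrative only; the actual patch is C code in the
backend):

```python
# Sketch of selectivity estimation from a multivariate MCV list.
# The list maps combinations of column values to their observed
# frequency (fraction of sampled rows).
mcv_list = {
    (10, 10, 10): 0.01,   # (a,b,c)=(10,10,10) seen in 1% of rows
    (20, 20, 20): 0.01,
    # ... one entry per tracked combination
}

def mcv_selectivity(values, mcv):
    """Selectivity of (a = v1 AND b = v2 AND c = v3).

    If the combination is tracked, we get its frequency directly,
    with no independence assumption; otherwise a real implementation
    would fall through to the histogram (not shown here).
    """
    return mcv.get(values, 0.0)

print(mcv_selectivity((10, 10, 10), mcv_list) * 1_000_000)  # -> 10000.0
```

The point is that the frequency of the whole combination is stored, so
correlation between the columns is captured for free.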

The last kind (associative rules) is an attempt to track "implications"
between columns. It is however an experiment and it's not really used in
the patch so I'll ignore it for now.

I'm not going to explain all the implementation details here - if you
want to learn more, the best way is probably by reading the changes in
those files (probably in this order):

src/include/utils/mvstats.h
src/backend/commands/analyze.c
src/backend/optimizer/path/clausesel.c

I tried to explain the ideas thoroughly in the comments, along with a
lot of TODO/FIXME items related to limitations, explained in the next
section.

Limitations
-----------

As I mentioned, the current patch has a number of practical limitations,
most importantly:

(a) only data types passed by value (no varlena types)
(b) only data types with sort (to be able to build histogram)
(c) no NULL values supported
(d) not handling DROP COLUMN or DROP TABLE and such
(e) limited to stats on 8 columns (max)
(f) optimizer uses single stats per table
(g) limited list of compatible WHERE clauses
(h) incomplete ADD STATISTICS syntax

The first three are really shortcuts taken to get a working patch, and
fixing them should not be difficult.

The limited number of columns is really just a sanity check. It's
possible to increase it, but I doubt stats on more columns will be
practical because of excessive size or poor accuracy.

A better approach is to support combining multiple statistics defined
on various subsets of the columns. This is not implemented at the
moment, but it's certainly on the roadmap. Currently the "smallest"
statistics covering the most columns is selected.
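That selection heuristic can be sketched roughly like this (a Python
illustration with made-up names, not the actual code): among the
available stats, prefer the one covering the most clause columns,
breaking ties by picking the smallest:

```python
# Sketch: pick the "smallest" statistics covering the most columns
# referenced by the query's clauses. Names are illustrative only.
stats = [
    {"columns": {"a", "b"},      "size": 10_000},
    {"columns": {"a", "b", "c"}, "size": 50_000},
    {"columns": {"a", "b", "c"}, "size": 30_000},
]
clause_columns = {"a", "b", "c"}

def choose_stats(stats, clause_columns):
    best, best_key = None, None
    for s in stats:
        # primary: number of clause columns covered (more is better);
        # secondary: stats size (smaller is better, hence negated)
        key = (len(s["columns"] & clause_columns), -s["size"])
        if best_key is None or key > best_key:
            best, best_key = s, key
    return best

print(choose_stats(stats, clause_columns)["size"])  # -> 30000
```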

Regarding the compatible WHERE clauses, the patch currently handles
conditions of the form

column OPERATOR constant

where the operator is one of the comparison operators (=, <, >, <=, >=).
In the future it's possible to add support for more conditions, e.g.
"column IS NULL" or "column OPERATOR column".
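To illustrate how such a range clause can be estimated from a
multi-dimensional histogram, here is a rough Python sketch (illustrative
names and layout; like single-column histograms, it assumes values are
spread uniformly within each bucket):

```python
# Sketch: estimating selectivity of (a < const) from a 2-D histogram.
# Each bucket is ((min_a, max_a), (min_b, max_b), frequency); the
# frequencies of all buckets sum to 1.0.
buckets = [
    ((0, 10),  (0, 10),  0.5),
    ((10, 20), (10, 20), 0.5),
]

def selectivity_lt(buckets, dim, const):
    """Selectivity of 'column < const' along dimension dim."""
    sel = 0.0
    for bucket in buckets:
        lo, hi = bucket[dim]
        freq = bucket[2]
        if hi <= const:
            # bucket lies entirely below the constant
            sel += freq
        elif lo < const:
            # bucket overlaps: count the matching fraction,
            # assuming a uniform distribution within the bucket
            sel += freq * (const - lo) / (hi - lo)
    return sel

print(selectivity_lt(buckets, 0, 5))   # -> 0.25
```

Equality clauses work similarly, except only buckets whose rectangle
contains the constant contribute.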

The last point is really just "unfinished implementation" - the syntax I
propose is this:

ALTER TABLE ... ADD STATISTICS (options) ON (columns)

where the options influence the MCV list and histogram size, etc. The
options are recognized and may give you an idea of what it might do, but
it's not really used at the moment (except for storing in the
pg_mv_statistic catalog).

Examples
--------

Let's see a few examples of how to define the stats, and what difference
in estimates it makes:

CREATE TABLE test (a INT, b INT, c INT);

-- same value in all columns
INSERT INTO test SELECT mod(i,100), mod(i,100), mod(i,100)
FROM generate_series(1,1000000) s(i);

ANALYZE test;

=============== no multivariate stats ============================

SELECT * FROM test WHERE a = 10 AND b = 10;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=101 width=12)
(actual time=0.007..60.902 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10))
Rows Removed by Filter: 990000
Planning time: 0.119 ms
Execution time: 61.164 ms
(5 rows)

SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=1 width=12)
(actual time=0.010..56.780 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Rows Removed by Filter: 990000
Planning time: 0.061 ms
Execution time: 56.994 ms
(5 rows)

=============== with multivariate stats ===========================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

SELECT * FROM test WHERE a = 10 AND b = 10;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=10767 width=12)
(actual time=0.007..58.981 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10))
Rows Removed by Filter: 990000
Planning time: 0.114 ms
Execution time: 59.214 ms
(5 rows)

SELECT * FROM test WHERE a = 10 AND b = 10 AND c = 10;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=10767 width=12)
(actual time=0.008..61.838 rows=10000 loops=1)
Filter: ((a = 10) AND (b = 10) AND (c = 10))
Rows Removed by Filter: 990000
Planning time: 0.088 ms
Execution time: 62.057 ms
(5 rows)

OK, that was a rather significant improvement, but it's also a trivial
dataset. Let's try something more complicated - the following table has
correlated columns with distributions skewed towards 0.

CREATE TABLE test (a INT, b INT, c INT);
INSERT INTO test SELECT r*MOD(i,50),
pow(r,2)*MOD(i,100),
pow(r,4)*MOD(i,500)
FROM (SELECT random() AS r, i
FROM generate_series(1,1000000) s(i)) foo;
ANALYZE test;

SELECT * FROM test WHERE a = 0 AND b = 0;

=============== no multivariate stats ============================

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=9024 width=12)
(actual time=0.007..62.969 rows=49503 loops=1)
Filter: ((a = 0) AND (b = 0))
Rows Removed by Filter: 950497
Planning time: 0.057 ms
Execution time: 64.098 ms
(5 rows)

SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=2126 width=12)
(actual time=0.008..63.862 rows=40770 loops=1)
Filter: ((a = 0) AND (b = 0) AND (c = 0))
Rows Removed by Filter: 959230
Planning time: 0.060 ms
Execution time: 64.794 ms
(5 rows)

=============== with multivariate stats ============================

ALTER TABLE test ADD STATISTICS ON (a, b, c);
ANALYZE test;

db=> SELECT * FROM pg_mv_stats;
schemaname | public
tablename | test
attnums | 1 2 3
mcvbytes | 25904
mcvinfo | nitems=809
histbytes | 568240
histinfo | nbuckets=13772

SELECT * FROM test WHERE a = 0 AND b = 0;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..20406.00 rows=47717 width=12)
(actual time=0.007..61.782 rows=49503 loops=1)
Filter: ((a = 0) AND (b = 0))
Rows Removed by Filter: 950497
Planning time: 3.181 ms
Execution time: 62.859 ms
(5 rows)

SELECT * FROM test WHERE a = 0 AND b = 0 AND c = 0;

QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test (cost=0.00..22906.00 rows=40567 width=12)
(actual time=0.009..66.685 rows=40770 loops=1)
Filter: ((a = 0) AND (b = 0) AND (c = 0))
Rows Removed by Filter: 959230
Planning time: 0.188 ms
Execution time: 67.593 ms
(5 rows)

regards
Tomas

Attachment Content-Type Size
multivar-stats-v1.patch text/x-diff 139.7 KB
