Re: WIP: multivariate statistics / proof of concept

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2015-01-24 20:21:39
Message-ID: 54C3FED3.1060600@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

attached is an updated version of the multivariate stats patch. This is
going to be a bit longer mail, so I'll put here a small ToC ;-)

1) patch split into 4 parts
2) where to start / documentation
3) state of the code
4) main changes/improvements
5) remaining limitations

The motivation and design ideas, explained in the first message of this
thread are still valid. It might be a good idea to read it first:

http://www.postgresql.org/message-id/flat/543AFA15(dot)4080608(at)fuzzy(dot)cz

BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
into the patch in person, or discuss the patch in general.

1) Patch split into 4 parts
---------------------------
Firstly, the patch got broken into the following four pieces, to make
the reviews somewhat easier:

1) 0001-shared-infrastructure-and-functional-dependencies.patch

- infrastructure, shared by all the kinds of stats added
in the following patches (catalog, ALTER TABLE, ANALYZE ...)

- implementation of a simple statistics, tracking functional
dependencies between columns (previously called "associative
rules", but that's incorrect for several reasons)

- this does not modify the optimizer in any way

2) 0002-clause-reduction-using-functional-dependencies.patch

- applies the functional dependencies to optimizer (i.e. considers
the rules in clauselist_selectivity())

3) 0003-multivariate-MCV-lists.patch

- multivariate MCV lists (both ANALYZE and optimizer parts)

4) 0004-multivariate-histograms.patch

- multivariate histograms (both ANALYZE and optimizer parts)

You may look at the patches at github here:

https://github.com/tvondra/postgres/tree/multivariate-stats-squashed

The branch is not stable, i.e. I'll rebase / squash / force-push changes
in the future. (There's also multivariate-stats development branch with
unsquashed changes, but you don't want to look at that, trust me.)

The patches are not exactly small (being in the 50-100 kB range), but
that's mostly because of the amount of comments explaining the goals and
implementation details.

2) Where to start / documentation
---------------------------------
I strived to document all the pieces properly, mostly in the form of
comments. There's no sgml documentation at this point, which should
obviously change in the future.

Anyway, I'd suggest reading the first e-mail in this thread, explaining
the ideas, and then these comments:

1) functional dependencies (patch 0001)
- src/backend/utils/mvstats/dependencies.c

2) MCV lists (patch 0003)
- src/backend/utils/mvstats/mcv.c

3) histograms (patch 0004)
- src/backend/utils/mvstats/mcv.c

- also see clauselist_mv_selectivity_mcvlist() in clausesel.c
- also see clauselist_mv_selectivity_histogram() in clausesel.c

4) selectivity estimation (patches 0002-0004)
- all in src/backend/optimizer/path/clausesel.c
- clauselist_selectivity() - overview of how the stats are applied
- clauselist_apply_dependencies() - functional dependencies reduction
- clauselist_mv_selectivity_mcvlist() - MCV list estimation
- clauselist_mv_selectivity_histogram() - histogram estimation

3) State of the code
--------------------
I've spent a fair amount of time testing the patches, and while I
believe there are no segfaults or so, I know parts of the code need a
bit more love.

The part most in need of improvements / comments is probably the code in
clausesel.c - that seems a bit quirky. Reviews / comments regarding this
part of the code are very welcome - I'm sure there are many ways to
improve this part.

There are a few FIXMEs elsewhere (e.g. about memory allocation in the
(de)serialization code), but those are mostly well-defined issues that I
know how to address (at least I believe so).

4) Main changes/improvements
----------------------------
There are many significant improvements. The previous patch version was
in the 'proof of concept' category (missing pieces, knowingly broken in
some areas), the current patch should 'mostly work'.

The patch fixes two most annoying limitations of the first version:

(a) support for all data types (not just those passed by value)
(b) handles NULL values properly
(c) adds support for IS [NOT] NULL clauses

Aside from that the code was significantly improved, there are proper
regression tests and plenty of comments explaining the details.

5) Remaining limitations
------------------------

(a) limited to stats on 8 columns

This is mostly just a 'safeguard' restriction.

(b) only data types with '<' operator

I don't think this will change anytime soon, because all the
algorithms for building the stats rely on this. I don't see
this as a serious limitation though.

(c) not handling DROP COLUMN or DROP TABLE and so on

Currently this is not handled at all (so the regression tests
do an explicit DELETE from the pg_mv_statistic catalog).

Handling the DROP TABLE won't be difficult, it's similar to the
current stats. Handling ALTER TABLE ... DROP COLUMN will be much
more tricky I guess - should we drop all the stats referencing
that column, or should we just remove it from the stats? Or
should we keep it and treat it as NULL? Not sure what's the best
solution.

(d) limited list of compatible WHERE clauses

The initial patch handled only simple operator clauses

(Var op Constant)

where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
handles IS [NOT] NULL clauses. Adding more clause types should
not be overly difficult - starting with more traditional
'BooleanTest' conditions, or even multi-column conditions
(Var op Var)

which are difficult to estimate using simple-column stats.

(e) optimizer uses single stats per table

This is still true and I don't think this will change soon. i do
have some ideas on how to merge multiple stats etc. but it's
certainly complex stuff, unlikely to happen within this CF. The
patch makes a lot of sense even without this particular feature,
because you can create multiple stats, each suitable for different
queries.

(f) no JOIN conditions

Similarly to the previous point, it's on the TODO but it's not
going to happen in this CF.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-shared-infrastructure-and-functional-dependencies.patch text/x-diff 59.8 KB
0002-clause-reduction-using-functional-dependencies.patch text/x-diff 53.4 KB
0003-multivariate-MCV-lists.patch text/x-diff 100.6 KB
0004-multivariate-histograms.patch text/x-diff 114.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-01-24 21:31:14 Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers
Previous Message Tom Lane 2015-01-24 16:59:37 Re: numeric access out of bounds