Re: WIP: multivariate statistics / proof of concept

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tomas(dot)vondra(at)2ndquadrant(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2015-03-20 08:33:07
Message-ID: 20150320.173307.249113775.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Patch 0001 needs changes for OIDs since my patch was
committed. The attached is compatible with current master.

And I tried this like this, and got the following error on
analyze. But unfortunately I don't have enough time to
investigate it now.

postgres=# create table t1 (a int, b int, c int);
insert into t1 (select a/ 10000, a / 10000, a / 10000 from generate_series(0, 99999) a);
postgres=# analyze t1;
ERROR: invalid memory alloc request size 1485176862

regards,

At Sat, 24 Jan 2015 21:21:39 +0100, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote in <54C3FED3(dot)1060600(at)2ndquadrant(dot)com>
> Hi,
>
> attached is an updated version of the multivariate stats patch. This is
> going to be a bit longer mail, so I'll put here a small ToC ;-)
>
> 1) patch split into 4 parts
> 2) where to start / documentation
> 3) state of the code
> 4) main changes/improvements
> 5) remaining limitations
>
> The motivation and design ideas, explained in the first message of this
> thread are still valid. It might be a good idea to read it first:
>
> http://www.postgresql.org/message-id/flat/543AFA15(dot)4080608(at)fuzzy(dot)cz
>
> BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
> into the patch in person, or discuss the patch in general.
>
>
> 1) Patch split into 4 parts
> ---------------------------
> Firstly, the patch got broken into the following four pieces, to make
> the reviews somewhat easier:
>
> 1) 0001-shared-infrastructure-and-functional-dependencies.patch
>
> - infrastructure, shared by all the kinds of stats added
> in the following patches (catalog, ALTER TABLE, ANALYZE ...)
>
> - implementation of a simple statistics, tracking functional
> dependencies between columns (previously called "associative
> rules", but that's incorrect for several reasons)
>
> - this does not modify the optimizer in any way
> 2) 0002-clause-reduction-using-functional-dependencies.patch
>
> - applies the functional dependencies to optimizer (i.e. considers
> the rules in clauselist_selectivity())
>
> 3) 0003-multivariate-MCV-lists.patch
>
> - multivariate MCV lists (both ANALYZE and optimizer parts)
>
> 4) 0004-multivariate-histograms.patch
>
> - multivariate histograms (both ANALYZE and optimizer parts)
>
>
> You may look at the patches at github here:
>
> https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
>
> The branch is not stable, i.e. I'll rebase / squash / force-push changes
> in the future. (There's also multivariate-stats development branch with
> unsquashed changes, but you don't want to look at that, trust me.)
>
> The patches are not exactly small (being in the 50-100 kB range), but
> that's mostly because of the amount of comments explaining the goals and
> implementation details.
>
>
> 2) Where to start / documentation
> ---------------------------------
> I strived to document all the pieces properly, mostly in the form of
> comments. There's no sgml documentation at this point, which should
> obviously change in the future.
>
> Anyway, I'd suggest reading the first e-mail in this thread, explaining
> the ideas, and then these comments:
>
> 1) functional dependencies (patch 0001)
> - src/backend/utils/mvstats/dependencies.c
>
> 2) MCV lists (patch 0003)
> - src/backend/utils/mvstats/mcv.c
>
> 3) histograms (patch 0004)
> - src/backend/utils/mvstats/mcv.c
>
> - also see clauselist_mv_selectivity_mcvlist() in clausesel.c
> - also see clauselist_mv_selectivity_histogram() in clausesel.c
>
> 4) selectivity estimation (patches 0002-0004)
> - all in src/backend/optimizer/path/clausesel.c
> - clauselist_selectivity() - overview of how the stats are applied
> - clauselist_apply_dependencies() - functional dependencies reduction
> - clauselist_mv_selectivity_mcvlist() - MCV list estimation
> - clauselist_mv_selectivity_histogram() - histogram estimation
>
>
> 3) State of the code
> --------------------
> I've spent a fair amount of time testing the patches, and while I
> believe there are no segfaults or so, I know parts of the code need a
> bit more love.
>
> The part most in need of improvements / comments is probably the code in
> clausesel.c - that seems a bit quirky. Reviews / comments regarding this
> part of the code are very welcome - I'm sure there are many ways to
> improve this part.
>
> There are a few FIXMEs elsewhere (e.g. about memory allocation in the
> (de)serialization code), but those are mostly well-defined issues that I
> know how to address (at least I believe so).
>
>
> 4) Main changes/improvements
> ----------------------------
> There are many significant improvements. The previous patch version was
> in the 'proof of concept' category (missing pieces, knowingly broken in
> some areas), the current patch should 'mostly work'.
>
> The patch fixes two most annoying limitations of the first version:
>
> (a) support for all data types (not just those passed by value)
> (b) handles NULL values properly
> (c) adds support for IS [NOT] NULL clauses
>
> Aside from that the code was significantly improved, there are proper
> regression tests and plenty of comments explaining the details.
>
>
> 5) Remaining limitations
> ------------------------
>
> (a) limited to stats on 8 columns
>
> This is mostly just a 'safeguard' restriction.
>
> (b) only data types with '<' operator
>
> I don't think this will change anytime soon, because all the
> algorithms for building the stats rely on this. I don't see
> this as a serious limitation though.
>
> (c) not handling DROP COLUMN or DROP TABLE and so on
>
> Currently this is not handled at all (so the regression tests
> do an explicit DELETE from the pg_mv_statistic catalog).
>
> Handling the DROP TABLE won't be difficult, it's similar to the
> current stats. Handling ALTER TABLE ... DROP COLUMN will be much
> more tricky I guess - should we drop all the stats referencing
> that column, or should we just remove it from the stats? Or
> should we keep it and treat it as NULL? Not sure what's the best
> solution.
>
> (d) limited list of compatible WHERE clauses
>
> The initial patch handled only simple operator clauses
>
> (Var op Constant)
>
> where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
> handles IS [NOT] NULL clauses. Adding more clause types should
> not be overly difficult - starting with more traditional
> 'BooleanTest' conditions, or even multi-column conditions
> (Var op Var)
>
> which are difficult to estimate using simple-column stats.
>
> (e) optimizer uses single stats per table
>
> This is still true and I don't think this will change soon. i do
> have some ideas on how to merge multiple stats etc. but it's
> certainly complex stuff, unlikely to happen within this CF. The
> patch makes a lot of sense even without this particular feature,
> because you can create multiple stats, each suitable for different
> queries.
>
> (f) no JOIN conditions
>
> Similarly to the previous point, it's on the TODO but it's not
> going to happen in this CF.
>
>
> kind regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-shared-infrastructure-and-functional-dependencies.patch text/x-patch 59.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-03-20 09:32:06 Re: Using 128-bit integers for sum, avg and statistics aggregates
Previous Message David Rowley 2015-03-20 08:11:25 Re: Performance improvement for joins where outer side is unique