Re: multivariate statistics (v19)

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, David Steele <david(at)pgmasters(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-10-11 03:39:23
Message-ID: 277d9678-7a35-a746-0eb5-41d4bcd4ef55@2ndquadrant.com
Lists: pgsql-hackers

Hi everyone,

thanks for the reviews. Let me summarize the feedback so far, and
outline my plans for the next patch version, which I'd like to submit
for CF 2016-11.

1) syntax changes

I agree with the changes proposed by Dean, although only a subset of the
syntax is going to be supported until we add support for either join or
partial statistics. So something like this:

CREATE STATISTICS name
[ WITH (options) ]
ON (column1, column2 [, ...])
FROM table

That should not be a difficult change.
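
For illustration, with this syntax a statistics object on two
correlated columns might be created like this (the table, columns and
option names here are purely hypothetical, as the exact set of options
is still to be decided):

  CREATE STATISTICS addr_stats
    WITH (dependencies, ndistinct)
    ON (city, zip_code)
    FROM addresses;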

2) catalog names

I'm not sure what the best names are, so I'm fine with whatever the
consensus is.

That being said, I'm not sure I like extending the catalog to also
support non-multivariate statistics (for example, statistics on
expressions). While that would clearly be a useful feature, it seems
like a slightly different use case, and perhaps a separate catalog
would be better. So maybe pg_statistic_ext is not the best name.

3) special data type(s) to store statistics

I agree using an opaque bytea value is not very nice. I see Heikki
proposed using something like pg_node_tree, and maybe storing all the
statistics in a single value.

I assume the pg_node_tree was meant only as an inspiration for how to
build a pseudo-type on top of a varlena value. I agree that's a good
idea, and I plan to do something like that - say, adding pg_mcv,
pg_histogram, pg_ndistinct and pg_dependencies data types.

Heikki also mentioned that maybe JSONB would be a good way to store the
statistics. I don't think so - firstly, it only supports a subset of
data types, so we'd be unable to store statistics for some data types
(or we'd have to store them as text, which sucks). Also, there's a fair
amount of smartness in how the statistics are stored (e.g. how the
histogram bucket boundaries are deduplicated, or how the estimation uses
the serialized representation directly). We'd lose all of that when
using JSONB.

Similarly for storing all the statistics in a single value - I see no
reason why keeping the statistics in separate columns would be a bad
idea (after all, that's kinda the point of relational databases). Also,
there are perfectly valid cases when the caller only needs a particular
type of statistic - e.g. when estimating GROUP BY we'll only need the
ndistinct coefficients. Why should we force the caller to fetch and
detoast everything, and throw away probably 99% of that?

So my plan here is to define pseudo-types similar to how pg_node_tree
is defined. That does not seem like a tremendous amount of work.
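
To make this a bit more concrete, the catalog row might then carry one
column per statistic type, along these lines (column names purely
illustrative):

  standistinct     pg_ndistinct      -- ndistinct coefficients
  stadependencies  pg_dependencies   -- functional dependencies
  stamcv           pg_mcv            -- multivariate MCV list
  stahistogram     pg_histogram      -- multivariate histogram

so that e.g. GROUP BY estimation can fetch and detoast standistinct
alone, without touching the other columns.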

4) functional dependencies

Several people mentioned they don't like how functional dependencies are
detected at ANALYZE time, particularly that there's a sudden jump
between 0 and 1. Instead, a continuous "dependency degree" between 0 and
1 was proposed.

I'm fine with that, although it makes "clause reduction" (deciding that
we don't need to estimate one of the clauses at all, as it's implied by
some other clause) impossible. But that's fine - the functional
dependencies will still be much less expensive than the other types of
statistics.
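
Just to sketch how a continuous degree might be used (one possible
formula, certainly not a final decision): with a dependency (a => b) of
degree d, a pair of equality clauses might be estimated as

  P(a = x AND b = y) = P(a = x) * (d + (1 - d) * P(b = y))

so that d = 1 degenerates to full clause reduction (the clause on b is
implied), while d = 0 falls back to the usual independence assumption.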

I'm wondering how this will interact with transitivity, though. IIRC the
current implementation is able to detect transitive dependencies and use
that to reduce storage space etc.

In any case, this significantly complicates the functional dependencies,
which were meant as a trivial type of statistics, mostly to establish
the shared infrastructure. Which brings me to ndistinct.

5) ndistinct

So far, the ndistinct coefficients were lumped at the very end of the
patch, and the statistic was only built but not used for any sort of
estimation. I agree with Dean that perhaps it'd be better to move this
to the very beginning, and use it (rather than functional dependencies)
as the simplest statistic on which to build the infrastructure - an
approach that makes even more sense given the changes to functional
dependencies discussed in the preceding section.
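
For illustration, this is the kind of misestimate the ndistinct
coefficients should fix (a made-up example with two perfectly
correlated columns):

  CREATE TABLE t AS SELECT i/100 AS a, i/100 AS b
    FROM generate_series(1, 10000) s(i);
  ANALYZE t;
  SELECT a, b FROM t GROUP BY a, b;

The planner multiplies the per-column ndistinct estimates (~100 each)
and expects close to 10000 groups, while the actual number is about 100.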

I think it's probably a good idea and I plan to do that, so the patch
series will probably look like this:

* 001 - CREATE STATISTICS infrastructure with ndistinct coefficients
* 002 - use ndistinct coefficients to improve GROUP BY estimates
* 003 - use ndistinct coefficients in clausesel.c (not sure)
* 004 - add functional dependencies (build + clausesel.c)
* 005 - add multivariate MCV (build + clausesel.c)
* 006 - add multivariate histograms (build + clausesel.c)

I'm not sure about using the ndistinct coefficients in clausesel.c to
estimate regular conditions - that's where Kyotaro-san originally
proposed using ndistinct coefficients, but I seem to remember it was
non-trivial to choose the best statistics when other types of stats
were available. But I'll look into that.
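
If I remember the idea correctly, it amounts to estimating a group of
equality clauses, e.g.

  WHERE a = 1 AND b = 2

as 1 / ndistinct(a,b) instead of multiplying
(1 / ndistinct(a)) * (1 / ndistinct(b)), which underestimates the
selectivity when the two columns are correlated. The tricky part is
deciding when this is preferable to an MCV list or histogram built on
the same columns.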

6) combining statistics

I've decided not to re-submit this part of the patch until the basic
functionality gets in. I do think it's a very useful feature (despite
having my doubts about the existing implementation), but it clearly
distracts people.

Instead, the patch will use some simple selection strategy (e.g. using
a single statistic covering most of the conditions), or perhaps
something more advanced (e.g. a set of non-overlapping statistics). But
nothing complicated.
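
For example, given clauses on columns (a, b, c) and statistics on
(a, b) and on (c, d), the simple strategy would pick the statistic on
(a, b), as it covers two of the three clauses, estimate the clauses on
a and b using it, and handle the remaining clause on c with the regular
per-column statistics.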

7) enriching the query plan

Sadly, none of the reviews provides any feedback on how to enrich the
query plan with information about statistics (instead of doing that in
clausesel.c in an ad-hoc, ephemeral manner).

So I'm still a bit stuck on this :-(

8) join statistics

Not directly related to the current patch, but I recommend reading this
paper quantifying the impact of each part of the query optimizer
(estimates, cost model, plan enumeration):

http://www.vldb.org/pvldb/vol9/p204-leis.pdf

The one conclusion I take from it is that we really need to think about
improving join estimates somehow, because they are by far the most
significant source of issues (and the hardest to fix).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
