Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-04 19:07:02
Message-ID: 55982ED6.7000807@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Horiguchi-san!

On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote:
> Hello, I started to work on this patch.
>
>> attached is v7 of the multivariate stats patch. The main improvement
>> is major refactoring of the clausesel.c portion - splitting the
>> awfully long spaghetti-style functions into smaller pieces, making it
>> much more understandable etc.
>
> Thank you, it looks clearer. I have some comment for the brief look
> at this. This patchset is relatively large so I will comment on
> "per-notice" basis.. which means I'll send comment before examining
> the entire of this patchset. Sorry in advance for the desultory
> comments.

Sure. If you run into something that's not clear enough, I'm happy to
explain that (I tried to cover all the important details in the
comments, but it's a large patch, indeed.)

> =======
> General comments:
>
> - You included unnecessary stuffs such like regression.diffs in
> these patches.

Ahhhh :-/ Will fix.

>
> - Now OID 3307 is used by pg_stat_file. I moved
> pg_mv_stats_dependencies_info/show to 3311/3312.

Will fix while rebasing to current master.

>
> - Single-variate stats have a mechanism to inject arbitrary
> values as statistics, that is, get_relation_stats_hook and the
> similar stuffs. I want the similar mechanism for multivariate
> statistics, too.

Fair point, although I'm not sure where should we place the hook, how
exactly should it be defined and how useful that would be in the end.
Can you give an example of how you'd use such hook?

I've never used get_relation_stats_hook, but if I get it right, the
plugins can use the hook to create the stats (for each column), either
from scratch or tweaking the existing stats.

I'm not sure how this should work with multivariate stats, though,
because there can be arbitrary number of stats for a column, and it
really depends on all the clauses (so examine_variable() seems a bit
inappropriate, as it only sees a single variable at a time).

Moreover, with multivariate stats

(a) there may be arbitrary number of stats for a column

(b) only some of the stats end up being used for the estimation

I see two or three possible places for calling such hook:

(a) at the very beginning, after fetching the list of stats

- sees all the existing stats on a table
- may add entirely new stats or tweak the existing ones

(b) after collecting the list of variables compatible with
multivariate stats

- like (a) and additionally knows which columns are interesting
for the query (but only with respect to the existing stats)

(c) after optimization (selection of the right combination if stats)

- like (b), but can't affect the optimization

But I can't really imagine anyone building multivariate stats on the
fly, in the hook.

It's more complicated, though, because the query may call
clauselist_selectivity multiple times, depending on how complex the
WHERE clauses are.

> 0001:
>
> - I also don't think it is right thing for expression_tree_walker
> to recognize RestrictInfo since it is not a part of expression.

Yes. In my working git repo, I've reworked this to use the second
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:

https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?

>
> 0003:
>
> - In clauselist_selectivity, find_stats is uselessly called for
> single clause. This should be called after the clauselist found
> to consist more than one clause.

Ok, will fix.

>
> - Searching vars to be compared with mv-stat columns which
> find_stats does should stop at disjunctions. But this patch
> doesn't behave so and it should be an unwanted behavior. The
> following steps shows that.

Why should it stop at disjunctions? There's nothing wrong with using
multivariate stats to estimate OR-clauses, IMHO.

>
> ====
> =# CREATE TABLE t1 (a int, b int, c int);
> =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
> =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
> =# ANALZYE t1;
> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> Seq Scan on t1 (cost=0.00..230.00 rows=268 width=12)
> ====
> Rows changed unwantedly.

That has nothing to do with OR clauses, but rather with using a type of
statistics that does not fit the data and queries. Histograms are quite
inaccurate for discrete data and equality conditions - in this case the
clauses probably match one bucket, and so we use 1/2 the bucket as an
estimate. There's nothing wrong with that.

So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
QUERY PLAN
-----------------------------------------------------
Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

> It seems not so simple thing as your code assumes.

Maybe, but I don't see what assumption is invalid? I see nothing wrong
with the previous query.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-07-04 22:14:21 Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Previous Message Andrew Dunstan 2015-07-04 17:09:43 Re: PostgreSQL 9.5 Alpha 1 build fail with perl 5.22