Re: multivariate statistics / patch v6

From: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
To: tomas.vondra@2ndquadrant.com
Cc: jeff.janes@gmail.com, sfrost@snowman.net, pgsql-hackers@postgresql.org
Subject: Re: multivariate statistics / patch v6
Date: 2015-05-13 08:31:59
Message-ID: 20150513.173159.102421446.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello. This may be somewhat out of place, but it is strongly related
to this patch, so I'll propose it here.

This is a proposal for a new feature for this patch, or else a
request for your approval to pursue it as a separate (but very
closely related) project.

===

> Attached is v6 of the multivariate stats, with a number of
> improvements:
...
> 2) fix of pg_proc issues (reported by Jeff)
>
> 3) rebase to current master

Unfortunately, the v6 patch has some system OID conflicts with
recently added OIDs. And, more unfortunately for me, the code for
functional dependencies looks unfinished :)

I mention this because I recently ran into an issue caused by strong
correlation between two columns in the dbt3 benchmark. Two columns
of one table are strongly correlated but not functionally dependent;
they have too many distinct values, and the distribution is quite
uniform, so an MCV list is of no use for the table (and a histogram
is irrelevant to equality conditions). As a result, the planner gets
the row estimates badly wrong, as expected, especially for joins.
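
(As a quick catalog-level check -- using table t1 and columns a and b
from the example further below -- the per-column statistics show
nothing about the cross-column correlation:)

    -- per-column statistics only; n_distinct and most_common_vals say
    -- nothing about how a and b are correlated with each other
    SELECT attname, n_distinct, most_common_vals
      FROM pg_stats
     WHERE tablename = 't1' AND attname IN ('a', 'b');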

I then tried calculating the ratio between the product of the
per-column distinct counts and the distinct count of the column set
as a whole; call it the multivariate coefficient here. It turned out
to be remarkably useful for its small storage footprint, low
computational cost, and simple code.
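
(To make the definition concrete, here is the coefficient computed by
hand for a two-column case, again using table t1 from the example
further below. A value near 1 means the columns are roughly
independent; when one column determines the other, the value equals
the distinct count of the determined column.)

    -- mv coefficient: product of per-column distinct counts, divided
    -- by the distinct count of the column set taken together
    SELECT (count(DISTINCT a)::numeric * count(DISTINCT b))
           / count(DISTINCT (a, b)) AS mv_coefficient
      FROM t1;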

The first attachment is a script that generates problematic tables,
and the second is a patch that makes use of the mv coefficient on
current master. The patch is a very primitive POC, so no syntactical
interface is involved.

For your first example:

> =# create table t (a int, b int, c int);
> =# insert into t (select a/10000, a/10000, a/10000
> from generate_series(0, 999999) a);
> =# analyze t;
> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
> Seq Scan on t (cost=0.00..22906.00 rows=1 width=12)
> (actual time=3.878..250.628 rows=10000 loops=1)
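
(This is the expected result under the independence assumption: each
column has 100 distinct values, so the planner estimates
1000000 * (1/100)^3 = 1 row, a factor of 10000 below the actual
10000 rows.)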

Now, making use of the mv coefficient:

> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
> =# analyze t;
> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
> Seq Scan on t (cost=0.00..22906.00 rows=9221 width=12)
> (actual time=3.740..242.330 rows=10000 loops=1)

The row estimate is greatly improved.
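
(This agrees with the definition above: the coefficient for (a, b, c)
is (100 * 100 * 100) / 100 = 10000, scaling the naive 1-row estimate
up to about 10000; the reported 9221 presumably differs only because
of ANALYZE sampling.)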

Now, my own example:

> $ perl gentbl.pl 10000 | psql postgres
> $ psql postgres
> =# explain analyze select * from t1 where a = 1 and b = 2501;
> Seq Scan on t1 (cost=0.00..6216.00 rows=1 width=8)
> (actual time=0.030..66.005 rows=8 loops=1)
>
> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
> Hash Join (cost=1177.00..11393.76 rows=76 width=16)
> (actual time=29.811..322.271 rows=320000 loops=1)

The join estimate is quite bad.
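
(The join suffers for the same reason: the selectivities of
t1.a = t2.a and t1.b = t2.b are multiplied as if the columns were
independent, so the underestimate compounds across the two clauses.)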

> =# insert into pg_mvcoefficient values ('t1'::regclass, 1, 2, 0, 0);
> =# analyze t1;
> =# explain analyze select * from t1 where a = 1 and b = 2501;
> Seq Scan on t1 (cost=0.00..6216.00 rows=8 width=8)
> (actual time=0.032..104.144 rows=8 loops=1)
>
> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
> Hash Join (cost=1177.00..11393.76 rows=305652 width=16)
> (actual time=40.642..325.679 rows=320000 loops=1)

It gives almost correct estimates.
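
(The corrected join estimate of 305652 rows is within about 5% of the
actual 320000, where the naive estimate of 76 rows was a roughly
4000x underestimate.)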

I think the results above show that the multivariate coefficient
significantly improves estimates when correlated columns are
involved.

Would you consider incorporating this into your patch? Otherwise, if
you don't mind, I will pursue it as a project separate from yours. I
suppose there would be no conflict except in the user interface, but
in the end the two would need some consolidation work.

regards,

> 1) fix of the contrib compile-time errors (reported by Jeff)
>
> 2) fix of pg_proc issues (reported by Jeff)
>
> 3) rebase to current master
>
> 4) fix a bunch of issues in the previous patches, due to referencing
> some parts too early (e.g. histograms in the first patch, etc.)
>
> 5) remove the explicit DELETEs from pg_mv_statistic (in the regression
> tests), this is now handled automatically by DROP TABLE etc.
>
> 6) number of performance optimizations in selectivity estimations:
>
> (a) minimize calls to get_oprrest, significantly reducing
> syscache calls
>
> (b) significant reduction of palloc overhead in deserialization of
> MCV lists and histograms
>
> (c) use more compact serialized representation of MCV lists and
> histograms, often removing ~50% of the size
>
> (d) use histograms with limited deserialization, which also allows
> caching function calls
>
> (e) modified histogram bucket partitioning, resulting in more even
> bucket distribution (i.e. producing buckets with more equal
> density and about equal size of each dimension)
>
> 7) add functions for listing MCV list items and histogram buckets:
>
> - pg_mv_mcvlist_items(oid)
> - pg_mv_histogram_buckets(oid, type)
>
> This is quite useful when analyzing the MCV lists / histograms.
>
> 8) improved support for OR clauses
>
> 9) allow calling pull_varnos() on expression trees containing
> RestrictInfo nodes (not sure if this is the right fix, it's being
> discussed in another thread)

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment                 Content-Type  Size
unknown_filename           text/plain    650 bytes
mvcoef-poc-20150513.patch  text/x-patch  19.1 KB
