Multi-column distinctness.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Multi-column distinctness.
Date: 2015-08-28 08:33:34
Message-ID: 20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, this patch enables planner to be couscious of inter-column
correlation.

Sometimes two or more columns in a table has some correlation
which brings underestimate, which leads to wrong join method and
ends with slow execution.

Tomas Vondra is now working on heavily-equipped multivariate
statistics for OLAP usage. In contrast, this is a lightly
implemented solution which calculates only the ratio between a
rows estimated by current method and a actual row number. I think
this doesn't conflict with his work except the grammar part.

This would apply fewer cases but I suppose still in many cases
the correlated colums would be in simple proportional
relationship, so this can help the cases. The previous discussion
is

https://wiki.postgresql.org/wiki/Cross_Columns_Stats
http://www.postgresql.org/message-id/4D0BA4D5.8080707@fuzzy.cz

This patch is covers only the type A (Discrete values and
equality conditions) but I think it is usable in many cases seen
in the field. So I'd like to repropose for the latest version of
PostgreSQL.

- design outline

Provide new system catalog pg_mvcoefficient to store the
information required to do this.

A user can instruct planner to correct the wrong estimation
caused by inter-column correlation by registering the columns in
pg_mvcoefficient using new DDL ALTER TABLE... ADD STATISTICS.

Analyzing of the target table also stores the 'multivariate
coefficient' calculated by using the following formula into
pg_mvcoefficient.

mv_coef(c1, c2, ..) =
ndistinct(c1 * c2 * ...) / (ndistinct(c1) * ndistinct(c2) * ...)


In clauselist_selectivity, planner corrects the estimate if
given clauselist has equivalence-classes-compatible clauses for
required columns at the top-level.

- Example

The attached perl script gentbl.pl generates test data resembles
some tables in DBT-3 benchmark.

> $ perl gentbl.pl | psql postgres

> =# EXPLAIN ANALYZE SELECT * FROM t1 WHERE a = 1 AND b = 2501;
...
> Seq Scan on t1 (cost=0.00..653.00 rows=1 width=12) (actual time=0.021..6.348 rows=8 loops=1)

This doesn't have no harm but in a join case,

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
> Hash Join (cost=122.00..855.32 rows=32 width=24)
> (actual time=2.009..29.208 rows=32000 loops=1)

The correlation between a and b makes the estimate too
small. Then register correlation setting.

> =# ALTER TABLE t1 ADD STATISTICS (mvndistinct) ON (a, b);
> =# ANALYZE t1;

Then the estimate will be corrected.

> =# EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;
> Hash Join (cost=122.00..855.32 rows=32000 width=24)
> (actual time=1.907..29.025 rows=32000 loops=1)

- Known limitations

The coefficient calculated by this feature is applicble only for
conjunctions of simple var-exprs on merge-joinable operator.

The coefficient is applied regardless of whether the base
estimate has been calculated using MCV, so estimates for
non-join cases on the columns which has MCV can rather become
inaccurate.

Uniform correlation is assumed so some extent of correlation
ununiformity would lead to wrong estimation.

This patch set doesn't contain any document yet.

- Patche Files

This patch consists of the following files.

- 0001-New-system-catalog-pg_mvcoefficient.patch
Adds new system catalog pg_mvcoefficient.

- 0002-Analyze-part-for-multivariate-coefficient.patch
Analyze part of multivariate coefficient.

- 0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch
Planner part to make it use the multivariate coefficient.

- 0004-Syntactical-part-of-multivariate-coefficient.patch
Add new DDL to define mv coefficient columns.

The four files above are essential. The two following files are
experimental patch to add mvcattrs to index columns. One of them
adds a new opclass for int2vector of btree but it would be
overkill.

- 0005-Add-btree-operator-class-for-int2vector.patch
Add btree operator class for int2vector.

- 0006-Use-modified-index-of-pg_mvcoefficient.patch
Use modified index of pg_mvcoefficient.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-New-system-catalog-pg_mvcoefficient.patch text/x-patch 6.1 KB
0002-Analyze-part-for-multivariate-coefficient.patch text/x-patch 12.9 KB
0003-Make-use-of-multivariate-coefficeient-in-estimation-.patch text/x-patch 4.1 KB
0004-Syntactical-part-of-multivariate-coefficient.patch text/x-patch 13.8 KB
0005-Add-btree-operator-class-for-int2vector.patch text/x-patch 11.2 KB
0006-Use-modified-index-of-pg_mvcoefficient.patch text/x-patch 5.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2015-08-28 08:33:44 Multiline-statement and multi-statement for pgbench custom script.
Previous Message Simon Riggs 2015-08-28 07:16:20 Re: 9.5 feature count