PoC/WIP: Extended statistics on expressions

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PoC/WIP: Extended statistics on expressions
Date: 2020-11-16 13:49:42
Message-ID: ad7891d2-e90c-b446-9fe2-7419143847d7@enterprisedb.com

Hi,

Attached is a PoC/WIP patch adding support for extended statistics on
expressions. This is by no means "ready" - most of the stuff works, but
often in a rather hackish way. I certainly don't expect this to pass
regression tests, for example.

There's an example demonstrating how this works for two queries at the
end of this message. Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

Fairly straightforward, I think. I'm sure the logic deciding which
expressions are allowed is not 100% correct (e.g. volatile functions
etc.), but that's a detail we can deal with later.
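For reference, the syntax exercised by the example at the end of this
message looks like this (the mixed column/expression variant is my
illustration of what the grammar should presumably also accept, not
something the example below actually runs):

```sql
-- statistics on two expressions (as used in the example below)
CREATE STATISTICS s1 (ndistinct) ON mod(a, 3), mod(a, 7) FROM t;

-- mixing plain columns and expressions should presumably parse too
CREATE STATISTICS s3 (mcv) ON a, mod(a, 7) FROM t;
```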

2) store the expressions in pg_statistic_ext catalog

I ended up adding a separate column, similar to indexprs, except that
the order of columns/expressions does not matter, so we don't need to
bother with storing 0s in stxkeys - we simply consider expressions to
be "after" all the simple columns.
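Assuming the new column mirrors indexprs (I'll call it stxexprs here,
which is my guess at the naming, not something confirmed by the patch
description), the stored expressions could be inspected with something
like:

```sql
-- stxexprs is an assumed column name, by analogy with pg_index.indexprs
SELECT stxname, stxkeys,
       pg_get_expr(stxexprs, stxrelid) AS expressions
  FROM pg_statistic_ext
 WHERE stxrelid = 't'::regclass;
```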

3) build statistics

This works too, for all three types of statistics we have (mcv,
dependencies and ndistinct), although the code changes are often very
hackish "to make it work".

The main challenge here was how to represent the expressions in the
statistics - e.g. in ndistinct, which tracks ndistinct estimates for
combinations of attributes, and so far we used attnums for that. I
decided the easiest way is to keep doing that, but to offset the
expressions by MaxHeapAttributeNumber. That seems to work, but maybe
there's a better way.

4) apply the statistics

This is the hard part, really, and the exact state of the support
depends on the type of statistics.

For ndistinct coefficients, it generally works. I'm sure there may be
bugs in estimate_num_groups, etc. but in principle it works.

For MCV lists, it generally works too - you can define statistics on
the expressions and the estimates should improve. The main downside
here is that it requires at least two expressions, otherwise we can't
build/apply the extended statistics. So for example

SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

may be estimated "correctly", but once you drop either of the
conditions it gets much worse, as we don't have stats for the
individual expressions. That's rather annoying - it does not break the
extended MCV, but the behavior will certainly cause confusion.

For functional dependencies, the estimation does not work yet. Also,
the missing per-column statistics have a bigger impact than on MCV,
because while MCV lists can work fine without them, the dependencies
heavily rely on the per-column estimates. We only apply "corrections"
based on the dependency degree, so we still need (good) per-column
estimates, which does not quite work with expressions.

Of course, the lack of per-expression statistics may be somewhat
fixed by adding indexes on expressions, but that's kinda expensive.
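For completeness, the expression-index workaround would look like this
(the index name is illustrative). ANALYZE then collects regular
per-expression statistics for the indexed expression, at the cost of
building and maintaining the index:

```sql
-- an expression index makes ANALYZE collect per-expression statistics
CREATE INDEX t_mod_idx ON t (mod(a, 100));
ANALYZE t;
```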

Now, a simple example demonstrating how this improves estimates - let's
create a table with 1M rows, and run queries with mod() expressions on
it. It might be date_trunc() or something similar; that'd work too.

table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;

poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                            QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
Planning Time: 0.329 ms
Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                    QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.277 ms
Execution Time: 502.803 ms
(6 rows)

improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t
group by 1, 2;
                                    QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
Planning Time: 0.135 ms
Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0
and mod(a,7) = 0;
                                 QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
Planning Time: 0.702 ms
Execution Time: 152.280 ms
(5 rows)

Clearly, estimates for both queries are significantly improved. Of
course, this example is kinda artificial/simplistic.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Support-for-extended-statistics-on-expressi-20201116.patch text/x-patch 150.5 KB
