Re: Additional improvements to extended statistics

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-11 14:21:02
Message-ID: CAEZATCUic8PwhTnexC+Ux-Z_e5MhWD-8jk=J1MtnVW8TJD+VHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 9 Mar 2020 at 18:19, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:>
> On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
> >
> > P(a,b) = P(a) * [f + (1-f)*P(b)]
> >
> >because it might return a value that is larger that P(b), which
> >obviously should not be possible.
>
> Hmmm, yeah. It took me a while to reproduce this - the trick is in
> "inverting" the dependency on a subset of data, e.g. like this:
>
> create table t (a int, b int);
> insert into t select mod(i,50), mod(i,25)
> from generate_series(1,5000) s(i);
> update t set a = 0 where a < 3;
> create statistics s (dependencies) on a,b from t;
>
> which then does this:
>
> test=# explain select * from t where a = 0;
> QUERY PLAN
> ----------------------------------------------------
> Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
> Filter: (a = 0)
> (2 rows)
>
> test=# explain select * from t where b = 0;
> QUERY PLAN
> ----------------------------------------------------
> Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
> Filter: (b = 0)
> (2 rows)
>
> test=# explain select * from t where a = 0 and b = 0;
> QUERY PLAN
> ----------------------------------------------------
> Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
> Filter: ((a = 0) AND (b = 0))
> (2 rows)
>
> Which I think is the issue you've described ...
>

I think this is also related to the problem that functional dependency
stats don't take into account the fact that the user clauses may not
be compatible with one another. For example:

CREATE TABLE t (a int, b int);
INSERT INTO t
SELECT x/10,x/10 FROM (SELECT generate_series(1,x)
FROM generate_series(1,100) g(x)) AS t(x);
CREATE STATISTICS s (dependencies) ON a,b FROM t;
ANALYSE t;

EXPLAIN SELECT * FROM t WHERE a = 10;

QUERY PLAN
--------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=1 width=8)
Filter: (a = 10)
(2 rows)

EXPLAIN SELECT * FROM t WHERE b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.12 rows=865 width=8)
Filter: (b = 1)
(2 rows)

EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 1;

QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..98.75 rows=865 width=8)
Filter: ((a = 10) AND (b = 1))
(2 rows)

whereas without stats it would estimate 1 row. That kind of
over-estimate could get very bad, so it would be good to find a way to
fix it.

> >We should amend that formula to prevent a result larger than P(b). The
> >obvious way to do that would be to use:
> >
> > P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
> >
> >but actually I think it would be better and more principled to use:
> >
> > P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
> >
> >I.e., for those rows believed to be functionally dependent, we use the
> >minimum probability, and for the rows believed to be independent, we
> >use the product.
> >
>
> Hmmm, yeah. The trouble is that we currently don't really have both
> selectivities in dependencies_clauselist_selectivity :-(
>

I hacked on this a bit, and I think it's possible to apply dependency
stats in a more general way (not necessarily assuming equality
clauses), something like the attached very rough patch.

This approach guarantees that the result of combining a pair of
selectivities with a functional dependency between them gives a
combined selectivity that is never greater than either individual
selectivity.

One regression test fails, but looking at it, that's to be expected --
the test alters the type of a column, causing its univariate stats to
be dropped, so the single-column estimate is reduced, and the new code
refuses to give a higher estimate than the single clause's new
estimate.

> It's also not clear to me how would this work for more than two clauses,
> that are all functionally dependent. Like (a => b => c), for example.
> But I haven't thought about this very much yet.
>

I attempted to solve that by computing a chain of conditional
probabilities. The maths needs checking over (as I said, this is a
very rough patch). In particular, I think it's wrong for cases like (
a->b, a->c ), but I think it's along the right lines.

> >I think that would solve the problem with the example you gave at the
> >end of [2], but I'm not sure if it helps with the general case.
> >
>
> I don't follow. I think the issue with [2] is that we can't really apply
> stats about the array values to queries on individual array elements.
> Can you explain how would the proposed changes deal with this?
>

With this patch, the original estimate of ~900 rows in that example is
restored with functional dependencies, because of the way it utilises
the minimum selectivity of the 2 clauses.

I've not fully thought this through, but I think it might allow
functional dependencies to be applied to a wider range of operators.

Regards,
Dean

Attachment Content-Type Size
dependencies_clauselist_selectivity.patch text/x-patch 10.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-11 14:30:53 Re: more ALTER .. DEPENDS ON EXTENSION fixes
Previous Message Alvaro Herrera 2020-03-11 14:14:12 Re: more ALTER .. DEPENDS ON EXTENSION fixes