Re: Additional improvements to extended statistics

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Additional improvements to extended statistics
Date: 2020-03-09 18:19:15
Message-ID: 20200309181915.5lxhuw2qxoihfoqo@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
>On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Speaking of which, would you take a look at [1]? I think supporting SAOP
>> is fine, but I wonder if you agree with my conclusion we can't really
>> support inclusion @> as explained in [2].
>>
>
>Hmm, I'm not sure. However, thinking about your example in [2] reminds
>me of a thought I had a while ago, but then forgot about --- there is
>a flaw in the formula used for computing probabilities with functional
>dependencies:
>
> P(a,b) = P(a) * [f + (1-f)*P(b)]
>
>because it might return a value that is larger that P(b), which
>obviously should not be possible.

Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:

create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;

which then does this:

test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)

test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)

test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)

Which I think is the issue you've described ...

>We should amend that formula to prevent a result larger than P(b). The
>obvious way to do that would be to use:
>
> P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
>
>but actually I think it would be better and more principled to use:
>
> P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
>
>I.e., for those rows believed to be functionally dependent, we use the
>minimum probability, and for the rows believed to be independent, we
>use the product.
>

Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(

We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).

It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.

>I think that would solve the problem with the example you gave at the
>end of [2], but I'm not sure if it helps with the general case.
>

I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-03-09 18:30:18 Re: logical replication empty transactions
Previous Message Robert Haas 2020-03-09 17:53:18 Re: range_agg