Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-08 01:03:16
Message-ID: 559C76D4.2030805@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Horiguchi-san!

On 07/07/2015 09:43 PM, Tomas Vondra wrote:
> -- histograms
> ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
> ANALYZE t;
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
> Seq Scan on t (cost=0.00..23870.00 rows=267033 width=24)
> (actual time=0.021..330.487 rows=273897 loops=1)
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
> Seq Scan on t (cost=0.00..23870.00 rows=14317 width=24)
> (actual time=0.027..906.321 rows=966870 loops=1)
>
> EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
> Seq Scan on t (cost=0.00..23870.00 rows=20367 width=24)
> (actual time=0.028..452.494 rows=393528 loops=1)
>
> This seems wrong, because the estimate for the OR queries should not be
> lower than the estimate for the first query (with just AND), and it
> should not increase when increasing the boundary. I'd bet this is a bug
> in how the inequalities are handled with histograms, or how the AND/OR
> clauses are combined. I'll look into that.

FWIW this was a stupid bug in update_match_bitmap_histogram(), which
initially handled only AND clauses, and thus assumed the "match" of a
bucket can only decrease. But for OR clauses this is exactly the
opposite (we assume no buckets match and add buckets matching at least
one of the clauses).

With this fixed, the estimates look like this:

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=267033 width=24)
(actual time=0.102..321.524 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=319400 width=24)
(actual time=0.103..386.089 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=956833 width=24)
(actual time=0.133..908.455 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t (cost=0.00..23870.00 rows=393633 width=24)
(actual time=0.105..440.607 rows=393528 loops=1)

IMHO pretty accurate estimates - no issue with OR clauses.

I've pushed this to github [1] but I need to do some additional fixes. I
also had to remove some optimizations while fixing this, and will have
to reimplement those.

That's not to say that the handling of OR-clauses is perfectly correct.
After looking at clauselist_selectivity_or(), I believe it's a bit
broken and will need a bunch of fixes, as explained in the FIXMEs I
pushed to github.

[1] https://github.com/tvondra/postgres/tree/mvstats

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2015-07-08 02:27:10 Re: FPW compression leaks information
Previous Message David Rowley 2015-07-08 00:29:38 Re: Performance improvement for joins where outer side is unique