Re: BRIN cost estimate

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BRIN cost estimate
Date: 2015-12-24 16:39:01
Message-ID: CAE2gYzzJvzPy-1cSgZjJyH69izSa13SEgFC4i4r2z0qBQ2P8Uw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

> Somebody wrote to me a few days ago that the BRIN cost estimation is
> rather poor. One immediately obvious issue which I think is easily
> fixed is the correlation estimate, which is currently hardcoded to 1.
>
> Since a BRIN scan always entails a scan of the relation in physical
> order, it's simple to produce a better estimate for that, per the
> attached patch. (Note: trying to run this on expression indexes will
> cause a crash. I need to complete that part ...)
>
> There are other improvements we've been discussing, but I'm unclear that
> I will have time to study them to get a useful patch quickly enough. If
> somebody else wants to mess with this, please get in touch.
>
> Thoughts?

As we have discusses, the correlation is not used for bitmap index
scan cost estimation. I think the best we can do in here is to somehow
increase the selectivity. I suggest something like this:

/*
* Index selectivity is important for the planner to calculate the cost
* of the bitmap heap scan. Unfortunately, we don't have a robust way to
* estimate selectivity of BRIN. It can dependent on many things. This
* is a long rationale about the incomplete calculation we have at the
* moment.
*
* Our starting point is that BRIN selectivity has to be less than the
* selectivity of the btree. We are using a product of logical and
* physical selectivities to achieve this. The equality of
*
* (1 + logical_selectivity) * (1 + physical_selectivity) - 1
*
* is used to make sure the result is not less than any of the values.
*
* The logical selectivity is calculated using the indexable expressions
* of the WHERE clause. The physical selectivity is calculated using
* the block proportion and the maximum correlation. The block
* proportion is a comparable value with selectivity. It is the
* selectivity of the smallest unit of the index. The final selectivity
* can never be less than that.
*
* Using the contrary of the correlation by subtracting it from 1 is not
* not really a comparable value with the selectivity. It is just a
* value between 0 and 1. On the other hand, it is the only value
* related to the BRIN quality, we have available right now. We are
* using the arithmetic of it with the block proportion to normalise it.
* This part of the physical selectivity is likely to be more effective
* than the block proportion in many circumstances as there would be
* many blocks on big tables.
*
* Using the contrary of the correlation of a column as selectivity of
* the index is wrong in many ways. First of all, it cannot be applied
* to all BRIN operator classes. It makes sense for the main built-in
* operator class "minmax", and makes a little sense for the other one
* "inclusion". It wouldn't probably make any sense for a bloom filter
* implementation, if there would be any. Maybe, we should push down
* this function to the operator class, but there is not enough reason
* to do it right now.
*
* Second, correlation is not dependent to any indexed expression. It
* probably doesn't make any sense for the complicated operators. It
* would probably effect basic comparison operators differently than
* equality operator. The effect would even differ by count of those
* expressions. For example, x IN (10, 20, 30) would be effected from
* correlation more than x = 15, even when their selectivities are the
* same.
*
* Last but not least, the correlation is a single value for the whole
* range. The indexed table can partly be very well correlated, but
* the correlation value can still be very low. For example, if a
* perfectly correlated table is copied 4 times, the correlation would
* be 0.25, although the index would be almost as good as the version on
* the initial table. Or the expression can match the better correlated
* part of the table. It is not hard to imagine more scenarios where
* the correlation is a bad value to use as the selectivity. We should
* probably improve this by collecting more statistics, one day.
*
* Another problem in here is that the caller assumes the selectivity
* by tuples. It might have been better, if we had a way to return it
* as some number of pages. On the other hand, even though we know about
* the index, it is not too much easier for us to estimate the number of
* matching pages then it is for the caller. We are likely to make too
* much mistake by relying on the correlation, anyway. We are at least
* not making it worse in here.
*/
blockSelectivity = (blockProportion + 1 - *indexCorrelation) / 2;
selec = (1 + qualSelectivity) * (1 + blockSelectivity) - 1;
CLAMP_PROBABILITY(selec);

The patch is attached.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-12-24 16:55:16 Re: WIP: bloom filter in Hash Joins with batches
Previous Message Tom Lane 2015-12-24 15:44:22 Re: Comment typo in pg_dump.h