Re: Selectivity estimation for inet operators

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dilip kumar <dilip(dot)kumar(at)huawei(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Selectivity estimation for inet operators
Date: 2014-09-25 09:12:45
Message-ID: 5423DC8D.3080206@vmware.com
Lists: pgsql-hackers

On 09/07/2014 07:09 PM, Emre Hasegeli wrote:
>>>> I updated the patch to cover semi and anti joins with eqjoinsel_semi().
>>>> I think it is better than returning a constant.
>>>
>>> What you did there is utterly unacceptable from a modularity standpoint;
>>> and considering that the values will be nowhere near right, the argument
>>> that "it's better than returning a constant" seems pretty weak. I think
>>> you should just take that out again.
>>
>> I will try to come up with a better, data type specific implementation
>> in a week.
>
> New version with semi join estimation function attached.

Thanks. Overall, my impression of this patch is that it works very well.
But damned if I understood *how* it works :-). There's a lot of
statistics involved, and it's not easy to see why something is
multiplied by something else. I'm adding comments as I read through it.

I've gotten to the inet_semi_join_selec function:

> /*
>  * Inet semi join selectivity estimation.
>  */
> static Selectivity
> inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
>                      bool his2_exists, Datum *his2_values, int his2_nvalues,
>                      double his2_weight, Datum *constvalue,
>                      FmgrInfo *proc, short opr_order)
> {
>     if (mcv2_exists)
>     {
>         int         i;
>
>         for (i = 0; i < mcv2_nvalues; i++)
>         {
>             if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
>                                                *constvalue, mcv2_values[i])))
>                 return 1.0;
>         }
>     }
>
>     /* Do not bother if histogram weight is smaller than 0.1. */
>     if (his2_exists && his2_weight > 0.1)
>     {
>         Selectivity his_selec;
>
>         his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
>                                              constvalue, opr_order);
>
>         if (his_selec > 0)
>             return Min(1.0, his2_weight * his_selec);
>     }
>
>     return 0.0;
> }

This desperately needs a comment at the top of the function explaining
what it does. Let me try to explain what I think it does:

This function calculates the probability that there is at least one row
in table B that satisfies the "constant op column" qual. The constant
is passed as an argument, and for table B, the MCV list and histogram
are provided. his2_weight is the total number of rows in B that are
covered by the histogram. For example, if the table has 1000 rows, 10%
of the rows are in the MCV list, and another 10% are NULLs, his2_weight
would be 800.
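
Just to spell out the arithmetic (a sketch with made-up numbers; the
variable names are illustrative, not necessarily what the patch uses):

static double
example_his2_weight(void)
{
    double      ntuples = 1000.0;   /* estimated table size */
    double      nullfrac = 0.1;     /* fraction of NULLs */
    double      mcvfrac = 0.1;      /* total frequency of the MCV entries */

    /* rows covered by the histogram: everything except NULLs and MCVs */
    return ntuples * (1.0 - nullfrac - mcvfrac);    /* = 800 */
}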

First, we check if the constant matches any of the most common values.
If it does, return 1.0, because then there is surely a match.

Next, we use the histogram to estimate the number of rows in the table
that match the qual. If it amounts to more than 1 row, we return 1.0.
If it's between 0.0 and 1.0 rows, we return that number as the probability.
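
To make that concrete (illustrative numbers only, not taken from the
patch): with his2_weight = 800 and a histogram selectivity of 0.002, the
estimate is 800 * 0.002 = 1.6 matching rows, so the function returns 1.0;
with a selectivity of 0.0005 it returns 800 * 0.0005 = 0.4.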

Now, I think that last step is wrong. Firstly, the "Do not bother if
histogram weight is smaller than 0.1" rule seems bogus. The his2_weight
is the total number of rows represented by the histogram, so surely it
can't be less than 1. It can't really be less than the statistics
target. Unless perhaps the histogram was collected when the table was
large and the table has since shrunk to only a few rows, but that seems
like a very bizarre corner case. At least the test needs more comments
explaining what it is all about, but I think we should just always use
the histogram (if it's available).

Secondly, if we estimate that there is on average 1.0 matching row in
the table, it does not follow that the probability that at least one row
matches is 1.0. Assuming a Gaussian distribution with mean 1.0, the
probability that at least one row matches is 0.5. Assuming a Gaussian
distribution here isn't quite right - I guess a Poisson distribution
would be more accurate - but the current code sure doesn't seem right
as it is.

The error isn't very big, and perhaps you don't run into it very often,
so I'm not sure what the best way to fix it would be. My statistics
skills are a bit rusty, but I think the appropriate way would be to
apply the Poisson distribution, with the estimated number of matched
rows as the mean. The probability of at least one match would then be
one minus the probability of zero matches, i.e. 1 - e^(-mean). It
sounds like overkill if this case occurs only rarely. But then again,
perhaps it's not all that rare.
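
Roughly like this - a minimal sketch of the idea only, with a
hypothetical helper name, not code from the patch (Selectivity is just
a typedef for double in PostgreSQL):

#include <math.h>

typedef double Selectivity;     /* as in PostgreSQL's nodes/nodes.h */

/*
 * Hypothetical helper: probability that at least one row matches, given
 * the expected number of matching rows, assuming the number of matches
 * follows a Poisson distribution.
 */
static Selectivity
poisson_prob_at_least_one(double expected_matches)
{
    /* P(X >= 1) = 1 - P(X = 0) = 1 - e^(-lambda) */
    return 1.0 - exp(-expected_matches);
}

The last step of inet_semi_join_selec would then return
poisson_prob_at_least_one(his2_weight * his_selec) instead of
Min(1.0, his2_weight * his_selec); for an expected 1.0 matching rows,
that gives about 0.63 rather than 1.0.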

That said, I can't immediately find a test case where that error would
matter. I tried this:

create table inettbl1 (a inet);
insert into inettbl1 select '10.0.0.' || (g % 255) from
generate_series(1, 1000) g;
analyze inettbl1;
explain analyze select count(*) from inettbl1 where a >>= ANY (SELECT a
from inettbl1);

The estimate for that is pretty accurate, 833 rows estimated vs 1000
actual, with the current patch. I'm afraid if we fixed
inet_semi_join_selec the way I suggest, the estimate would be smaller,
i.e. more wrong. Is there something else in the estimates that
accidentally compensates for this currently?

Thoughts?

- Heikki
