Re: Selectivity estimation for inet operators

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dilip kumar <dilip(dot)kumar(at)huawei(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Selectivity estimation for inet operators
Date: 2014-09-27 10:35:35
Message-ID: 20140927103535.GA3035@hasegeli.com
Lists: pgsql-hackers

> Thanks. Overall, my impression of this patch is that it works very
> well. But damned if I understood *how* it works :-). There's a lot
> of statistics involved, and it's not easy to see why something is
> multiplied by something else. I'm adding comments as I read through
> it.

Thank you for looking at it. I tried to add more comments explaining
the multiplications. New version attached. It also fixes a bug caused
by the wrong operator order being used in the histogram-to-histogram
selectivity estimation for inner joins.

> I've gotten to the inet_semi_join_selec function:
>
> > [function]
>
> This desperately needs comment at the top of the function explaining
> what it does. Let me try to explain what I think it does:
>
> [explanation]

I used your explanation on the new version.

> Now, I think that last step is wrong. Firstly, the "Do not bother if
> histogram weight is smaller than 0.1" rule seems bogus. The
> his2_weight is the total number of rows represented by the
> histogram, so surely it can't be less than 1. It can't really be
> less than the statistics target. Unless maybe if the histogram was
> collected when the table was large, but it has since shrunk to
> contain only a few rows, but that seems like a very bizarre corner
> case. At least it needs more comments explaining what the test is
> all about, but I think we should just always use the histogram (if
> it's available).

It was an unnecessary check. I replaced it with an assertion.
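
Roughly like this, keeping the his2_weight name from your description
(the attached patch is what actually counts):

    /*
     * The histogram represents a positive number of rows whenever it
     * exists, so assert that instead of testing a threshold.
     */
    Assert(his2_weight > 0);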

> Secondly, if we estimate that there is on average 1.0 matching row
> in the table, it does not follow that the probability that at least
> one row matches is 1.0. Assuming a gaussian distribution with mean
> 1.0, the probability that at least one row matches is 0.5. Assuming
> a gaussian distribution here isn't quite right - I guess a Poisson
> distribution would be more accurate - but it sure doesn't seem right
> as it is.
>
> The error isn't very big, and perhaps you don't run into that very
> often, so I'm not sure what the best way to fix that would be. My
> statistics skills are a bit rusty, but I think the appropriate way
> would be to apply the Poisson distribution, with the estimated
> number of matched rows as the mean. The probability of at least one
> match would be the cumulative distribution function at k=1. It
> sounds like overkill, if this case occurs only rarely. But then
> again, perhaps it's not all that rare.

A function of his_weight and his_selec could be a better option
than just multiplying them. I am not sure what that function should
be, or whether it is worth the trouble. The join selectivity
estimation function for equality does not even look at the
histograms; the others only return constant values.
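
If we did go that way, the correction you describe would look roughly
like the sketch below. This is only an illustration with a
hypothetical function name; it is not in the attached patch:

    /*
     * Probability that at least one row matches, assuming the number
     * of matching rows follows a Poisson distribution with mean
     * "avg_matches":
     *
     *      P(X >= 1) = 1 - P(X = 0) = 1 - exp(-avg_matches)
     */
    static double
    poisson_at_least_one(double avg_matches)
    {
        return 1.0 - exp(-avg_matches);     /* exp() from math.h */
    }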

> That said, I can't immediately find a test case where that error
> would matter. I tried this:
>
> create table inettbl1 (a inet);
> insert into inettbl1 select '10.0.0.' || (g % 255) from
> generate_series(1, 10) g;
> analyze inettbl1;
> explain analyze select count(*) from inettbl1 where a >>= ANY
> (SELECT a from inettbl1);
>
> The estimate for that is pretty accurate, 833 rows estimated vs 1000
> actual, with the current patch. I'm afraid if we fixed
> inet_semi_join_selec the way I suggest, the estimate would be
> smaller, i.e. more wrong. Is there something else in the estimates
> that accidentally compensates for this currently?

The partial bucket match in inet_his_inclusion_selec() causes low
estimates. It also affects the non-join estimation, but not as much
as it affects the join estimations. If that part worked more
correctly, the semi join estimation could come out higher than it
should be.

network_selfuncs.c:602:
> /* Partial bucket match. */
>
> left_divider = inet_his_match_divider(left, query, opr_order);
> right_divider = inet_his_match_divider(right, query, opr_order);
>
> if (left_divider >= 0 || right_divider >= 0)
> match += 1.0 / pow(2, Max(left_divider, right_divider));

I think this calculation could benefit from a proper statistical
function more than the semi join could. Using the differing bit
count as the power of two is the best I could find. It works quite
well in most cases.
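
To spell out my reading of that line (an illustration only, with a
hypothetical helper that is not in the patch): a divider of n stands
for n differing bits, so roughly 1 / 2^n of the bucket is counted as
matching, and a negative divider means no partial match at all.

    static double
    partial_bucket_fraction(int left_divider, int right_divider)
    {
        int     divider = Max(left_divider, right_divider); /* Max() as in c.h */

        return (divider >= 0) ? 1.0 / pow(2.0, divider) : 0.0;
    }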

Attachment Content-Type Size
inet-selfuncs-v12.patch text/x-diff 29.7 KB
