Re: Selectivity estimation for inet operators

From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Dilip kumar <dilip(dot)kumar(at)huawei(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Selectivity estimation for inet operators
Date: 2014-12-02 21:09:27
Message-ID: 20141202210927.GA3990@hasegeli.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I spent a fair chunk of the weekend hacking on this patch to make
> it more understandable and fix up a lot of what seemed to me pretty
> clear arithmetic errors in the "upper layers" of the patch. However,
> I couldn't quite convince myself to commit it, because the business
> around estimation for partial histogram-bucket matches still doesn't
> make any sense to me. Specifically this:
>
> /* Partial bucket match. */
> left_divider = inet_hist_match_divider(left, query, opr_codenum);
> right_divider = inet_hist_match_divider(right, query, opr_codenum);
>
> if (left_divider >= 0 || right_divider >= 0)
> match += 1.0 / pow(2.0, Max(left_divider, right_divider));
>
> Now unless I'm missing something pretty basic about the divider
> function, it returns larger numbers for inputs that are "further away"
> from each other (ie, have more not-in-common significant address bits).
> So the above calculation seems exactly backwards to me: if one endpoint
> of a bucket is "close" to the query, or even an exact match, and the
> other endpoint is further away, we completely ignore the close/exact
> match and assign a bucket match fraction based only on the further-away
> endpoint. Isn't that exactly backwards?

You are right that partial bucket match calculation isn't fair on
some circumstances.

> I experimented with logic like this:
>
> if (left_divider >= 0 && right_divider >= 0)
> match += 1.0 / pow(2.0, Min(left_divider, right_divider));
> else if (left_divider >= 0 || right_divider >= 0)
> match += 1.0 / pow(2.0, Max(left_divider, right_divider));
>
> ie, consider the closer endpoint if both are valid. But that didn't seem
> to work a whole lot better. I think really we need to consider both
> endpoints not just one to the exclusion of the other.

I have tried many combinations like this. Including arithmetic,
geometric, logarithmic mean functions. I could not get good results
with any of them, so I left it in a basic form.

Max() works pretty well most of the time, because if the query matches
the bucket generally it is close to both of the endpoints. By using
Max(), we are actually crediting the ones which are close to the both
of the endpoints.

> I'm also not exactly convinced by the divider function itself,
> specifically about the decision to fail and return -1 if the masklen
> comparison comes out wrong. This effectively causes the masklen to be
> the most significant part of the value (after the IP family), which seems
> totally wrong. ISTM we ought to consider the number of leading bits in
> common as the primary indicator of "how far apart" a query and a
> histogram endpoint are.

The partial match calculation with Max() is especially unfair on
the buckets where more significant bits change. For example 63/8 and
64/8. Returning -1 instead of a high divider, forces it to use
the divider for the other endpoint. We consider the number of leading
bits in common as the primary indicator, just for the other endpoint.

I have also experimented with the count of the common bits of
the endpoints of the bucket for better partial match calculation.
I could not find out a meaningful equation with it.

> Even if the above aspects of the code are really completely right, the
> comments fail to explain why. I spent a lot of time on the comments,
> but so far as these points are concerned they still only explain what
> is being done and not why it's a useful calculation to make.

I couldn't write better comments because I don't have strong arguments
about it. We can say that we don't try to make use of the both of
the endpoints, because we don't know how to combine them. We only use
the one with matching family and masklen, and when both of them match
we use the distant one to be on the safer side.

Thank you for looking at it. Comments look much better now.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-12-02 21:11:05 Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Previous Message Robert Haas 2014-12-02 21:00:33 Re: B-Tree support function number 3 (strxfrm() optimization)