Re: multivariate statistics (v19)

From: David Fetter <david(at)fetter(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-02-12 19:42:07
Message-ID: 20170212194207.GA2120@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Feb 12, 2017 at 10:35:04AM +0000, Dean Rasheed wrote:
> On 11 February 2017 at 01:17, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> > sloppy because the number of attributes in the statistics is currently
> > limited to 8, so the overflows are currently not an issue. But it doesn't
> > hurt to make it future-proof, in case we change that mostly artificial limit
> > sometime in the future.
> >
>
> Ah right, so it can't overflow at present, but it's neater to have an
> overflow-proof algorithm.
>
> Thinking about the exactness of the division steps is quite
> interesting. Actually, the order of the multiplying factors doesn't
> matter as long as the divisors are in increasing order. So in both my
> proposal:
>
> result = 1
> for (i = 1; i <= k; i++)
> result = (result * (n-k+i)) / i;
>
> and David's proposal, which is equivalent but has the multiplying
> factors in the opposite order, equivalent to:
>
> result = 1
> for (i = 1; i <= k; i++)
> result = (result * (n-i+1)) / i;
>
> the divisions are exact at each step. The first time through the loop
> it divides by 1 which is trivially exact. The second time it divides
> by 2, having multiplied by 2 consecutive factors, one of which is
> therefore guaranteed to be divisible by 2. The third time it divides
> by 3, having multiplied by 3 consecutive factors, one of which is
> therefore guaranteed to be divisible by 3, and so on.

Right. You know you can use integer division, which make sense as
permutations of discrete sets are always integers.

> My approach originally seemed more logical to me because of the way it
> derives from the recurrence relation binomial(n, k) = binomial(n-1,
> k-1) * n / k, but they both work fine as long as they have suitable
> overflow checks.

Right. We could even cache those checks (sorry) based on data type
limits by architecture and OS if performance on those operations ever
matters that much.

> It's also interesting that descriptions of this algorithm tend to
> talk about setting k to min(k, n-k) at the start as an optimisation
> step, as I did in fact, whereas it's actually more than that -- it
> helps prevent unnecessary intermediate overflows when k > n/2. Of
> course, that's not a worry for the current use of this function, but
> it's good to have a robust algorithm.

Indeed. :)

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-02-12 23:06:31 Re: Improve OR conditions on joined columns (common star schema problem)
Previous Message Vladimir Rusinov 2017-02-12 17:55:56 Re: WIP: About CMake v2