Re: multivariate statistics (v19)

From: David Fetter <david(at)fetter(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-02-08 16:09:08
Message-ID: 20170208160908.GC8118@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> > Tomas Vondra wrote:
> >> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> >
> >> > Nearby, some auxiliary functions such as n_choose_k and
> >> > num_combinations are not documented. What it is that they do? I'd
> >> > move these at the end of the file, keeping the important entry points
> >> > at the top of the file.
> >>
> >> I'd say n-choose-k is pretty widely known term from combinatorics. The
> >> comment would essentially say just 'this is n-choose-k' which seems rather
> >> pointless. So as much as I dislike the self-documenting code, this actually
> >> seems like a good case of that.
> >
> > Actually, we do have such comments all over the place. I knew this as
> > "n sobre k", so the english name doesn't immediately ring a bell with me
> > until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
>
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
>
> This can be avoided by computing the product in reverse and using a
> larger datatype like a 64-bit integer to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a
> smaller binomial coefficient), so there is no need to track separate
> numerators and denominators, and you avoid huge intermediate
> factorials. Here's what that looks like in psuedo-code:
>
> binomial(int n, int k):
> # Save computational effort by using the symmetry of the binomial
> # coefficients
> k = min(k, n-k);
>
> # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
> # starting from binomial(n-k, 0) = 1, and computing the sequence
> # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
> #
> # Note that each intermediate result is an exact integer.
> int64 result = 1;
> for (int i = 1; i <= k; i++)
> {
> result = (result * (n-k+i)) / i;
> if (result > INT_MAX) Raise overflow error
> }
> return (int) result;
>
>
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.

Combinations are n!/(k! * (n-k)!), so computing those is more
along the lines of:

unsigned long long
choose(unsigned long long n, unsigned long long k) {
if (k > n) {
return 0;
}
unsigned long long r = 1;
for (unsigned long long d = 1; d <= k; ++d) {
r *= n--;
r /= d;
}
return r;
}

which greatly reduces the chance of overflow.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2017-02-08 16:15:35 Re: [PATCH] configure-time knob to set default ssl ciphers
Previous Message Andrew Dunstan 2017-02-08 16:07:51 Re: drop support for Python 2.3