Re: multivariate statistics (v19)

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-02-08 15:23:25
Message-ID: CAEZATCXagY1GjO=jgK3-1z_gKaFFYkiTNRUyOZscOT1M-DJiUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6 February 2017 at 21:26, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Tomas Vondra wrote:
>> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
>
>> > Nearby, some auxiliary functions such as n_choose_k and
>> > num_combinations are not documented. What it is that they do? I'd
>> > move these at the end of the file, keeping the important entry points
>> > at the top of the file.
>>
>> I'd say n-choose-k is pretty widely known term from combinatorics. The
>> comment would essentially say just 'this is n-choose-k' which seems rather
>> pointless. So as much as I dislike the self-documenting code, this actually
>> seems like a good case of that.
>
> Actually, we do have such comments all over the place. I knew this as
> "n sobre k", so the english name doesn't immediately ring a bell with me
> until I look it up; I think the function comment could just say
> "n_choose_k -- this function returns the binomial coefficient".
>

One of the things you have to watch out for when writing code to
compute binomial coefficients is integer overflow, since the numerator
and denominator get large very quickly. For example, the current code
will overflow for n=13, k=12, which really isn't that large.

This can be avoided by computing the product in reverse and using a
larger datatype like a 64-bit integer to store a single intermediate
result. The point about multiplying the terms in reverse is that it
guarantees that each intermediate result is an exact integer (a
smaller binomial coefficient), so there is no need to track separate
numerators and denominators, and you avoid huge intermediate
factorials. Here's what that looks like in psuedo-code:

binomial(int n, int k):
# Save computational effort by using the symmetry of the binomial
# coefficients
k = min(k, n-k);

# Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
# starting from binomial(n-k, 0) = 1, and computing the sequence
# binomial(n-k+1, 1), binomial(n-k+2, 2), ...
#
# Note that each intermediate result is an exact integer.
int64 result = 1;
for (int i = 1; i <= k; i++)
{
result = (result * (n-k+i)) / i;
if (result > INT_MAX) Raise overflow error
}
return (int) result;

Note also that I think num_combinations(n) is just an expensive way of
calculating 2^n - n - 1.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Devrim Gündüz 2017-02-08 15:35:15 Re: drop support for Python 2.3
Previous Message David Fetter 2017-02-08 15:12:14 Re: One-shot expanded output in psql using \gx