Re: multivariate statistics (v19)

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2017-02-12 10:35:04
Message-ID: CAEZATCXn1jJSAR0RYtki3ByFiDnkCGZTqvp7GT_+oAj8cmWDvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11 February 2017 at 01:17, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> sloppy because the number of attributes in the statistics is currently
> limited to 8, so the overflows are currently not an issue. But it doesn't
> hurt to make it future-proof, in case we change that mostly artificial limit
> sometime in the future.
>

Ah right, so it can't overflow at present, but it's neater to have an
overflow-proof algorithm.

Thinking about the exactness of the division steps is quite
interesting. Actually, the order of the multiplying factors doesn't
matter as long as the divisors are in increasing order. So in both my
proposal:

result = 1
for (i = 1; i <= k; i++)
result = (result * (n-k+i)) / i;

and David's proposal, which is equivalent but has the multiplying
factors in the opposite order, equivalent to:

result = 1
for (i = 1; i <= k; i++)
result = (result * (n-i+1)) / i;

the divisions are exact at each step. The first time through the loop
it divides by 1 which is trivially exact. The second time it divides
by 2, having multiplied by 2 consecutive factors, one of which is
therefore guaranteed to be divisible by 2. The third time it divides
by 3, having multiplied by 3 consecutive factors, one of which is
therefore guaranteed to be divisible by 3, and so on.

My approach originally seemed more logical to me because of the way it
derives from the recurrence relation binomial(n, k) = binomial(n-1,
k-1) * n / k, but they both work fine as long as they have suitable
overflow checks.

It's also interesting that descriptions of this algorithm tend to talk
about setting k to min(k, n-k) at the start as an optimisation step,
as I did in fact, whereas it's actually more than that -- it helps
prevent unnecessary intermediate overflows when k > n/2. Of course,
that's not a worry for the current use of this function, but it's good
to have a robust algorithm.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2017-02-12 11:51:26 Re: Improve OR conditions on joined columns (common star schema problem)
Previous Message Fabien COELHO 2017-02-12 08:07:03 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)