Re: Weighted Stats

From: David Fetter <david(at)fetter(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Weighted Stats
Date: 2016-03-19 06:34:37
Message-ID: 20160319063437.GD1950@fetter.org
Lists: pgsql-hackers

On Fri, Mar 18, 2016 at 06:12:12PM -0700, Jeff Janes wrote:
> On Tue, Mar 15, 2016 at 8:36 AM, David Fetter <david(at)fetter(dot)org> wrote:
> >
> > Please find attached a patch that uses the float8 version to cover the
> > numeric types.
>
> Is there a well-defined meaning for having a negative weight? If no,
> should it be disallowed?

Opinions on this appear to vary. The Wikipedia article on the weighted
arithmetic mean defines weights as non-negative, while the GSL manual to
which it refers only uses non-zero weights.

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html

I'm not sure which, if either, would be authoritative, but I could
certainly write a variant for each assumption.

The one assumption they share about weights is that a row with zero
weight takes no part in the calculation, and the previously submitted
code implements that assumption.
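
To illustrate the zero-weight rule, here is a minimal Python sketch, not
the patch's C code. The function name weighted_avg_skip_zero and the
choice to return None when no row carries weight are assumptions made
for this example only. It shows one reason skipping matters: a value of
NaN multiplied by a zero weight would otherwise poison the sum.

```python
import math


def weighted_avg_skip_zero(pairs):
    """Weighted mean of (value, weight) pairs, ignoring zero-weight rows.

    Hypothetical helper for illustration; returns None when no row
    contributes any weight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, weight in pairs:
        if weight == 0:
            # A zero-weight row is not part of the calculation at all,
            # so even a NaN value here cannot affect the result.
            continue
        weighted_sum += value * weight
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


rows = [(1.0, 1), (3.0, 3), (math.nan, 0)]
print(weighted_avg_skip_zero(rows))  # (1*1 + 3*3) / (1 + 3) = 2.5
```

Without the skip, nan * 0 evaluates to nan in IEEE 754 arithmetic and
the whole aggregate would come out as NaN.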

> I don't know what I was expecting, but not this:
>
> select weighted_avg(x,10000000-2*x) from generate_series(1,10000000) f(x);
> weighted_avg
> ------------------
> 16666671666717.1

I'm guessing that negative weights can cause bizarre outcomes,
assuming it turns out we should allow them.
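
As a rough check on the quoted query, the same weights can be evaluated
in exact integer arithmetic. This is a Python sketch rather than SQL, and
the function names are made up for the example. With weights n - 2*x for
x in 1..n, the weights sum to -n, so the numerator is divided by a total
that is tiny compared with the individual weights, and the "average"
lands far outside the range of the inputs.

```python
from fractions import Fraction


def weighted_avg_exact(n):
    """Exact weighted mean of x = 1..n with weights n - 2*x."""
    numerator = sum(x * (n - 2 * x) for x in range(1, n + 1))
    denominator = sum(n - 2 * x for x in range(1, n + 1))  # equals -n
    return Fraction(numerator, denominator)


def closed_form(n):
    """The same quantity simplified by hand: (n + 1) * (n + 2) / 6."""
    return Fraction((n + 1) * (n + 2), 6)


print(weighted_avg_exact(10))     # 22, although every x is in 1..10
print(closed_form(10_000_000))    # 16666671666667
```

For n = 10000000 the exact value is 16666671666667, so the float8
result quoted above is off only in its last few digits. Under these
assumptions the surprising magnitude comes from the mixed-sign weights
nearly cancelling, not from the arithmetic going wrong.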

> Also, I think it might not give the correct answer even without
> negative weights:
>
> create table foo as select floor(random()*10000)::int val from
> generate_series(1,10000000);
>
> create table foo2 as select val, count(*) from foo group by val;
>
> Shouldn't these then give the same result:
>
> select stddev_samp(val) from foo;
> stddev_samp
> -------------------
> 2887.054977297105
>
> select weighted_stddev_samp(val,count) from foo2;
> weighted_stddev_samp
> ----------------------
> 2887.19919651336
>
> The 5th digit seems too early to be seeing round-off error.

Please pardon me if I've misunderstood, but you appear to be assuming
that

SELECT val, count(*) FROM foo GROUP BY val

will produce precisely identical count(*)s at each row, which it
almost certainly won't, producing the difference you see above.

What have I misunderstood?
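
One possible reading of the quoted comparison is that count is meant as
a frequency weight, i.e. each row of foo2 stands for count identical rows
of foo. The Python sketch below compares two common weighted sample
standard deviation estimators on a small made-up data set. Both function
names are hypothetical, and this message does not say which formula, if
either, the submitted patch uses.

```python
import math
import statistics
from collections import Counter


def freq_weighted_stddev_samp(values, counts):
    """Weights are repeat counts: divide by (sum of counts - 1)."""
    n = sum(counts)
    mean = sum(v * c for v, c in zip(values, counts)) / n
    ss = sum(c * (v - mean) ** 2 for v, c in zip(values, counts))
    return math.sqrt(ss / (n - 1))


def reliability_weighted_stddev_samp(values, weights):
    """Weights are relative precisions: scale by V1 / (V1^2 - V2)."""
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    mean = sum(v * w for v, w in zip(values, weights)) / v1
    ss = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    return math.sqrt(ss * v1 / (v1 * v1 - v2))


raw = [1, 1, 1, 2, 2, 5, 7, 7, 7, 7]          # plays the role of foo
grouped = Counter(raw)                        # plays the role of foo2
vals, cnts = list(grouped.keys()), list(grouped.values())

print(statistics.stdev(raw))
print(freq_weighted_stddev_samp(vals, cnts))         # matches the line above
print(reliability_weighted_stddev_samp(vals, cnts))  # differs
```

Under the frequency-weight reading the grouped and ungrouped results
agree regardless of how the counts are distributed across rows; under
the reliability-weight reading they differ unless every count is 1.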

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
