Re: Weighted Stats

From: David Fetter <david(at)fetter(dot)org>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Weighted Stats
Date: 2016-03-20 22:38:40
Message-ID: 20160320223840.GA17918@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 19, 2016 at 05:04:08PM +0100, Tomas Vondra wrote:
> Hi,
>
> On 03/19/2016 07:34 AM, David Fetter wrote:
> >On Fri, Mar 18, 2016 at 06:12:12PM -0700, Jeff Janes wrote:
> >>On Tue, Mar 15, 2016 at 8:36 AM, David Fetter <david(at)fetter(dot)org> wrote:
> >>>
> >>>Please find attached a patch that uses the float8 version to cover the
> >>>numeric types.
> >>
> >>Is there a well-defined meaning for having a negative weight? If no,
> >>should it be disallowed?
> >
> >Opinions on this appear to vary. A Wikipedia article defines weights
> >as non-negative, while a manual to which it refers only uses non-zero.
> >
> >https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition
> >https://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html
>
> I don't think that actually allows negative weights. It says that
>
> w_i = 1/\sigma_i^2
>
> and variance is always > 0, thus w_i > 0. The zero is used as a special flag
> to remove the sample from the data set in a simple way.
>
> >I'm not sure which if either would be authoritative, but I could
> >certainly make up variants for each assumption.
> >
> >The assumption they have in common about weights is that a zero
> >weight is not part of the calculation, which assumption is
> >implemented in the previously submitted code.
>
> I think that if we're not sure what should happen with negative weights,
> then we should disallow them. It's easy to allow them later once we have a
> reasonable definition, but if we allow them now and later realize it should
> behave differently, we'll be in trouble because of breaking existing uses.

OK

> I can't really come up with a reasonable example that would actually use
> negative weights. Can you? That would probably help with defining the
> behavior correctly.

No, but I'm not a statistician. I've seen them mentioned in contexts
that appear to be discussions among same, and again opinions vary.

> Allowing negative weights has other consequences. For example, what if
> sum(W) ends up being 0? For example
>
> CREATE TABLE t (a float, b float);
> INSERT INTO t SELECT i, 1 FROM generate_series(1,1000) s(i);
> INSERT INTO t SELECT i, -1 FROM generate_series(1,1000) s(i);
>
> SELECT weighted_avg(a,b) FROM t;
> weighted_avg
> --------------
> NaN
> (1 row)
>
> Is that the correct behavior? Why?

It's not, and you're right.

I will send a patch that disallows negative weights this evening or
tomorrow. It will be slightly more complicated as I believe I will
need to create a new accumulator function for the weighted_avg() case
where I had been using an extant one before.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yuri Niyazov 2016-03-20 22:40:59 pg_upgrade documentation improvement patch
Previous Message Fabien COELHO 2016-03-20 21:56:30 Re: extend pgbench expressions with functions