Skip site navigation (1) Skip section navigation (2)

Advice on efficient approach to aggregation statistics

From: postgresql(dot)30(dot)miller_2555(at)spamgourmet(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Advice on efficient approach to aggregation statistics
Date: 2010-11-12 16:51:45
Message-ID: AANLkTi=1ORc6ivJz5nqAquOzX+qmWP3qw-7pO=nBrdSM@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-general
Hi -

    I have a table for which I want to fit an exponential curve to values in
one column of the table and compute a correlation statistic on the fit of
the values to the exponential curve. The table consists of the following
three columns: {"ID","Date","Value"}. The desired result of the exercise is
to have a list of "ID"s, with the associated correlation statistic (based on
the "Value" column) for a subset of "Date"s. The "fitted" exponential curve
is not a true exponential regression for performance reasons. It uses the
exponential form: y = a * exp(b * x), but selects the constant values 'a' &
'b' using only the first and last "Values" (as sorted by "Date") -- so it's
really a best-fit exponential curve which is assured to pass through the
first and last sorted points. The following is a simplified example
(assuming the best practice solution involves the development of an
aggregate function):

Table (primary key is {"ID","Date"}):
    ID |     Date     | Value
    ----------------------------------
    A | 2010-11-12 | 48.0000
    A | 2010-11-11 | 24.0000
    A | 2010-11-10 | 12.0000
    A | 2010-11-09 |  6.0000
    A | 2010-11-08 |  3.0000
    A | 2010-11-07 |  1.5000
    B | 2010-11-12 | 33.1479
    B | 2010-11-11 | 19.5030
    B | 2010-11-10 | 11.4749
    B | 2010-11-09 |  6.7514
    B | 2010-11-08 |  3.9723
    B | 2010-11-07 |  2.3371

Hypothetical query:
    SELECT "ID", correl_exp("Value") AS "Corr"
    FROM Table
    WHERE "Date" >= '2010-11-09'
    GROUP BY "ID";

Expected output:
    ID | Corr
    ---------------
    A | 1.0000
    B | 0.8690

If attempting to reconstruct the simple example, the following are the
values for constants 'a' and 'b' for "ID"s A & B (also note that values for
dates 11/7 and 11/8 are ignored given the hypothetical query):
    A: a = 3.0000, b =  0.6931
    B: a = 1.4283, b =  0.9513


Questions:
    1) What is the best approach to accomplish the expected result?
        I would think the creation of an aggregate function would be the
"best" option (i.e. highest performance with little ongoing database
maintenence), but thought a nested SQL statement may also work.
    2) If the creation of an aggregate function is the best option,
        A) What is the highest performance language (or does it even
matter)? PL/C, PL/R, PL/pgSQL, other?  Can one use pre-complied shared
libraries?
        B) What is the best state function used to construct an array of
"Values", sorted by "Date" (as both the first and the last value need to be
known to compute the exponential curve)? I have heard array_append has
performance issues.

Many thanks -
Will

PostgreSQL v. 8.3.8 (expecting to upgrade to v.9.0.1 soon -- so best
practice solution should be compatable with v9.0.1 -- if best solution takes
advantage of improvements in v9.0.1 and is incompatable with v8.3.8, please
note and I can implement the solution after upgrade)

pgsql-general by date

Next:From: StodgeDate: 2010-11-12 19:49:22
Subject: ODBC Cursor inserting records appears to lock the database
Previous:From: Mark MitchellDate: 2010-11-12 16:25:43
Subject: Re: More then 1600 columns?

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group