Advice on efficient approach to aggregation statistics

From: postgresql.30.miller_2555@spamgourmet.com
To: pgsql-general@postgresql.org
Subject: Advice on efficient approach to aggregation statistics
Date: 2010-11-12 16:51:45
Message-ID: AANLkTi=1ORc6ivJz5nqAquOzX+qmWP3qw-7pO=nBrdSM@mail.gmail.com
Lists: pgsql-general

Hi -

I have a table for which I want to fit an exponential curve to the values
in one column and compute a correlation statistic measuring how well the
values fit that curve. The table consists of the following three columns:
{"ID","Date","Value"}. The desired result of the exercise is a list of
"ID"s, each with its associated correlation statistic (based on the
"Value" column) for a subset of "Date"s. The "fitted" exponential curve
is not a true exponential regression, for performance reasons. It uses
the exponential form y = a * exp(b * x), but selects the constants 'a'
and 'b' using only the first and last "Value"s (as sorted by "Date") --
so it is not a least-squares fit, but an exponential curve guaranteed to
pass through the first and last sorted points. The following is a
simplified example (assuming the best-practice solution involves the
development of an aggregate function):

Table (primary key is {"ID","Date"}):
 ID |    Date    |  Value
----+------------+---------
 A  | 2010-11-12 | 48.0000
 A  | 2010-11-11 | 24.0000
 A  | 2010-11-10 | 12.0000
 A  | 2010-11-09 |  6.0000
 A  | 2010-11-08 |  3.0000
 A  | 2010-11-07 |  1.5000
 B  | 2010-11-12 | 33.1479
 B  | 2010-11-11 | 19.5030
 B  | 2010-11-10 | 11.4749
 B  | 2010-11-09 |  6.7514
 B  | 2010-11-08 |  3.9723
 B  | 2010-11-07 |  2.3371

Hypothetical query:
SELECT "ID", correl_exp("Value") AS "Corr"
FROM Table
WHERE "Date" >= '2010-11-09'
GROUP BY "ID";

Expected output:
 ID |  Corr
----+--------
 A  | 1.0000
 B  | 0.8690

If you want to reconstruct the simple example, these are the values of
the constants 'a' and 'b' for "ID"s A and B (note that the values for
dates 11/7 and 11/8 are excluded by the hypothetical query's date filter):
A: a = 3.0000, b = 0.6931
B: a = 1.4283, b = 0.9513
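
For reference, the two constants fall directly out of the first and last
retained points: for first point (x1, y1) and last point (xn, yn),
b = ln(yn / y1) / (xn - x1) and a = y1 / exp(b * x1). A minimal check,
assuming x counts days with the first retained date at x = 1 (the
convention that reproduces the constants for "ID" A above; the values
are hard-coded purely for illustration):

SELECT ln(48.0 / 6.0) / (4 - 1)            AS b,  -- 0.6931
       6.0 / exp(ln(48.0 / 6.0) / (4 - 1)) AS a;  -- 3.0000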

Questions:
1) What is the best approach to accomplish the expected result?
     I would think the creation of an aggregate function would be the
"best" option (i.e. highest performance with little ongoing database
maintenance), but a nested SQL statement may also work (see the
window-function sketch after this list).
2) If the creation of an aggregate function is the best option,
     A) What is the highest-performance language (or does it even
matter)? PL/C, PL/R, PL/pgSQL, other? Can one use pre-compiled shared
libraries?
     B) What is the best state function for constructing an array of
"Value"s sorted by "Date" (both the first and the last value must be
known to compute the exponential curve)? I have heard array_append has
performance issues (see the array_agg sketch after this list).
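
On question 1, here is what the nested-SQL route might look like. This is
a sketch, not a definitive implementation: it assumes the statistic is
the Pearson correlation between the observed "Value"s and the fitted
values a * exp(b * x) (which reproduces the 1.0000 for "ID" A), that
"Value" is always positive (ln() requires it), and that each group has at
least two rows (a single-row group divides by zero). Window functions
require 8.4+, so this runs on 9.0.1 but not on 8.3.8:

SELECT "ID",
       corr("Value", a * exp(b * x)) AS "Corr"
FROM (
    SELECT "ID", "Value",
           ("Date" - first_d) + 1 AS x,
           ln(last_v / first_v) / (last_d - first_d) AS b,
           first_v / exp(ln(last_v / first_v) / (last_d - first_d)) AS a
    FROM (
        SELECT "ID", "Date", "Value",
               min("Date") OVER w AS first_d,
               max("Date") OVER w AS last_d,
               first_value("Value") OVER (w ORDER BY "Date")      AS first_v,
               first_value("Value") OVER (w ORDER BY "Date" DESC) AS last_v
        FROM "Table"
        WHERE "Date" >= '2010-11-09'
        WINDOW w AS (PARTITION BY "ID")
    ) s
) t
GROUP BY "ID";

If the intended statistic is something other than Pearson correlation,
the corr(...) expression is the only piece that needs to change.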
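
On question 2B, if the upgrade happens first, the state-function problem
largely disappears: 9.0 allows ORDER BY inside an aggregate call, so the
built-in array_agg can deliver a date-sorted array to an ordinary final
function (array_agg itself only appeared in 8.4, so on 8.3.8 this route
would indeed need a custom array_append-based aggregate, which is exactly
the performance concern above). A sketch, under the same assumptions
about the statistic as the query above:

CREATE OR REPLACE FUNCTION correl_exp(vals double precision[])
RETURNS double precision AS $$
DECLARE
    n   int := array_length(vals, 1);
    a   double precision;
    b   double precision;
    f   double precision;
    -- accumulators for the Pearson correlation of observed vs. fitted
    sx  double precision := 0;
    sy  double precision := 0;
    sxx double precision := 0;
    syy double precision := 0;
    sxy double precision := 0;
BEGIN
    IF n IS NULL OR n < 2 THEN
        RETURN NULL;
    END IF;
    -- two-point fit y = a * exp(b * x), x = 1 .. n, pinned to the
    -- first and last elements (the values must be positive for ln())
    b := ln(vals[n] / vals[1]) / (n - 1);
    a := vals[1] / exp(b);
    FOR i IN 1 .. n LOOP
        f   := a * exp(b * i);
        sx  := sx + f;        sy  := sy + vals[i];
        sxx := sxx + f * f;   syy := syy + vals[i] * vals[i];
        sxy := sxy + f * vals[i];
    END LOOP;
    -- undefined (division by zero) if either series is constant
    RETURN (n * sxy - sx * sy)
           / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

Usage, mirroring the hypothetical query:

SELECT "ID", correl_exp(array_agg("Value"::float8 ORDER BY "Date")) AS "Corr"
FROM "Table"
WHERE "Date" >= '2010-11-09'
GROUP BY "ID";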

Many thanks -
Will

PostgreSQL v. 8.3.8 (expecting to upgrade to v9.0.1 soon -- so the
best-practice solution should be compatible with v9.0.1; if the best
solution takes advantage of improvements in v9.0.1 and is incompatible
with v8.3.8, please note that and I can implement the solution after the
upgrade)
