Re: Help needed structuring Postgresql correlation query

From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: Tim Smith <randomdev4+postgres(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Help needed structuring Postgresql correlation query
Date: 2016-06-21 12:29:44
Message-ID: 6F703A96-FF6A-4967-9C11-BB198A06698D@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


> On 19 Jun 2016, at 10:58, Tim Smith <randomdev4+postgres(at)gmail(dot)com> wrote:
>
> Hi,
>
> My postgresql-fu is not good enough to write a query to achieve this
> (some may well say r is a better suited tool to achieve this !).
>
> I need to calculate what I would call a correlation window on a time
> series of data, my table looks like this :
>
> create table data(data_date date,data_measurement numeric);
> insert into data values('2016-01-01',16.23);
> <etc>
> insert into data values('2016-06-19',30.54);
>
> My "target sample" would be the N most recent samples in the table
> (e.g. 20, the most recent 20 days)
>
> My "potential sample" would be a moving window of size N (the same
> size N as above), starting at T0 (i.e. 2016-01-01 in this example) and
> incrementing by one (i.e. 2016-01-01 + 20, then 2016-01-02+20 etc),
> but the "target sample" would obviously be excluded.
>
> The output needs to display window date range (or at least the start
> date of the "potential sample" window) and the result
> corr(target,potential).
>
> Hope that makes sense

Something like this could do the trick (untested):

with recursive sample (nr, start_date) as (
select 1 as nr, data_date as start_date, SUM(data_measurement) as total
from generate_series(0, 19) range(step)
left join data on (data_date = start_date + range.step)

union all

select nr + 1, sample.start_date +1, SUM(data_measurement) as total
from sample
join generate_series(0, 19) range(step)
left join data on (data_date = start_date +1 + range.step)
where start_date +1 +19 <= (select MAX(data_date) from data)
group by 1, 2
)
select * from sample where start_date >= '2016-01-01';

Not sure how best to go about parameterising sample size N, a stored function seems like a good option.

Another approach would be to move a (cumulative) window-function with 20 items over your data set and for each row subtract the first value of the previous window from the total of the current window (that is, assuming you're calculating a SUM of data_measurement for each window of 20 records).

Visually that looks something like this for sample size 4:
sample 1: (A + B + C + D)
sample 2: (A + B + C + D) + E - A = (B + C + D + E)
sample 3: (B + C + D + E) + F - B = (C + D + E + F)
etc.

To accomplish this, you calculate two cumulative totals (often misnamed as running totals, but AFAIK that's something different), one from the start, and one lagging N rows behind (you can use the lag() window function for that) and subtract the two.

Good luck!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Job 2016-06-21 15:59:18 R: R: Vacuum full: alternatives?
Previous Message Michael Paquier 2016-06-21 07:05:40 Re: Regression tests (Background Workers)