Re: Select ranges based on sequential breaks

From: Joel Nothman <jnothman(at)student(dot)usyd(dot)edu(dot)au>
To: Mike Toews <mwtoews(at)sfu(dot)ca>
Cc: David Wilson <david(dot)t(dot)wilson(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Select ranges based on sequential breaks
Date: 2009-06-22 14:34:49
Message-ID: befabc340906220734u3afcbfffg33c357c71e424552@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Mike,

I happened upon your query, which is related to some stuff I've been
playing with.

Firstly, David's solution below doesn't work. I haven't yet tried to work
out why.

Secondly, I was hoping to be able to solve your problem nicely with
Postgres 8.4's window functions [1,2], which provide functions relating.

Given the following setup:
CREATE TABLE foo (dt date, bin varchar(4));
INSERT INTO foo VALUES ('2009-01-01', 'red'), ('2009-01-02', 'red'),
('2009-01-03', 'blue'), ('2009-01-04', 'blue'), ('2009-01-05', 'blue'),
('2009-01-06', 'red'), ('2009-01-07', 'blue'), ('2009-01-08', 'blue'),
('2009-01-09', 'red'), ('2009-01-10', 'red');

I had hoped the following would suffice:
SELECT first_value(dt) OVER w, last_value(dt) OVER w, bin FROM foo WINDOW
w AS (ORDER BY dt PARTITION BY bin);

Apparently, this is bad syntax. ORDER BY cannot precede PARTITION BY in a
WINDOW declaration, and yet I wanted a grouping of date-consecutive bins,
which (PARTITION BY bin ORDER BY dt) would not give me.

I was able to produce the required result with:

SELECT MIN(dt) AS first, MAX(dt) AS last, MAX(bin) AS bin
FROM (SELECT dt, bin,
CASE WHEN newbin IS NULL
THEN 0
ELSE SUM(newbin) OVER (ORDER BY dt)
END AS binno
FROM (SELECT *,
(bin != lag(bin, 1) OVER (ORDER BY dt))::int AS
newbin
FROM foo
) AS newbins
) AS binnos
GROUP BY binno
ORDER BY first;

This relies on a middle step in which I create an enumeration of the bins
in sequence:
SELECT dt, bin, CASE WHEN newbin IS NULL THEN 0 ELSE sum(newbin) OVER
(ORDER BY dt) END AS binno FROM (SELECT *, (bin != lag(bin, 1) OVER (ORDER
BY dt))::int AS newbin FROM foo) AS newbins;
dt | bin | binno
------------+------+-------
2009-01-01 | red | 0
2009-01-02 | red | 0
2009-01-03 | blue | 1
2009-01-04 | blue | 1
2009-01-05 | blue | 1
2009-01-06 | red | 2
2009-01-07 | blue | 3
2009-01-08 | blue | 3
2009-01-09 | red | 4
2009-01-10 | red | 4

I would hope there is a neater way to do this with window functions.

The best way to solve your problem may be with PL/SQL, which is also good
at dealing with sequences (it's not as complicated as it looks!):

CREATE TYPE bindates AS (first date, last date, bin varchar(5));
CREATE OR REPLACE FUNCTION bindates()
RETURNS SETOF bindates
AS $$
DECLARE
row record;
res bindates;
BEGIN
FOR row IN SELECT * FROM foo ORDER BY dt
LOOP
IF res.bin IS NULL OR res.bin != row.bin THEN
IF res.bin IS NOT NULL THEN
RETURN NEXT res;
END IF;
res.bin := row.bin;
res.first := row.dt;
END IF;
res.last := row.dt;
END LOOP;
IF res.bin IS NOT NULL THEN
RETURN NEXT res;
END IF;
END;
$$ LANGUAGE plpgsql;

Finally, I'll try to sort out David's solution. Once we correct his typo
(t1.order -> t1.date) and add an 'ORDER BY first' to the end, we get:
first | last | bin
------------+------------+------
2009-01-03 | 2009-01-05 | blue
2009-01-06 | 2009-01-06 | red
2009-01-07 | 2009-01-08 | blue
2009-01-09 | | red

This includes correct output, but it fails on both edge cases. The
non-appearance of the first row is due to the WHERE clause on the main
SELECT statement:
WHERE (SELECT bin
FROM foo t2
WHERE t2.dt < t1.dt
ORDER BY dt DESC LIMIT 1) <> t1.bin

If we drop this WHERE clause, we get:
first | last | bin
------------+------------+------
2009-01-01 | 2009-01-02 | red
2009-01-02 | 2009-01-02 | red
2009-01-03 | 2009-01-05 | blue
2009-01-04 | 2009-01-05 | blue
2009-01-05 | 2009-01-05 | blue
2009-01-06 | 2009-01-06 | red
2009-01-07 | 2009-01-08 | blue
2009-01-08 | 2009-01-08 | blue
2009-01-09 | | red
2009-01-10 | | red

We can therefore get the result including the first row by selecting from
this table with 'GROUP BY last, bin'.

And we can hack in a value for those final NULLs as a special case. The
following statement works:

SELECT MIN(first),
CASE WHEN last IS NULL THEN (SELECT MAX(dt) FROM foo) ELSE last
END,
bin
FROM (
SELECT dt AS first,
(SELECT dt
FROM foo t3
WHERE t3.dt < (
SELECT dt
FROM foo t5
WHERE t5.dt > t1.dt
AND t5.bin <> t1.bin
ORDER BY dt ASC
LIMIT 1)
ORDER BY dt DESC
LIMIT 1) AS last,
bin
FROM foo t1
) t0
GROUP BY last, bin
ORDER BY last;

Finally, what's efficient? With 1,000,000 random rows, we get:
Enumeration: 13s
PL/SQL: 12s
Modified David: minutes.

[I used the following to create test data:
CREATE OR REPLACE FUNCTION make_random(n int) RETURNS SETOF foo AS $$
import random
for i in xrange(n):
m = random.randint(1,12)
d = random.randint(1,28)
b = random.choice(('red', 'blue'))
yield '2009-%d-%d' % (m, d), b
$$ LANGUAGE plpythonu;
DELETE FROM foo; INSERT INTO foo (SELECT * FROM make_random(1000000));]

I hope that helps you in considering various approaches to the problem.

- Joel

[1] http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html
[2] http://developer.postgresql.org/pgdocs/postgres/functions-window.html

On Tue, 16 Jun 2009 06:06:16 +1000, David Wilson
<david(dot)t(dot)wilson(at)gmail(dot)com> wrote:

> On Mon, Jun 15, 2009 at 2:23 PM, Mike Toews<mwtoews(at)sfu(dot)ca> wrote:
>> Hi,
>>
>> I'm having difficulty constructing a query that will find breaks where
>> data
>> change in a time-series. I've done some searching for this too, but I
>> haven't found anything.
>>
>> Here is my example situation, consider my source table:
>> date    bin
>> 2009-01-01      red
>> 2009-01-02      red
>> 2009-01-03      blue
>> 2009-01-04      blue
>> 2009-01-05      blue
>> 2009-01-06      red
>> 2009-01-07      blue
>> 2009-01-08      blue
>> 2009-01-09      red
>> 2009-01-10      red
>>
>>
>> I would like to get the first and last of each consecutive series based
>> on
>> column "bin". My result for the table would look like:
>> first   last    bin
>> 2009-01-01      2009-01-02      red
>> 2009-01-03      2009-01-05      blue
>> 2009-01-06      2009-01-06      red
>> 2009-01-07      2009-01-08      blue
>> 2009-01-09      2009-01-10      red
>>
>>
>> This is easy to compute using a spreadsheet or in R, but how would I do
>> this
>> with SQL? I'm using 8.3. Advice is appreciated.
>
> (Written in email and untested- also, someone will probably provide a
> better way, I hope, but this should at least work)
>
> select date as first,
> (select date from table t3 where t3.date<(select date from table t5
> where t5.date>t1.date and t5.bin<>t1.bin order by date asc limit 1)
> order by date desc limit 1) as last,
> bin
> from table t1 where (select bin from table t2 where t2.date<t1.order
> order by date desc limit 1)<>t1.bin;
>
> Ugly, and I'm pretty sure there's a much better way, but my brain is
> failing me right now- hopefully this'll at least get you started,
> though.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Joel Nothman 2009-06-22 14:45:45 Re: Select ranges based on sequential breaks
Previous Message Tom Lane 2009-06-22 13:48:03 Re: Killing a data modifying transaction