Re: Query to find contiguous ranges on a column

From: Tim Landscheidt <tim(at)tim-landscheidt(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Query to find contiguous ranges on a column
Date: 2009-10-13 22:12:58
Message-ID: m3r5t754et.fsf@passepartout.tim-landscheidt.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Peter Hunsberger <peter(dot)hunsberger(at)gmail(dot)com> wrote:

> [...]
> I have one solution that joins the table against itself and does
> (among other things) a subselect looking "not exists col +1" and "not
> exists col -1" on the two instances of the table to find the start and
> end. This is, as you might guess, is not very efficient (my actual
> data is some 6 million+ rows) and I'm guessing there has to be
> something more efficient with windowing or possibly grouping on min
> and max (though I can't see how to make sure they are part of a
> contiguous set). Anyone have any ideas?

You can either use a PL/pgSQL function ("SETOF TEXT" just
for the convenience of the example):

| CREATE FUNCTION SummarizeRanges () RETURNS SETOF TEXT AS $$
| DECLARE
| CurrentFirst INT;
| CurrentLast INT;
| CurrentRecord RECORD;
| BEGIN
| FOR CurrentRecord IN SELECT col FROM t ORDER BY col LOOP
| IF CurrentFirst IS NULL THEN
| CurrentFirst := CurrentRecord.col;
| CurrentLast := CurrentRecord.col;
| ELSIF CurrentRecord.col = CurrentLast + 1 THEN
| CurrentLast := CurrentRecord.col;
| ELSE
| RETURN NEXT CurrentFirst || ', ' || CurrentLast;
| CurrentFirst := CurrentRecord.col;
| CurrentLast := CurrentRecord.col;
| END IF;
| END LOOP;
| IF CurrentFirst IS NOT NULL THEN
| RETURN NEXT CurrentFirst || ', ' || CurrentLast;
| END IF;
| RETURN;
| END;
| $$ LANGUAGE plpgsql;

or a recursive query (which I always find very hard to com-
prehend):

| WITH RECURSIVE RecCols (LeftBoundary, Value) AS
| (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
| UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE c.col = p.Value + 1)
| SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
| GROUP BY LeftBoundary
| ORDER BY LeftBoundary;

Could you run both against your data set and find out which
one is faster for your six million rows?

Tim

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tim Landscheidt 2009-10-13 22:22:04 Re: Procedure for feature requests?
Previous Message Greg Smith 2009-10-13 21:44:38 Re: Re: [GENERAL] contrib/plantuner - enable PostgreSQL planner hints