Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

From: mariusz <marius(at)mtvk(dot)pl>
To: Ken Tanzer <ken(dot)tanzer(at)gmail(dot)com>
Cc: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)
Date: 2018-02-27 11:26:10
Message-ID: 1519730770.32586.156.camel@mtvk.pl


hello,

one more fix, so that nobody walks away with incorrect or incomplete
ideas; see below

On Tue, 2018-02-27 at 10:03 +0100, mariusz wrote:
> On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> > On Fri, Feb 23, 2018 at 6:10 AM,
> > mariusz <marius(at)mtvk(dot)pl> wrote:
> >
> >
> > i guess you can easily get the max continuous range for each row
> > with something like this:
> >
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> > ELSE d2 END;
> > $$;
> >
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> >
> > SELECT dr,
> > lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
> > upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
> > FROM ...
> >
> > above example is simplified to selecting only the daterange column
> > "dr" for readability, which in your case should be something like
> >
> > daterange(staff_assign_date,staff_assign_date_end,'[)')
> >
> > please note that the daterange will be normalized to '[)' format, so
> > upper() above returns exactly your max "staff_assign_date_end" for
> > each continuous range when the dateranges are created with '[)'
> > format.
> >
> > the key point is the "... ELSE d2" in append_daterange(), which
> > restarts with the new value each time that value is discontinuous
> > with the agg's state value, together with the order in which rows
> > are processed (ASC for the lower bound of the range, DESC for the
> > upper bound).
> >
> > unfortunately this involves reading all rows for "client_id" and an
> > additional sort for each window. i recall reading that you already
> > pay the price of reading all rows for client_id anyway, so the only
> > question is the cost of the two additional sorts (maybe extracting
> > the dateranges to a subset on which to do the windows, and rejoining
> > the resulting continuous ranges to the original set, would help
> > lower the cost).
> >
> >
> > Thank you, and I wanted to follow up on this. I couldn't quite get
> > your example working as described, but I also ended up trying
> > something very similar that got me very close but not quite there.
> > Basically, I can see that it is correctly calculating the ranges (in
> > the notices), but it is only returning the last range for each client.
> > (Because I said PARTITION BY client_id).
> >
> >
> sorry for the late reply, i was offline from sat to mon inclusive.
> i may have previously added some confusion, so i'll try to explain
> what i had in mind. see below for my suggested query.
>
>
> > So I'm not sure if I should be calling this differently, or if the
> > function needs to work differently, or if this just isn't possible.
> > Do I need to partition by something else, and if so what? I don't see
> > what I could specify that would indicate a new episode.
> >
> definitely you want to partition by client_id if you are calculating
> this for multiple client_ids, but what matters here for each client_id
> is the order of the dateranges in over()
>
> >
> > Also, it's not clear to me how an aggregate might define/return
> > different values within a partition. Although this must be possible,
> > since functions like rank() and row_number() seem to do it.
> >
> that is because the function is defined as an aggregate, but when used
> as a window function its current state value (the aggregate's return
> value) is returned for every row within the given window, without
> grouping the result set
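
as a minimal illustration of that point (using plain sum() over a
throwaway VALUES list, nothing from this thread), the aggregate's
running state is emitted once per row rather than once per group:

```sql
-- an aggregate used with OVER () is not grouped: its running state
-- is returned for every row in the window frame
SELECT x,
       sum(x) OVER (ORDER BY x) AS running_sum
FROM (VALUES (1), (2), (3)) AS t(x);
--  x | running_sum
--  1 | 1
--  2 | 3
--  3 | 6
```

the custom daterange aggregates above behave the same way: each row
sees the state accumulated over the frame up to that row.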
> >
> > Hoping there is something easy that can be tweaked here. See below
> > for copy/pastable test stuff. It includes output from both functions.
> > Both look to be returning the same results, which makes me wonder if
> > my passing in a start date was a waste of time, though it seems to me
> > it would be necessary.
> >
> > Cheers,
> > Ken
> >
> >
> > BEGIN;
> > CREATE TEMP TABLE sample_data (
> > client_id INTEGER,
> > start_date DATE,
> > end_date DATE,
> > episode INTEGER -- Just a label, for clarity
> > );
> >
> > INSERT INTO sample_data VALUES
> > (1,'1990-01-01','1990-12-31',1),
> > (1,'1991-01-01','1991-12-31',1),
> >
> > (1,'2000-01-01','2000-12-31',2),
> > (1,'2001-01-01','2001-12-31',2),
> > (1,'2002-01-01','2002-12-31',2),
> > (1,'2003-01-01','2003-12-31',2),
> > (1,'2004-01-01','2004-12-31',2),
> > (1,'2005-01-01','2005-12-31',2),
> > (1,'2006-01-01','2006-12-31',2),
> >
> > (1,'2014-01-01','2014-12-31',3),
> > (1,'2015-01-01','2015-12-31',3),
> > (1,'2017-06-30','2017-12-31',4),
> > (1,'2018-01-01',NULL,4),
> >
> > (2,'2014-02-01','2015-01-31',1),
> > (2,'2015-02-01','2015-12-31',1),
> > (2,'2017-09-30','2018-01-31',2),
> > (2,'2018-02-01','2018-02-14',2)
> > ;
> >
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2
> > END;
> > $$;
> >
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> >
>
>
> based on your sample data see the query:
>
> select client_id, d,
>        daterange(lower(agg_daterange(d)
>                        over (partition by client_id
>                              order by d)),
>                  upper(agg_daterange(d)
>                        over (partition by client_id
>                              order by d desc)),
>                  '[)'
>                 ) as cont_range
> from (select *, daterange(start_date,end_date,'[]') as d
>       from sample_data
>      ) q
> order by 1, 2;

that should work with your non-overlapping ranges, and should also work
with overlapping ranges, except for ranges strictly contained within
another. that may not apply to your case, but for safety (we can't tell
who is going to read this and pick up ideas), i feel it's better to
correct the second window, the one inside upper(), to:

OVER (PARTITION BY client_id ORDER BY end_date DESC)

likewise, the first window (inside lower()) could be sorted by
start_date asc, but that is only a cosmetic change, while for the upper
bound of the range the sort key does matter
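
putting that fix in place, the full query would read roughly like this
(a sketch, not re-run here; it reuses the agg_daterange aggregate and
the sample_data columns from earlier in the thread):

```sql
SELECT client_id, d,
       daterange(lower(agg_daterange(d)
                       OVER (PARTITION BY client_id
                             ORDER BY start_date ASC)),
                 upper(agg_daterange(d)
                       OVER (PARTITION BY client_id
                             -- DESC sorts NULL end_dates first by default,
                             -- so open-ended ranges are processed first
                             ORDER BY end_date DESC)),
                 '[)'
       ) AS cont_range
FROM (SELECT *, daterange(start_date, end_date, '[]') AS d
      FROM sample_data) q
ORDER BY 1, 2;
```

sorting the second window by end_date rather than by the whole range
handles the case of a range strictly contained within another.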

regards, mj

> client_id | d | cont_range
> -----------+-------------------------+-------------------------
> 1 | [1990-01-01,1991-01-01) | [1990-01-01,1992-01-01)
> 1 | [1991-01-01,1992-01-01) | [1990-01-01,1992-01-01)
> 1 | [2000-01-01,2001-01-01) | [2000-01-01,2007-01-01)
> 1 | [2001-01-01,2002-01-01) | [2000-01-01,2007-01-01)
> 1 | [2002-01-01,2003-01-01) | [2000-01-01,2007-01-01)
> 1 | [2003-01-01,2004-01-01) | [2000-01-01,2007-01-01)
> 1 | [2004-01-01,2005-01-01) | [2000-01-01,2007-01-01)
> 1 | [2005-01-01,2006-01-01) | [2000-01-01,2007-01-01)
> 1 | [2006-01-01,2007-01-01) | [2000-01-01,2007-01-01)
> 1 | [2014-01-01,2015-01-01) | [2014-01-01,2016-01-01)
> 1 | [2015-01-01,2016-01-01) | [2014-01-01,2016-01-01)
> 1 | [2017-06-30,2018-01-01) | [2017-06-30,)
> 1 | [2018-01-01,) | [2017-06-30,)
> 2 | [2014-02-01,2015-02-01) | [2014-02-01,2016-01-01)
> 2 | [2015-02-01,2016-01-01) | [2014-02-01,2016-01-01)
> 2 | [2017-09-30,2018-02-01) | [2017-09-30,2018-02-15)
> 2 | [2018-02-01,2018-02-15) | [2017-09-30,2018-02-15)
> (17 rows)
>
> and now it's time to explain the confusion i added previously
> regarding the daterange input format (i'm working with something
> similar, but with dateranges stored in the db, not generated ad hoc
> from start and end dates).
> you need to generate daterange(start_date,end_date,'[]') for the -|-
> operator to work correctly, but the output of daterange will be
> normalized to '[)' format, so it's up to you to extract the end_date
> of the continuous range (if you prefer dates over dateranges), which
> is going to be something like
> (upper(cont_range)-'1 day'::interval)::date
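
for instance, that end-date extraction on a concrete range literal:

```sql
-- upper() of a '[)' range is the exclusive bound; step back one day
-- to recover the inclusive end_date
SELECT (upper('[2017-09-30,2018-02-15)'::daterange)
        - '1 day'::interval)::date AS end_date;
-- end_date: 2018-02-14
```

note that upper() is NULL for an open-ended range like [2017-06-30,),
so the whole expression is NULL there, which is probably what you want
for a still-open episode.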
>
> the above example is meant to show how this works; it returns the max
> continuous range containing each row.
> you probably want:
> select distinct on (client_id, cont_range)
> or something like that to extract the continuous ranges themselves
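
a sketch of that extraction (untested here; it reuses agg_daterange and
sample_data from earlier in the thread, and collapses the per-row
output down to one row per continuous range):

```sql
SELECT DISTINCT ON (client_id, cont_range)
       client_id, cont_range
FROM (SELECT client_id,
             daterange(lower(agg_daterange(d)
                             OVER (PARTITION BY client_id
                                   ORDER BY d)),
                       upper(agg_daterange(d)
                             OVER (PARTITION BY client_id
                                   ORDER BY d DESC)),
                       '[)') AS cont_range
      FROM (SELECT *, daterange(start_date, end_date, '[]') AS d
            FROM sample_data) q
     ) w
ORDER BY client_id, cont_range;
```

DISTINCT ON keeps the first row of each (client_id, cont_range) pair,
which is exactly one row per continuous range.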
>
> the first window (inside the lower() call) extends its initial
> (earliest) range to the right for every next row that is continuous
> with the current range (and jumps to a new start on discontinuity),
> thus over(order by d ASC)
> the second window (inside the upper() call) extends its initial
> (latest) range to the left for every next row that is continuous with
> the current range (and jumps to a new end on discontinuity), thus
> over(order by d DESC)
> partition by client_id within each over() is there to keep different
> client_ids' dateranges from being mixed
>
>
> i guess that is more or less what you wanted, at least as i understood
> you originally; if not, then sorry for adding even more confusion, or
> for unnecessarily explaining simple and obvious things you might
> already know
>
> regards,
> mariusz jadczak
>
>
> > CREATE OR REPLACE FUNCTION range_continuous_merge( daterange, daterange, date )
> > RETURNS daterange AS $$
> >
> > DECLARE
> > res daterange;
> >
> > BEGIN
> > res:= CASE
> > WHEN $1 IS NULL AND NOT $2 @> $3 THEN NULL
> > WHEN $1 IS NULL AND $2 @> $3 THEN $2
> > WHEN ($1 && $2) OR ($1 -|- $2) THEN $1 + $2
> > WHEN NOT $1 @> $3 THEN $2
> > ELSE $1
> > END;
> > RAISE NOTICE 'Inputs: %,%,%. Returning %', $1::text, $2::text, $3::text, res;
> > RETURN res;
> > END;
> > $$ LANGUAGE plpgsql STABLE;
> >
> > CREATE AGGREGATE range_continuous( daterange, date ) (
> >
> > sfunc = range_continuous_merge,
> > stype = daterange
> > -- initcond = '{0,0,0}'
> >
> > );
> >
> > SELECT client_id, episode, start_date, end_date,
> >        range_continuous(daterange(start_date,end_date,'[]'), start_date)
> >            OVER (PARTITION BY client_id)
> > FROM sample_data;
> >
> > SELECT client_id, episode, start_date, end_date,
> >        agg_daterange(daterange(start_date,end_date,'[]'))
> >            OVER (PARTITION BY client_id)
> > FROM sample_data;
> >
> >
> >
> > -- RANGE_CONTINUOUS_MERGE
> >
> > NOTICE: Inputs: <NULL>,[1990-01-01,1991-01-01),1990-01-01. Returning [1990-01-01,1991-01-01)
> > NOTICE: Inputs: [1990-01-01,1991-01-01),[1991-01-01,1992-01-01),1991-01-01. Returning [1990-01-01,1992-01-01)
> > NOTICE: Inputs: [1990-01-01,1992-01-01),[2000-01-01,2001-01-01),2000-01-01. Returning [2000-01-01,2001-01-01)
> > NOTICE: Inputs: [2000-01-01,2001-01-01),[2001-01-01,2002-01-01),2001-01-01. Returning [2000-01-01,2002-01-01)
> > NOTICE: Inputs: [2000-01-01,2002-01-01),[2002-01-01,2003-01-01),2002-01-01. Returning [2000-01-01,2003-01-01)
> > NOTICE: Inputs: [2000-01-01,2003-01-01),[2003-01-01,2004-01-01),2003-01-01. Returning [2000-01-01,2004-01-01)
> > NOTICE: Inputs: [2000-01-01,2004-01-01),[2004-01-01,2005-01-01),2004-01-01. Returning [2000-01-01,2005-01-01)
> > NOTICE: Inputs: [2000-01-01,2005-01-01),[2005-01-01,2006-01-01),2005-01-01. Returning [2000-01-01,2006-01-01)
> > NOTICE: Inputs: [2000-01-01,2006-01-01),[2006-01-01,2007-01-01),2006-01-01. Returning [2000-01-01,2007-01-01)
> > NOTICE: Inputs: [2000-01-01,2007-01-01),[2014-01-01,2015-01-01),2014-01-01. Returning [2014-01-01,2015-01-01)
> > NOTICE: Inputs: [2014-01-01,2015-01-01),[2015-01-01,2016-01-01),2015-01-01. Returning [2014-01-01,2016-01-01)
> > NOTICE: Inputs: [2014-01-01,2016-01-01),[2017-06-30,2018-01-01),2017-06-30. Returning [2017-06-30,2018-01-01)
> > NOTICE: Inputs: [2017-06-30,2018-01-01),[2018-01-01,),2018-01-01. Returning [2017-06-30,)
> > NOTICE: Inputs: <NULL>,[2014-02-01,2015-02-01),2014-02-01. Returning [2014-02-01,2015-02-01)
> > NOTICE: Inputs: [2014-02-01,2015-02-01),[2015-02-01,2016-01-01),2015-02-01. Returning [2014-02-01,2016-01-01)
> > NOTICE: Inputs: [2014-02-01,2016-01-01),[2017-09-30,2018-02-01),2017-09-30. Returning [2017-09-30,2018-02-01)
> > NOTICE: Inputs: [2017-09-30,2018-02-01),[2018-02-01,2018-02-15),2018-02-01. Returning [2017-09-30,2018-02-15)
> >
> > client_id | episode | start_date | end_date | range_continuous
> > -----------+---------+------------+------------+-------------------------
> > 1 | 1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> > 1 | 1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> > 1 | 2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> > 1 | 2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> > 1 | 2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> > 1 | 2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> > 1 | 2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> > 1 | 2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> > 1 | 2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> > 1 | 3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> > 1 | 3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> > 1 | 4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> > 1 | 4 | 2018-01-01 | | [2017-06-30,)
> > 2 | 1 | 2014-02-01 | 2015-01-31 | [2017-09-30,2018-02-15)
> > 2 | 1 | 2015-02-01 | 2015-12-31 | [2017-09-30,2018-02-15)
> > 2 | 2 | 2017-09-30 | 2018-01-31 | [2017-09-30,2018-02-15)
> > 2 | 2 | 2018-02-01 | 2018-02-14 | [2017-09-30,2018-02-15)
> >
> > (17 rows)
> >
> >
> > -- AGG_DATERANGE
> >
> > client_id | episode | start_date | end_date | agg_daterange
> > -----------+---------+------------+------------+-------------------------
> > 1 | 1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> > 1 | 1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> > 1 | 2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> > 1 | 2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> > 1 | 2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> > 1 | 2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> > 1 | 2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> > 1 | 2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> > 1 | 2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> > 1 | 3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> > 1 | 3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> > 1 | 4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> > 1 | 4 | 2018-01-01 | | [2017-06-30,)
> > 2 | 1 | 2014-02-01 | 2015-01-31 | [2017-09-30,2018-02-15)
> > 2 | 1 | 2015-02-01 | 2015-12-31 | [2017-09-30,2018-02-15)
> > 2 | 2 | 2017-09-30 | 2018-01-31 | [2017-09-30,2018-02-15)
> > 2 | 2 | 2018-02-01 | 2018-02-14 | [2017-09-30,2018-02-15)
> > (17 rows)
> >
> >
> >
> >
> >
> >
> > --
> >
> > AGENCY Software
> > A Free Software data system
> > By and for non-profits
> > http://agency-software.org/
> > https://demo.agency-software.org/client
> >
> > ken(dot)tanzer(at)agency-software(dot)org
> > (253) 245-3801
> >
> >
> > Subscribe to the mailing list to
> > learn more about AGENCY or
> > follow the discussion.
>