Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)

From: mariusz <marius(at)mtvk(dot)pl>
To: Ken Tanzer <ken(dot)tanzer(at)gmail(dot)com>
Cc: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Given a set of daterange, finding the continuous range that includes a particular date (aggregates)
Date: 2018-02-27 09:16:22
Message-ID: 1519722982.32586.133.camel@mtvk.pl
Lists: pgsql-general

On Tue, 2018-02-27 at 10:03 +0100, mariusz wrote:
> On Fri, 2018-02-23 at 18:11 -0800, Ken Tanzer wrote:
> > On Fri, Feb 23, 2018 at 6:10 AM,
> > mariusz <marius(at)mtvk(dot)pl> wrote:
> >
> >
> > i guess, you can easily get max continuous range for each row
> > with
> > something like this:
> >
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2
> > ELSE d2 END;
> > $$;
> >
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> >
> > SELECT dr,
> > lower(agg_daterange(dr) OVER (ORDER BY dr ASC)),
> > upper(agg_daterange(dr) OVER (ORDER BY dr DESC))
> > FROM ...
> >
> > the above example is simplified to selecting only the daterange
> > column "dr" for readability; in your case it should be something like
> >
> > daterange(staff_assign_date,staff_assign_date_end,'[)')
> >
> > please note that the daterange will be normalized to '[)' format, so
> > upper() above returns exactly your max "staff_assign_date_end" for
> > each continuous range when the dateranges are created with '[)'
> > format.
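To see why the bound style matters for the merge test, the overlap (&&), adjacency (-|-) and union (+) operators can be probed directly; a sketch, assuming a psql session:

```sql
-- '[]' input normalizes to '[)', so consecutive years become adjacent:
SELECT daterange('1990-01-01','1990-12-31','[]')
   -|- daterange('1991-01-01','1991-12-31','[]');   -- true

-- with '[)' bounds the same dates leave a one-day gap (Dec 31):
SELECT daterange('1990-01-01','1990-12-31','[)')
   -|- daterange('1991-01-01','1991-12-31','[)');   -- false

-- union of adjacent ranges, as used in the THEN d1 + d2 branch:
SELECT daterange('1990-01-01','1991-01-01')
     + daterange('1991-01-01','1992-01-01');        -- [1990-01-01,1992-01-01)
```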
> >
> > the key point is the "... ELSE d2" in append_daterange(), which
> > starts over with the new value each time that value is discontinuous
> > with the aggregate's state value, combined with the order in which
> > the rows are processed (ASC for the lower bound of the range, DESC
> > for the upper bound).
> >
> > unfortunately this involves reading all rows for the "client_id" and
> > an additional sort for each window. i recall reading that you already
> > pay the price of reading all rows for the client_id anyway, so the
> > only question is the cost of the two additional sorts (maybe
> > extracting the dateranges to a subset on which to do the windows, and
> > rejoining the resulting continuous ranges to the original set, would
> > help lower the cost).
> >
> >
> > Thank you, and I wanted to follow up on this. I couldn't quite get
> > your example working as described, but I also ended up trying
> > something very similar that got me very close but not quite there.
> > Basically, I can see that it is correctly calculating the ranges (in
> > the notices), but it is only returning the last range for each client.
> > (Because I said PARTITION BY client_id).
> >
> >
> sorry for the late reply, i was offline from sat to mon inclusive.
> i may have added some confusion previously, so i'll try to explain
> what i had in mind. see below for my suggested query.
>
>
> > So I'm not sure if I should be calling this differently, or if the
> > function needs to work differently, or if this just isn't possible.
> > Do I need to partition by something else, and if so what? I don't see
> > what I could specify that would indicate a new episode.
> >
> you definitely want to partition by client_id if you are calculating
> this for multiple client_ids, but what matters here, for each
> client_id, is the order of the dateranges in over()
>
> >
> > Also, it's not clear to me how an aggregate might define/return
> > different values within a partition. Although this must be possible,
> > since functions like rank() and row_number() seem to do it.
> >
> that is because the function is defined as an aggregate, but when used
> as a window function its current state value (the aggregate's return
> value) is returned for every row within the given window, without
> grouping the resultset
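This also explains why the window frame matters: with only PARTITION BY, every row sees the aggregate over the whole partition, while adding ORDER BY gives each row the running state up to itself. A minimal sketch using sum() rather than the daterange aggregate, just to show the two frames:

```sql
SELECT x,
       sum(x) OVER (PARTITION BY g)            AS whole_partition,  -- 6 for every row
       sum(x) OVER (PARTITION BY g ORDER BY x) AS running           -- 1, 3, 6
FROM (VALUES (1,1),(1,2),(1,3)) AS t(g,x);
```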
> >
> > Hoping there is something easy that can be tweaked here. See below
> > for copy/pastable test stuff. It includes output from both functions.
> > Both look to be returning the same results, which makes me wonder if
> > my passing in a start date was a waste of time, though it seems to me
> > it would be necessary.
> >
> > Cheers,
> > Ken
> >
> >
> > BEGIN;
> > CREATE TEMP TABLE sample_data (
> > client_id INTEGER,
> > start_date DATE,
> > end_date DATE,
> > episode INTEGER -- Just a label, for clarity
> > );
> >
> > INSERT INTO sample_data VALUES
> > (1,'1990-01-01','1990-12-31',1),
> > (1,'1991-01-01','1991-12-31',1),
> >
> > (1,'2000-01-01','2000-12-31',2),
> > (1,'2001-01-01','2001-12-31',2),
> > (1,'2002-01-01','2002-12-31',2),
> > (1,'2003-01-01','2003-12-31',2),
> > (1,'2004-01-01','2004-12-31',2),
> > (1,'2005-01-01','2005-12-31',2),
> > (1,'2006-01-01','2006-12-31',2),
> >
> > (1,'2014-01-01','2014-12-31',3),
> > (1,'2015-01-01','2015-12-31',3),
> > (1,'2017-06-30','2017-12-31',4),
> > (1,'2018-01-01',NULL,4),
> >
> > (2,'2014-02-01','2015-01-31',1),
> > (2,'2015-02-01','2015-12-31',1),
> > (2,'2017-09-30','2018-01-31',2),
> > (2,'2018-02-01','2018-02-14',2)
> > ;
> >
> > CREATE OR REPLACE FUNCTION
> > append_daterange(d1 daterange, d2 daterange)
> > RETURNS daterange
> > LANGUAGE sql
> > AS
> > $$
> > SELECT CASE WHEN d1 && d2 OR d1 -|- d2 THEN d1 + d2 ELSE d2
> > END;
> > $$;
> >
> > CREATE AGGREGATE agg_daterange (daterange) (
> > sfunc = append_daterange,
> > stype = daterange
> > );
> >
>
>
> based on your sample data see the query:
>
> select client_id, d,
> daterange(lower(test_agg_daterange(d)
> over (partition by client_id
> order by d)),
> upper(test_agg_daterange(d)
> over (partition by client_id
> order by d desc)),
> '[)'
> ) as cont_range
> from (select *, daterange(start_date,end_date,'[]') as d
> from sample_data
> ) q
> order by 1, 2;
>

that test_agg_daterange() is of course exactly the same as the
agg_daterange() aggregate above; sorry for not cleaning up my examples,
which were lazily edited and copied from my psql history.

regards, mj

> client_id | d | cont_range
> -----------+-------------------------+-------------------------
> 1 | [1990-01-01,1991-01-01) | [1990-01-01,1992-01-01)
> 1 | [1991-01-01,1992-01-01) | [1990-01-01,1992-01-01)
> 1 | [2000-01-01,2001-01-01) | [2000-01-01,2007-01-01)
> 1 | [2001-01-01,2002-01-01) | [2000-01-01,2007-01-01)
> 1 | [2002-01-01,2003-01-01) | [2000-01-01,2007-01-01)
> 1 | [2003-01-01,2004-01-01) | [2000-01-01,2007-01-01)
> 1 | [2004-01-01,2005-01-01) | [2000-01-01,2007-01-01)
> 1 | [2005-01-01,2006-01-01) | [2000-01-01,2007-01-01)
> 1 | [2006-01-01,2007-01-01) | [2000-01-01,2007-01-01)
> 1 | [2014-01-01,2015-01-01) | [2014-01-01,2016-01-01)
> 1 | [2015-01-01,2016-01-01) | [2014-01-01,2016-01-01)
> 1 | [2017-06-30,2018-01-01) | [2017-06-30,)
> 1 | [2018-01-01,) | [2017-06-30,)
> 2 | [2014-02-01,2015-02-01) | [2014-02-01,2016-01-01)
> 2 | [2015-02-01,2016-01-01) | [2014-02-01,2016-01-01)
> 2 | [2017-09-30,2018-02-01) | [2017-09-30,2018-02-15)
> 2 | [2018-02-01,2018-02-15) | [2017-09-30,2018-02-15)
> (17 rows)
>
> and now to clear up the confusion i added previously regarding the
> daterange input format (i'm working with something similar, but with
> dateranges stored in the db, not generated ad hoc from start and end
> dates).
> you need to generate daterange(start_date,end_date,'[]') for the -|-
> operator to work correctly, but the output daterange will be
> normalized to '[)' format, so it's up to you to extract the end_date
> of the continuous range (if you prefer dates over dateranges), which
> would be something like (upper(cont_range)-'1 day'::interval)::date
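For example, since upper() returns a date here, subtracting an integer works as well (a sketch; note upper() is NULL for a range with no upper bound):

```sql
SELECT (upper(daterange('2014-01-01','2014-12-31','[]')) - 1) AS end_date;
-- 2014-12-31, recovering the inclusive end date from the normalized '[)' range
```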
>
> the example above is for showing how it works: it returns the max
> continuous range containing each given row.
> you probably want
> select distinct on (client_id,cont_range)
> or something like that to extract the continuous ranges themselves
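Spelled out as a sketch (wrapping the earlier windowed query in a CTE; names follow the sample data above):

```sql
WITH w AS (
    SELECT client_id, d,
           daterange(lower(agg_daterange(d) OVER (PARTITION BY client_id ORDER BY d)),
                     upper(agg_daterange(d) OVER (PARTITION BY client_id ORDER BY d DESC)),
                     '[)') AS cont_range
    FROM (SELECT *, daterange(start_date, end_date, '[]') AS d
          FROM sample_data) q
)
-- one row per distinct continuous range per client
SELECT DISTINCT ON (client_id, cont_range) client_id, cont_range
FROM w
ORDER BY client_id, cont_range;
```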
>
> the first window (within the lower() func) extends its initial
> (earliest) range to the right for every next row continuous with the
> current range (and jumps to a new start on discontinuity), thus
> over(order by d ASC).
> the second window (within the upper() func) extends its initial
> (latest) range to the left for every next row continuous with the
> current range (and jumps to a new end on discontinuity), thus
> over(order by d DESC).
> the partition by client_id within each over() is there to keep
> different client_ids' dateranges from mixing.
>
>
> i guess that is more or less what you wanted, at least as i understood
> you originally; if not, then sorry for adding even more confusion, or
> for unnecessarily explaining simple and obvious things you might
> already know
>
> regards,
> mariusz jadczak
>
>
> > CREATE OR REPLACE FUNCTION range_continuous_merge( daterange,
> > daterange, date ) RETURNS daterange AS $$
> > DECLARE
> >     res daterange;
> > BEGIN
> >     res := CASE
> >         WHEN $1 IS NULL AND NOT $2 @> $3 THEN NULL
> >         WHEN $1 IS NULL AND $2 @> $3 THEN $2
> >         WHEN ($1 && $2) OR ($1 -|- $2) THEN $1 + $2
> >         WHEN NOT $1 @> $3 THEN $2
> >         ELSE $1
> >     END;
> >     RAISE NOTICE 'Inputs: %,%,%. Returning %', $1::text, $2::text, $3::text, res;
> >     RETURN res;
> > END;
> > $$ LANGUAGE plpgsql STABLE;
> >
> > CREATE AGGREGATE range_continuous( daterange, date ) (
> >     sfunc = range_continuous_merge,
> >     stype = daterange
> >     -- initcond = '{0,0,0}'
> > );
> >
> > SELECT client_id, episode, start_date, end_date,
> >        range_continuous(daterange(start_date,end_date,'[]'), start_date)
> >            OVER (PARTITION BY client_id)
> > FROM sample_data;
> >
> > SELECT client_id, episode, start_date, end_date,
> >        agg_daterange(daterange(start_date,end_date,'[]'))
> >            OVER (PARTITION BY client_id)
> > FROM sample_data;
> >
> >
> >
> > -- RANGE_CONTINUOUS_MERGE
> >
> > NOTICE: Inputs: <NULL>,[1990-01-01,1991-01-01),1990-01-01. Returning [1990-01-01,1991-01-01)
> > NOTICE: Inputs: [1990-01-01,1991-01-01),[1991-01-01,1992-01-01),1991-01-01. Returning [1990-01-01,1992-01-01)
> > NOTICE: Inputs: [1990-01-01,1992-01-01),[2000-01-01,2001-01-01),2000-01-01. Returning [2000-01-01,2001-01-01)
> > NOTICE: Inputs: [2000-01-01,2001-01-01),[2001-01-01,2002-01-01),2001-01-01. Returning [2000-01-01,2002-01-01)
> > NOTICE: Inputs: [2000-01-01,2002-01-01),[2002-01-01,2003-01-01),2002-01-01. Returning [2000-01-01,2003-01-01)
> > NOTICE: Inputs: [2000-01-01,2003-01-01),[2003-01-01,2004-01-01),2003-01-01. Returning [2000-01-01,2004-01-01)
> > NOTICE: Inputs: [2000-01-01,2004-01-01),[2004-01-01,2005-01-01),2004-01-01. Returning [2000-01-01,2005-01-01)
> > NOTICE: Inputs: [2000-01-01,2005-01-01),[2005-01-01,2006-01-01),2005-01-01. Returning [2000-01-01,2006-01-01)
> > NOTICE: Inputs: [2000-01-01,2006-01-01),[2006-01-01,2007-01-01),2006-01-01. Returning [2000-01-01,2007-01-01)
> > NOTICE: Inputs: [2000-01-01,2007-01-01),[2014-01-01,2015-01-01),2014-01-01. Returning [2014-01-01,2015-01-01)
> > NOTICE: Inputs: [2014-01-01,2015-01-01),[2015-01-01,2016-01-01),2015-01-01. Returning [2014-01-01,2016-01-01)
> > NOTICE: Inputs: [2014-01-01,2016-01-01),[2017-06-30,2018-01-01),2017-06-30. Returning [2017-06-30,2018-01-01)
> > NOTICE: Inputs: [2017-06-30,2018-01-01),[2018-01-01,),2018-01-01. Returning [2017-06-30,)
> > NOTICE: Inputs: <NULL>,[2014-02-01,2015-02-01),2014-02-01. Returning [2014-02-01,2015-02-01)
> > NOTICE: Inputs: [2014-02-01,2015-02-01),[2015-02-01,2016-01-01),2015-02-01. Returning [2014-02-01,2016-01-01)
> > NOTICE: Inputs: [2014-02-01,2016-01-01),[2017-09-30,2018-02-01),2017-09-30. Returning [2017-09-30,2018-02-01)
> > NOTICE: Inputs: [2017-09-30,2018-02-01),[2018-02-01,2018-02-15),2018-02-01. Returning [2017-09-30,2018-02-15)
> >
> > client_id | episode | start_date | end_date | range_continuous
> > -----------+---------+------------+------------+-------------------------
> > 1 | 1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> > 1 | 1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> > 1 | 2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> > 1 | 2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> > 1 | 2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> > 1 | 2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> > 1 | 2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> > 1 | 2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> > 1 | 2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> > 1 | 3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> > 1 | 3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> > 1 | 4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> > 1 | 4 | 2018-01-01 | | [2017-06-30,)
> > 2 | 1 | 2014-02-01 | 2015-01-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 1 | 2015-02-01 | 2015-12-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 2 | 2017-09-30 | 2018-01-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 2 | 2018-02-01 | 2018-02-14 |
> > [2017-09-30,2018-02-15)
> >
> > (17 rows)
> >
> >
> > -- AGG_DATERANGE
> >
> > client_id | episode | start_date | end_date | agg_daterange
> > -----------+---------+------------+------------+-------------------------
> > 1 | 1 | 1990-01-01 | 1990-12-31 | [2017-06-30,)
> > 1 | 1 | 1991-01-01 | 1991-12-31 | [2017-06-30,)
> > 1 | 2 | 2000-01-01 | 2000-12-31 | [2017-06-30,)
> > 1 | 2 | 2001-01-01 | 2001-12-31 | [2017-06-30,)
> > 1 | 2 | 2002-01-01 | 2002-12-31 | [2017-06-30,)
> > 1 | 2 | 2003-01-01 | 2003-12-31 | [2017-06-30,)
> > 1 | 2 | 2004-01-01 | 2004-12-31 | [2017-06-30,)
> > 1 | 2 | 2005-01-01 | 2005-12-31 | [2017-06-30,)
> > 1 | 2 | 2006-01-01 | 2006-12-31 | [2017-06-30,)
> > 1 | 3 | 2014-01-01 | 2014-12-31 | [2017-06-30,)
> > 1 | 3 | 2015-01-01 | 2015-12-31 | [2017-06-30,)
> > 1 | 4 | 2017-06-30 | 2017-12-31 | [2017-06-30,)
> > 1 | 4 | 2018-01-01 | | [2017-06-30,)
> > 2 | 1 | 2014-02-01 | 2015-01-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 1 | 2015-02-01 | 2015-12-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 2 | 2017-09-30 | 2018-01-31 |
> > [2017-09-30,2018-02-15)
> > 2 | 2 | 2018-02-01 | 2018-02-14 |
> > [2017-09-30,2018-02-15)
> > (17 rows)
> >
> >
> >
> >
> >
> >
> > --
> >
> > AGENCY Software
> > A Free Software data system
> > By and for non-profits
> > http://agency-software.org/
> > https://demo.agency-software.org/client
> >
> > ken(dot)tanzer(at)agency-software(dot)org
> > (253) 245-3801
> >
> >
> > Subscribe to the mailing list to
> > learn more about AGENCY or
> > follow the discussion.
>
>
>
