Re: Measure Theoretic Data Types in Postgresql

From: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-12 16:43:21
Message-ID: CADyahXbNVZkYv7+r3z-76Fce-xY8Yjeucrc1v_4bZ_=dOfigSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

So the key algorithmic inefficient is the inner join on the generated
series. Worst case scenario this compares every range to every date in the
series, which for m ranges and n dates yields O(m*n) operations. The
analysts in my shop currently write queries like this for billions of
records against thousands of dates and then go and take 8 hour coffee
breaks.

However, by realizing that the bounds on the ranges have a linear ordering
one can speed this up to 0(m) using windowing functions on common table
expressions.

So what I am proposing is formalizing this optimization into a class of
data types, that will hide the implementation details.

On Fri, Oct 12, 2012 at 1:48 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 11.10.2012 07:37, Aaron Sheldon wrote:
>
>> This would allow for a succinct syntax to do calculations such as
>> finding the daily unique patient count given the intervals of their
>> attendance in particular programs; a computation I encounter
>> routinely as a statistician for a health services provider.
>>
>
> Hmm. It's easy to get the count of unique patients on a particular date
> with something like:
>
> select count(distinct patient) from attendance where interval &&
> '2012-10-12'::date
>
> I guess what you're after is to get that count for a range of days, in one
> query, so that the result looks something like this:
>
> date | patients
> -----------+------------
> 2012-10-05 | 20
> 2012-10-06 | 24
> 2012-10-07 | 30
> 2012-10-08 | 29
>
> The way I think of that problem is that you need to join the dates you're
> interested in with the attendance table.
>
> select date, count (distinct patientid)
> from attendance
> inner join (
> select '2012-10-04'::date + a AS date from generate_series(1,20) a
> ) dates on interval @> date
> group by date;
> date | count
> ------------+-------
> 2012-10-05 | 11
> 2012-10-06 | 27
> 2012-10-07 | 47
> 2012-10-08 | 63
> 2012-10-09 | 83
> 2012-10-10 | 95
> 2012-10-11 | 80
> 2012-10-12 | 60
> 2012-10-13 | 35
> 2012-10-14 | 13
> (10 rows)
>
> I created the test table for that with:
>
> create table attendance (patientid int4 , interval daterange)
> insert into attendance select id, daterange('2012-10-05'::date +
> (random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from
> generate_series(1,100) id;
>
>
> So, I think the current range types already cover that use case pretty
> well. I can't imagine how the proposed measure theoretic concepts would
> make that simpler. Can you give some more complicated problem, perhaps,
> that the proposed measure theoretic concepts would make simpler than the
> current tools?
>
> - Heikki
>

--
Aaron Sheldon

#67 Westedge Townhouses
5019 46 Ave, SW
Calgary AB, T3E 6R1

(h) 1.403.453.6316
(c) 1.403.710.9357
aaron(dot)sheldon(at)gmail(dot)com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2012-10-12 16:55:49 Re: Deparsing DDL command strings
Previous Message Dimitri Fontaine 2012-10-12 16:41:49 Re: Truncate if exists