Measure Theoretic Data Types in Postgresql

From: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Measure Theoretic Data Types in Postgresql
Date: 2012-10-11 04:37:02
Message-ID: CADyahXZBmBS4ceqjPYidpFN8pA7GzE6PaBeCBd3SWM47ZFRbMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Release 9.2 introduced the range data types, which is a much needed
extension for a large number applications, particularly in statistics, and
measurement. The natural generalization of a range is a finite partition of
the underlying abstract data space into disjoint covering ranges; for
example two partitions of the real line could be

{(-\infty, -1], (-1, 0.5], (0.5, 0.75), [0.75, 25], (25, \infty)}

and

{(-\infty, -1), [-1, 1), [1, 1], (1, 2), [2, \infty)}

To be of use we have to assign a value to each range in the partition, this
defines a simple measurable function (see Billingsley "Probability and
Measure" or Rudin "Principles of Mathematical Analysis" chapter 11, for
introductory material on Measure Theory). For the two example partitions we
could have two functions mapping to a character datatype:

f : {(-\infty, -1] -> NULL, (-1, 0.5] -> 'A', (0.5, 0.75) -> 'B', [0.75,
25] -> 'A', (25, \infty) -> 'B'}

and

g : {(-\infty, -1) -> 'A', [-1, 1) -> 'B', [1, 1] -> 'C', (1, 2) -> NULL,
[2, \infty) -> 'A'}

The simple functions can be stored using a composite datatype containing
three arrays: an n element array of ascending boundary points, an n element
array of boundary topologies either left open ')[' or right open '](', and
an n+1 element array of values assigned to each range in the partition.
With the functions stored we can define point-wise operators for summation,
product, variance, etc... and more interestingly array and string
concatenation, and count distinct; the last example would result in the new
function:

h = count_distinct(f, g) : {(-\infty, -1] -> 1, (-1, 0.5] -> 2, (0.5, 0.75)
-> 1, [0.75, 1] -> 2, (1, 25] -> 1, (25, \infty) -> 2}

Which can be generalized to point-wise aggregates over finite lists of
measurable functions. While the point-wise aggregate functions are powerful
in and of themselves, the true power would be realised when dis-aggregating
the data type into the original ranges that form the partition. This would
allow for a succinct syntax to do calculations such as finding the daily
unique patient count given the intervals of their attendance in particular
programs; a computation I encounter routinely as a statistician for a
health services provider.

I have begun sketching these ideas in off the shelf pgSQL using composite
types and constructor functions; but am far from tackling anything like
building external C extensions to handle the data types. I can set-up a
GitHub repository if anyone is interested in seeing where I am taking this.
So far I have defined the composite types (lots of them) and the
constructor functions 'indicator(lower topology, lower point, upper point,
upper topology)' which creates an indicator (0/1) function of a range, and
'measure(lower topology, lower point, upper point, upper topology, value)'
which creates a measure function of the range; after this I will tackle the
dis-aggregation functions, the operators, and finally the aggregation
functions.

I realize similar work has been done in this vain using the set of sets
idea, but this work is meant to be an implementation of formal measure
theoretic concepts, and thus the algorithmic behaviour can be analysed and
proven using the tool kit of measure theory.

Aaron Sheldon

aaron(dot)sheldon(at)gmail(dot)com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2012-10-11 07:15:47 Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)
Previous Message Etsuro Fujita 2012-10-11 03:10:36 Re: Minor document updates