[PROPOSAL] Temporal query processing with range types

From: Anton Dignös <dignoes(at)inf(dot)unibz(dot)it>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: Johann Gamper <gamper(at)inf(dot)unibz(dot)it>, Michael Böhlen <boehlen(at)ifi(dot)uzh(dot)ch>, Moser Peter <peter(dot)moser(at)unibz(dot)it>
Subject: [PROPOSAL] Temporal query processing with range types
Date: 2016-07-22 11:15:17
Message-ID: CALNdv1h7TUP24Nro53KecvWB2kwA67p+PByDuP6_1GeESTFgSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi hackers,

we are a group of researches that work on temporal databases. Our main focus is
the processing of data with time intervals, such as the range types in
PostgreSQL.

For this type of processing the range types that are stored with the tuples are
considered to represent the tuples' valid time, such as for instance the time
period an employee works for a project. Intuitively, the processing of temporal
operators corresponds to the use of "at each time point" in natural language.
For example, a temporal count on such a relation (temporal relation) shall
determine the count at each time point, while an anti-join shall compute an
anti-join at each time point.

We would like to contribute to PostgreSQL a solution that supports the query
processing of "at each time point". The basic idea is to offer two new
operators, NORMALIZE and ALIGN, whose purpose is to adjust (or split) the
ranges of tuples so that subsequent queries can use the usual grouping and
equality conditions to get the intended results.

Query processing works as follows:
<temporal operator> := <NORMALIZE | ALIGN> + <traditional operator>

Attached is a proposal patch file that implements this extension and on which
the example queries below can be run. The support includes distinct,
aggregation, set operations, Cartesian product, Join, Left Join, Right Join,
Full Join, and Anti-join.

In our prototype the syntax for the two new operators is:

... FROM (table_ref NORMALIZE table_ref
USING(att_list)
WITH (columnref, columnref)
) alias_clause ...

and

... FROM (table_ref ALIGN table_ref
ON a_expr
WITH (columnref, columnref)
) alias_clause ...

where NORMALIZE is used for distinct, aggregation and set operations and ALIGN
is used for Cartesian product, Join, Left Join, Right Join, Full Join, and
Anti-join.

-------------------------------------------------------------------------------
EXAMPLE FOR AGGREGATION
-------------------------------------------------------------------------------

Relation 'projects' records when employees are assigned to projects.

CREATE TABLE projects (empl VARCHAR, proj VARCHAR, pay FLOAT, t DATERANGE);
INSERT INTO projects VALUES
('Ann', 'P1', 60, DATERANGE('2016-01-01', '2016-09-01')),
('Sam', 'P1', 40, DATERANGE('2016-06-01', '2016-12-01')),
('Joe', 'P2', 80, DATERANGE('2016-03-01', '2016-06-01'));

TABLE projects;

empl | proj | pay | t
------+------+-----+-------------------------
Ann | P1 | 60 | [2016-01-01,2016-09-01)
Sam | P1 | 40 | [2016-06-01,2016-12-01)
Joe | P2 | 80 | [2016-03-01,2016-06-01)
(3 rows)

QUERY1: At each time point, what is the number of employees assigned
to projects?

First, we use NORMALIZE to adjust the ranges so that they can be
aggregated as usual by using grouping on the adjusted timestamp t:

SELECT *
FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted;

empl | proj | pay | t
------+------+-----+-------------------------
Ann | P1 | 60 | [2016-01-01,2016-03-01)
Ann | P1 | 60 | [2016-03-01,2016-06-01)
Ann | P1 | 60 | [2016-06-01,2016-09-01)
Sam | P1 | 40 | [2016-06-01,2016-09-01)
Sam | P1 | 40 | [2016-09-01,2016-12-01)
Joe | P2 | 80 | [2016-03-01,2016-06-01)
(6 rows)

Observe that the time ranges have been adjusted, and it is now trivial
to compute the count of employees at each time point by grouping on t:

SELECT COUNT(*), t
FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted
GROUP BY t;

count | t
-------+-------------------------
1 | [2016-01-01,2016-03-01)
2 | [2016-03-01,2016-06-01)
2 | [2016-06-01,2016-09-01)
1 | [2016-09-01,2016-12-01)
(4 rows)

QUERY2: At each time point, what is the number of employees assigned
to EACH project?

SELECT *
FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted;

empl | proj | pay | t
------+------+-----+-------------------------
Ann | P1 | 60 | [2016-01-01,2016-06-01)
Ann | P1 | 60 | [2016-06-01,2016-09-01)
Sam | P1 | 40 | [2016-06-01,2016-09-01)
Sam | P1 | 40 | [2016-09-01,2016-12-01)
Joe | P2 | 80 | [2016-03-01,2016-06-01)
(5 rows)

and

SELECT COUNT(*), proj, t
FROM (projects p1 NORMALIZE projects p2 USING(proj) WITH(t,t)) p_adjusted
GROUP BY proj, t;

count | proj | t
-------+------+-------------------------
1 | P2 | [2016-03-01,2016-06-01)
1 | P1 | [2016-01-01,2016-06-01)
2 | P1 | [2016-06-01,2016-09-01)
1 | P1 | [2016-09-01,2016-12-01)
(4 rows)

-------------------------------------------------------------------------------
EXAMPLE FOR ANTI-JOIN
-------------------------------------------------------------------------------

QUERY3: At each time point, deterime the employees with the highest
salary.

Without time this query is answered with a NOT EXISTS subquery. With
time periods everything remains, but the time ranges must be adjusted
first:

SELECT *
FROM (projects p1 ALIGN projects p2 ON p1.pay < p2.pay WITH (t,t)) p1_adjusted
WHERE NOT EXISTS (
SELECT *
FROM (projects p2 ALIGN projects p1 ON p1.pay < p2.pay WITH (t,t)) p2_adjusted
WHERE p1_adjusted.pay < p2_adjusted.pay
AND p1_adjusted.t = p2_adjusted.t );

empl | proj | pay | t
------+------+-----+-------------------------
Ann | P1 | 60 | [2016-01-01,2016-03-01)
Ann | P1 | 60 | [2016-06-01,2016-09-01)
Sam | P1 | 40 | [2016-09-01,2016-12-01)
Joe | P2 | 80 | [2016-03-01,2016-06-01)
(4 rows)

-------------------------------------------------------------------------------
EXAMPLE FOR JOINS
-------------------------------------------------------------------------------

CREATE TABLE managers (mgr VARCHAR, proj VARCHAR, t DATERANGE);
INSERT INTO managers VALUES
('Tom', 'P1', DATERANGE('2016-09-01', '2016-12-01')),
('Frank', 'P2', DATERANGE('2016-01-01', '2016-12-01'));

TABLE managers;

mgr | proj | t
-------+------+-------------------------
Tom | P1 | [2016-09-01,2016-12-01)
Frank | P2 | [2016-01-01,2016-12-01)
(2 rows)

QUERY4: At each time point, determine employees and their managers
(including the periods with no employee or no manager).

WITH p_ext AS (SELECT *, t u FROM projects),
m_ext AS (SELECT *, t v FROM managers)
SELECT proj, empl, mgr, t
FROM (p_ext ALIGN m_ext ON m_ext.proj = p_ext.proj WITH (t,t)) p_adjusted
FULL OUTER JOIN
(m_ext ALIGN p_ext ON m_ext.proj = p_ext.proj WITH (t,t)) m_adjusted
USING (proj, t)
WHERE t = u * v OR v IS NULL OR u IS NULL;

proj | empl | mgr | t
------+------+-------+-------------------------
P1 | Ann | | [2016-01-01,2016-09-01)
P1 | Sam | | [2016-06-01,2016-09-01)
P1 | Sam | Tom | [2016-09-01,2016-12-01)
P2 | | Frank | [2016-01-01,2016-03-01)
P2 | Joe | Frank | [2016-03-01,2016-06-01)
P2 | | Frank | [2016-06-01,2016-12-01)
(6 rows)

-------------------------------------------------------------------------------
IMPLEMENTATION
-------------------------------------------------------------------------------

Our implementation approach is to use two operators that take care of
adjusting the ranges in such a way that afterwards the corresponding
nontemporal operator can be used to compute the result:
input --> {NORMALIZE/ALIGN} --> nontemporal operator --> result.

The two operators (NORMALIZE and ALIGN) in this prototype are rewritten into
subqueries, which then serve as input for a new executor function
ExecTemporalAdjustment. The executor function takes care of the splitting of
the range types.

For NORMALIZE the tuples' ranges need to be split into all sub-ranges
according to all matching ranges of the second relation. For this we
create a subquery that first joins one relation with the range
boundaries of the other and then sorts the result. The executor
function splits the ranges in a sweep-line based manner.

For ALIGN the tuples' ranges must be split into all intersections and
differences with the other relation according to the join condition.
For this we create a subquery that first joins the two relations and
then sorts the result. The executor function splits the ranges
accordingly in a sweep-line based manner.

After the splitting it is possible to only use equality (GROUP BY,
DISTINCT, JOIN CONDITION, ...) on the range types to get the intended
result.

The subquery is 'visible' with the EXPLAIN command.

Range types need to be of half-open, i.e., [lower, upper).

Unbound ranges (Infinity), NaN, and NULL values (of ranges and range
bounds) are not yet supported.

-------------------------------------------------------------------------------

We are grateful for any feedback.

Best regards,

Anton, Johann, Michael, Peter

Attachment Content-Type Size
tpg_primitives_out_v1.patch application/octet-stream 116.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2016-07-22 13:33:32 Re: Rethinking TupleTableSlot deforming
Previous Message Guangzhou Zhang 2016-07-22 10:36:45 Re: Unexpected memory usage for repeated inserts within plpgsql function