Row pattern recognition

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Row pattern recognition
Date: 2023-06-25 12:05:09
Message-ID: 20230625.210509.1276733411677577841.t-ishii@sranhm.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is a PoC patch to implement "Row pattern recognition" (RPR)
in SQL:2016 (I know SQL:2023 is already out, but I don't have access
to it). Actually SQL:2016 defines RPR in two places[1]:

Feature R010, “Row pattern recognition: FROM clause”
Feature R020, “Row pattern recognition: WINDOW clause”

The patch includes partial support for R020 part.

- What is RPR?

RPR provides a way to search series of data using regular expression
patterns. Suppose you have a stock database.

CREATE TABLE stock (
company TEXT,
tdate DATE,
price BIGINT);

You want to find a "V-shaped" pattern: i.e. price goes down for 1 or
more days, then goes up for 1 or more days. If we try to implement the
query in PostgreSQL, it could be quite complex and inefficient.

RPR provides convenient way to implement the query.

SELECT company, tdate, price, rpr(price) OVER w FROM stock
WINDOW w AS (
PARTITION BY company
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);

"PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3
"row pattern variables" namely START, UP and DOWN. They are associated
with logical expressions namely "TRUE", "price > PREV(price)", and
"price < PREV(price)". Note that "PREV" function returns price column
in the previous row. So, UP is true if price is higher than previous
day. On the other hand, DOWN is true if price is lower than previous
day. PATTERN uses the row pattern variables to create a necessary
pattern. In this case, the first row is always match because START is
always true, and second or more rows match with "UP" ('+' is a regular
expression representing one or more), and subsequent rows match with
"DOWN".

Here is the sample output.

company | tdate | price | rpr
----------+------------+-------+------
company1 | 2023-07-01 | 100 |
company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150
company1 | 2023-07-03 | 150 | 150 -- 150->140->150
company1 | 2023-07-04 | 140 |
company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130
company1 | 2023-07-06 | 90 |
company1 | 2023-07-07 | 110 |
company1 | 2023-07-08 | 130 |
company1 | 2023-07-09 | 120 |
company1 | 2023-07-10 | 130 |

rpr shows the first row if all the patterns are satisfied. In the
example above 200, 150, 150 are the cases. Other rows are shown as
NULL. For example, on 2023-07-02 price starts with 200, then goes down
to 150 then 140 but goes up 150 next day.

As far as I know, only Oracle implements RPR (only R010. R020 is not
implemented) among OSS/commercial RDBMSs. There are a few DWH software
having RPR. According to [2] they are Snowflake and MS Stream
Analytics. It seems Trino is another one[3].

- Note about the patch

The current implementation is not only a subset of the standard, but
is different from it in some places. The example above is written as
follows according to the standard:

SELECT company, tdate, startprice OVER w FROM stock
WINDOW w AS (
PARTITION BY company
MEASURES
START.price AS startprice
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
PATTERN (START DOWN+ UP+)
DEFINE
START AS TRUE,
UP AS UP.price > PREV(UP.price),
DOWN AS DOWN.price < PREV(DOWN.price)
);

Notice that rpr(price) is written as START.price and startprice in the
standard. MEASURES defines variable names used in the target list used
with "OVER window". As OVER only allows functions in PostgreSQL, I had
to make up a window function "rpr" which performs the row pattern
recognition task. I was not able to find a way to implement
expressions like START.price (START is not a table alias). Any
suggestion is greatly appreciated.

The differences from the standard include:

o MEASURES is not supported
o SUBSET is not supported
o FIRST, LAST, CLASSIFIER are not supported
o PREV/NEXT in the standard accept more complex arguments
o Regular expressions other than "+" are not supported
o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is
not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the
standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I
have a plan to implement AFTER MATCH SKIP PAST LAST ROW though.
o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified)
o Aggregate functions associated with window clause using RPR do not respect RPR

It seems RPR in the standard is quite complex. I think we can start
with a small subset of RPR then we could gradually enhance the
implementation.

Comments and suggestions are welcome.

[1] https://sqlperformance.com/2019/04/t-sql-queries/row-pattern-recognition-in-sql
[2] https://link.springer.com/article/10.1007/s13222-022-00404-3
[3] https://trino.io/docs/current/sql/pattern-recognition-in-window.html
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Attachment Content-Type Size
v1-0001-Row-pattern-recognition-patch-for-raw-parser.patch text/x-patch 21.1 KB
v1-0002-Row-pattern-recognition-patch-parse-analysis.patch text/x-patch 8.8 KB
v1-0003-Row-pattern-recognition-patch-planner.patch text/x-patch 3.1 KB
v1-0004-Row-pattern-recognition-patch-executor.patch text/x-patch 21.0 KB
v1-0005-Row-pattern-recognition-patch-docs.patch text/x-patch 7.7 KB
v1-0006-Row-pattern-recognition-patch-tests.patch text/x-patch 14.9 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2023-06-25 12:28:36 Speeding Up Bitmapset
Previous Message Joel Jacobson 2023-06-25 09:42:52 Re: Do we want a hashset type?