Re: Row pattern recognition

From: Vik Fearing <vik(at)postgresfriends(dot)org>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-06-25 21:08:35
Message-ID: fdf57a8d-65d5-e8a5-361f-3095d6899099@postgresfriends.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/25/23 14:05, Tatsuo Ishii wrote:
> Attached is a PoC patch to implement "Row pattern recognition" (RPR)
> in SQL:2016 (I know SQL:2023 is already out, but I don't have access
> to it). Actually SQL:2016 defines RPR in two places[1]:
>
> Feature R010, “Row pattern recognition: FROM clause”
> Feature R020, “Row pattern recognition: WINDOW clause”
>
> The patch includes partial support for R020 part.

I have been dreaming of and lobbying for someone to pick up this
feature. I will be sure to review it from a standards perspective and
will try my best to help with the technical aspect, but I am not sure to
have the qualifications for that.

THANK YOU!

> (I know SQL:2023 is already out, but I don't have access to it)

If you can, try to get ISO/IEC 19075-5 which is a guide to RPR instead
of just its technical specification.

https://www.iso.org/standard/78936.html

> - What is RPR?
>
> RPR provides a way to search series of data using regular expression
> patterns. Suppose you have a stock database.
>
> CREATE TABLE stock (
> company TEXT,
> tdate DATE,
> price BIGINT);
>
> You want to find a "V-shaped" pattern: i.e. price goes down for 1 or
> more days, then goes up for 1 or more days. If we try to implement the
> query in PostgreSQL, it could be quite complex and inefficient.
>
> RPR provides convenient way to implement the query.
>
> SELECT company, tdate, price, rpr(price) OVER w FROM stock
> WINDOW w AS (
> PARTITION BY company
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> PATTERN (START DOWN+ UP+)
> DEFINE
> START AS TRUE,
> UP AS price > PREV(price),
> DOWN AS price < PREV(price)
> );
>
> "PATTERN" and "DEFINE" are the key clauses of RPR. DEFINE defines 3
> "row pattern variables" namely START, UP and DOWN. They are associated
> with logical expressions namely "TRUE", "price > PREV(price)", and
> "price < PREV(price)". Note that "PREV" function returns price column
> in the previous row. So, UP is true if price is higher than previous
> day. On the other hand, DOWN is true if price is lower than previous
> day. PATTERN uses the row pattern variables to create a necessary
> pattern. In this case, the first row is always match because START is
> always true, and second or more rows match with "UP" ('+' is a regular
> expression representing one or more), and subsequent rows match with
> "DOWN".
>
> Here is the sample output.
>
> company | tdate | price | rpr
> ----------+------------+-------+------
> company1 | 2023-07-01 | 100 |
> company1 | 2023-07-02 | 200 | 200 -- 200->150->140->150
> company1 | 2023-07-03 | 150 | 150 -- 150->140->150
> company1 | 2023-07-04 | 140 |
> company1 | 2023-07-05 | 150 | 150 -- 150->90->110->130
> company1 | 2023-07-06 | 90 |
> company1 | 2023-07-07 | 110 |
> company1 | 2023-07-08 | 130 |
> company1 | 2023-07-09 | 120 |
> company1 | 2023-07-10 | 130 |
>
> rpr shows the first row if all the patterns are satisfied. In the
> example above 200, 150, 150 are the cases. Other rows are shown as
> NULL. For example, on 2023-07-02 price starts with 200, then goes down
> to 150 then 140 but goes up 150 next day.

I don't understand this. RPR in a window specification limits the
window to the matched rows, so this looks like your rpr() function is
just the regular first_value() window function that we already have?

> As far as I know, only Oracle implements RPR (only R010. R020 is not
> implemented) among OSS/commercial RDBMSs. There are a few DWH software
> having RPR. According to [2] they are Snowflake and MS Stream
> Analytics. It seems Trino is another one[3].

I thought DuckDB had it already, but it looks like I was wrong.

> - Note about the patch
>
> The current implementation is not only a subset of the standard, but
> is different from it in some places. The example above is written as
> follows according to the standard:
>
> SELECT company, tdate, startprice OVER w FROM stock
> WINDOW w AS (
> PARTITION BY company
> MEASURES
> START.price AS startprice
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> PATTERN (START DOWN+ UP+)
> DEFINE
> START AS TRUE,
> UP AS UP.price > PREV(UP.price),
> DOWN AS DOWN.price < PREV(DOWN.price)
> );
>
> Notice that rpr(price) is written as START.price and startprice in the
> standard. MEASURES defines variable names used in the target list used
> with "OVER window". As OVER only allows functions in PostgreSQL, I had
> to make up a window function "rpr" which performs the row pattern
> recognition task. I was not able to find a way to implement
> expressions like START.price (START is not a table alias). Any
> suggestion is greatly appreciated.

As in your example, you cannot have START.price outside of the window
specification; it can only go in the MEASURES clause. Only startprice
is allowed outside and it gets its qualification from the OVER. Using
w.startprice might have been better but that would require window names
to be in the same namespace as range tables.

This currently works in Postgres:

SELECT RANK() OVER w
FROM (VALUES (1)) AS w (x)
WINDOW w AS (ORDER BY w.x);

> The differences from the standard include:
>
> o MEASURES is not supported
> o FIRST, LAST, CLASSIFIER are not supported
> o PREV/NEXT in the standard accept more complex arguments
> o INITIAL or SEEK are not supported ((behaves as if INITIAL is specified)

Okay, for now.

> o SUBSET is not supported

Is this because you haven't done it yet, or because you ran into
problems trying to do it?

> o Regular expressions other than "+" are not supported

This is what I had a hard time imagining how to do while thinking about
it. The grammar is so different here and we allow many more operators
(like "?" which is also the standard parameter symbol). People more
expert than me will have to help here.

> o Only AFTER MATCH SKIP TO NEXT ROW is supported (if AFTER MATCH is
> not specified, AFTER MATCH SKIP TO NEXT ROW is assumed. In the
> standard AFTER MATCH SKIP PAST LAST ROW is assumed in this case). I
> have a plan to implement AFTER MATCH SKIP PAST LAST ROW though.

In this case, we should require the user to specify AFTER MATCH SKIP TO
NEXT ROW so that behavior doesn't change when we implement the standard
default. (Your patch might do this already.)

> o Aggregate functions associated with window clause using RPR do not respect RPR

I do not understand what this means.

> It seems RPR in the standard is quite complex. I think we can start
> with a small subset of RPR then we could gradually enhance the
> implementation.

I have no problem with that as long as we don't paint ourselves into a
corner.

> Comments and suggestions are welcome.

I have not looked at the patch yet, but is the reason for doing R020
before R010 because you haven't done the MEASURES clause yet?

In any case, I will be watching this with a close eye, and I am eager to
help in any way I can.
--
Vik Fearing

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2023-06-25 23:39:50 Re: eqjoinsel_semi still sucks ...
Previous Message David Rowley 2023-06-25 20:49:04 Re: Speeding Up Bitmapset