Re: Row pattern recognition

From: Vik Fearing <vik(at)postgresfriends(dot)org>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, jchampion(at)timescale(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-07-28 09:21:25
Message-ID: c9ebc3d0-c3d1-e8eb-4a57-0ec099cbda17@postgresfriends.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/26/23 14:21, Tatsuo Ishii wrote:
> Attached is the v3 patch. In this patch following changes are made.

Excellent. Thanks!

A few quick comments:

- PERMUTE is still misspelled as PREMUTE

- PATTERN variables do not have to exist in the DEFINE clause. They are
considered TRUE if not present.

> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTER against each row, saving matched pattern variables
> in a string per row. For example if the pattern element A and B
> evaluated to true, a string "AB" is created for current row.
>
> This continues until all pattern matching fails or encounters the end
> of full window frame/partition. After that, the pattern matching
> engine creates all possible "pattern strings" and apply the regular
> expression matching to each. For example if we have row 0 = "AB" row 1
> = "C", possible pattern strings are: "AC" and "BC".
>
> If it matches, the length of matching substring is saved. After all
> possible trials are done, the longest matching substring is chosen and
> it becomes the width (number of rows) in the reduced window frame.
>
> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
> in nodeWindowAggs.c for more details. For now I use a naive depth
> first search and probably there is a lot of rooms for optimization
> (for example rewriting it without using
> recursion). Suggestions/patches are welcome.

My own reviews will only focus on correctness for now. Once we get a
good set of regression tests all passing, I will focus more on
optimization. Of course, others might want to review the performance now.

> Vik Fearing wrote:
>> I strongly disagree with this. Window function do not need to know
>> how the frame is defined, and indeed they should not.
>> WinGetFuncArgInFrame should answer yes or no and the window function
>> just works on that. Otherwise we will get extension (and possibly even
>> core) functions that don't handle the frame properly.
>
> So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that
> those window functions are not needed to be changed.
>
> (3) Window function rpr was removed. We can use first_value instead.

Excellent.

> (4) Remaining tasks/issues.
>
> - For now I disable WinSetMarkPosition because RPR often needs to
> access a row before the mark is set. We need to fix this in the
> future.

Noted, and agreed.

> - I am working on making window aggregates RPR aware now. The
> implementation is in progress and far from completeness. An example
> is below. I think row 2, 3, 4 of "count" column should be NULL
> instead of 3, 2, 0, 0. Same thing can be said to other
> rows. Probably this is an effect of moving aggregate but I still
> studying the window aggregation code.

This tells me again that RPR is not being run in the right place. All
windowed aggregates and frame-level window functions should Just Work
with no modification.

> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
> WINDOW w AS (
> PARTITION BY company
> ORDER BY tdate
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> AFTER MATCH SKIP PAST LAST ROW
> INITIAL
> PATTERN (START UP+ DOWN+)
> DEFINE
> START AS TRUE,
> UP AS price > PREV(price),
> DOWN AS price < PREV(price)
> );
> company | tdate | first_value | count
> ----------+------------+-------------+-------
> company1 | 2023-07-01 | 100 | 4
> company1 | 2023-07-02 | | 3
> company1 | 2023-07-03 | | 2
> company1 | 2023-07-04 | | 0
> company1 | 2023-07-05 | | 0
> company1 | 2023-07-06 | 90 | 4
> company1 | 2023-07-07 | | 3
> company1 | 2023-07-08 | | 2
> company1 | 2023-07-09 | | 0
> company1 | 2023-07-10 | | 0
> company2 | 2023-07-01 | 50 | 4
> company2 | 2023-07-02 | | 3
> company2 | 2023-07-03 | | 2
> company2 | 2023-07-04 | | 0
> company2 | 2023-07-05 | | 0
> company2 | 2023-07-06 | 60 | 4
> company2 | 2023-07-07 | | 3
> company2 | 2023-07-08 | | 2
> company2 | 2023-07-09 | | 0
> company2 | 2023-07-10 | | 0

In this scenario, row 1's frame is the first 5 rows and specified SKIP
PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are
skipped) and the result of the outer count should be 0 for all of them
because there are no rows in the frame.

When we get to adding count in the MEASURES clause, there will be a
difference between no match and empty match, but that does not apply here.

> I am going to add this thread to CommitFest and plan to add both of
> you as reviewers. Thanks in advance.

My pleasure. Thank you for working on this difficult feature.
--
Vik Fearing

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-07-28 09:42:02 Re: logical decoding and replication of sequences, take 2
Previous Message Michael Paquier 2023-07-28 09:19:22 Re: Support worker_spi to execute the function dynamically.