Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: vik(at)postgresfriends(dot)org, jchampion(at)timescale(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-07-26 12:21:34
Message-ID: 20230726.212134.193373300820157886.t-ishii@sranhm.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached is the v3 patch. In this patch following changes are made.

(1) I completely changed the pattern matching engine so that it
performs backtracking. Now the engine evaluates all pattern elements
defined in PATTER against each row, saving matched pattern variables
in a string per row. For example if the pattern element A and B
evaluated to true, a string "AB" is created for current row.

This continues until all pattern matching fails or encounters the end
of full window frame/partition. After that, the pattern matching
engine creates all possible "pattern strings" and apply the regular
expression matching to each. For example if we have row 0 = "AB" row 1
= "C", possible pattern strings are: "AC" and "BC".

If it matches, the length of matching substring is saved. After all
possible trials are done, the longest matching substring is chosen and
it becomes the width (number of rows) in the reduced window frame.

See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
in nodeWindowAggs.c for more details. For now I use a naive depth
first search and probably there is a lot of rooms for optimization
(for example rewriting it without using
recursion). Suggestions/patches are welcome.

Jacob Champion wrote:
> It looks like the + qualifier has trouble when it meets the end of the
> frame. For instance, try removing the last row of the 'stock' table in
> rpr.sql; some of the final matches will disappear unexpectedly. Or try a
> pattern like
>
> PATTERN ( a+ )
> DEFINE a AS TRUE
>
> which doesn't seem to match anything in my testing.
>
> There's also the issue of backtracking in the face of reclassification,
> as I think Vik was alluding to upthread. The pattern
>
> PATTERN ( a+ b+ )
> DEFINE a AS col = 2,
> b AS col = 2

With the new engine, cases above do not fail anymore. See new
regression test cases. Thanks for providing valuable test cases!

(2) Make window functions RPR aware. Now first_value, last_value, and
nth_value recognize RPR (maybe first_value do not need any change?)

Vik Fearing wrote:
> I strongly disagree with this. Window function do not need to know
> how the frame is defined, and indeed they should not.
> WinGetFuncArgInFrame should answer yes or no and the window function
> just works on that. Otherwise we will get extension (and possibly even
> core) functions that don't handle the frame properly.

So I moved row_is_in_reduce_frame into WinGetFuncArgInFrame so that
those window functions are not needed to be changed.

(3) Window function rpr was removed. We can use first_value instead.

(4) Remaining tasks/issues.

- For now I disable WinSetMarkPosition because RPR often needs to
access a row before the mark is set. We need to fix this in the
future.

- I am working on making window aggregates RPR aware now. The
implementation is in progress and far from completeness. An example
is below. I think row 2, 3, 4 of "count" column should be NULL
instead of 3, 2, 0, 0. Same thing can be said to other
rows. Probably this is an effect of moving aggregate but I still
studying the window aggregation code.

SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
WINDOW w AS (
PARTITION BY company
ORDER BY tdate
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP+ DOWN+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);
company | tdate | first_value | count
----------+------------+-------------+-------
company1 | 2023-07-01 | 100 | 4
company1 | 2023-07-02 | | 3
company1 | 2023-07-03 | | 2
company1 | 2023-07-04 | | 0
company1 | 2023-07-05 | | 0
company1 | 2023-07-06 | 90 | 4
company1 | 2023-07-07 | | 3
company1 | 2023-07-08 | | 2
company1 | 2023-07-09 | | 0
company1 | 2023-07-10 | | 0
company2 | 2023-07-01 | 50 | 4
company2 | 2023-07-02 | | 3
company2 | 2023-07-03 | | 2
company2 | 2023-07-04 | | 0
company2 | 2023-07-05 | | 0
company2 | 2023-07-06 | 60 | 4
company2 | 2023-07-07 | | 3
company2 | 2023-07-08 | | 2
company2 | 2023-07-09 | | 0
company2 | 2023-07-10 | | 0

- If attributes appearing in DEFINE are not used in the target list, query fails.

SELECT company, tdate, count(*) OVER w FROM stock
WINDOW w AS (
PARTITION BY company
ORDER BY tdate
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP+ DOWN+)
DEFINE
START AS TRUE,
UP AS price > PREV(price),
DOWN AS price < PREV(price)
);
ERROR: attribute number 3 exceeds number of columns 2

This is because attributes appearing in DEFINE are not added to the
target list. I am looking for way to teach planner to add attributes
appearing in DEFINE.

I am going to add this thread to CommitFest and plan to add both of
you as reviewers. Thanks in advance.
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Attachment Content-Type Size
v3-0001-Row-pattern-recognition-patch-for-raw-parser.patch text/x-patch 21.2 KB
v3-0002-Row-pattern-recognition-patch-parse-analysis.patch text/x-patch 8.6 KB
v3-0003-Row-pattern-recognition-patch-planner.patch text/x-patch 3.6 KB
v3-0004-Row-pattern-recognition-patch-executor.patch text/x-patch 31.5 KB
v3-0005-Row-pattern-recognition-patch-docs.patch text/x-patch 7.9 KB
v3-0006-Row-pattern-recognition-patch-tests.patch text/x-patch 24.6 KB
v3-0007-Allow-to-print-raw-parse-tree.patch text/x-patch 749 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-07-26 12:25:17 Re: Synchronizing slots from primary to standby
Previous Message Nikita Malakhov 2023-07-26 11:40:07 Re: POC: Extension for adding distributed tracing - pg_tracing