Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: jchampion(at)timescale(dot)com
Cc: er(at)xs4all(dot)nl, vik(at)postgresfriends(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2023-09-08 03:54:47
Message-ID: 20230908.125447.478562939918621603.t-ishii@sranhm.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> Hello!
>
>> (1) I completely changed the pattern matching engine so that it
>> performs backtracking. Now the engine evaluates all pattern elements
>> defined in PATTER against each row, saving matched pattern variables
>> in a string per row. For example if the pattern element A and B
>> evaluated to true, a string "AB" is created for current row.
>
> If I understand correctly, this strategy assumes that one row's
> membership in a pattern variable is independent of the other rows'
> membership. But I don't think that's necessarily true:
>
> DEFINE
> A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
> ...

But:

UP AS price > PREV(price)

also depends on previous row, no? Can you please elaborate how your
example could break current implementation? I cannot test it because
CLASSIFIER is not implemented yet.

>> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
>> in nodeWindowAggs.c for more details. For now I use a naive depth
>> first search and probably there is a lot of rooms for optimization
>> (for example rewriting it without using
>> recursion).
>
> The depth-first match is doing a lot of subtle work here. For example, with
>
> PATTERN ( A+ B+ )
> DEFINE A AS TRUE,
> B AS TRUE
>
> (i.e. all rows match both variables), and three rows in the partition,
> our candidates will be tried in the order
>
> aaa
> aab
> aba
> abb
> ...
> bbb
>
> The only possible matches against our regex `^a+b+` are "aab" and "abb",
> and that order matches the preferment order, so it's fine. But it's easy
> to come up with a pattern where that's the wrong order, like
>
> PATTERN ( A+ (B|A)+ )
>
> Now "aaa" will be considered before "aab", which isn't correct.

Can you explain a little bit more? I think 'aaa' matches a regular
expression 'a+(b|a)+' and should be no problem before "aab" is
considered.

> Similarly, the assumption that we want to match the longest string only
> works because we don't allow alternation yet.

Can you please clarify more on this?

> Cool, I will give this piece some more thought. Do you mind if I try to
> add some more complicated pattern quantifiers to stress the
> architecture, or would you prefer to tackle that later? Just alternation
> by itself will open up a world of corner cases.

Do you mean you want to provide a better patch for the pattern
matching part? That will be helpfull. Because I am currently working
on the aggregation part and have no time to do it. However, the
aggregation work affects the v5 patch: it needs a refactoring. So can
you wait until I release v6 patch? I hope it will be released in two
weeks or so.

> On 8/9/23 01:41, Tatsuo Ishii wrote:
>> - I found a bug with pattern matching code. It creates a string for
>> subsequent regular expression matching. It uses the initial letter
>> of each define variable name. For example, if the varname is "foo",
>> then "f" is used. Obviously this makes trouble if we have two or
>> more variables starting with same "f" (e.g. "food"). To fix this, I
>> assign [a-z] to each variable instead of its initial letter. However
>> this way limits us not to have more than 26 variables. I hope 26 is
>> enough for most use cases.
>
> There are still plenty of alphanumerics left that could be assigned...
>
> But I'm wondering if we might want to just implement the NFA directly?
> The current implementation's Cartesian explosion can probably be pruned
> aggressively, but replaying the entire regex match once for every
> backtracked step will still duplicate a lot of work.

Not sure if you mean implementing new regular expression engine
besides src/backend/regexp. I am afraid it's not a trivial work. The
current regexp code consists of over 10k lines. What do you think?

> I've attached another test case; it looks like last_value() is depending
> on some sort of side effect from either first_value() or nth_value(). I
> know the window frame itself is still under construction, so apologies
> if this is an expected failure.

Thanks. Fortunately current code which I am working passes the new
test. I will include it in the next (v6) patch.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-09-08 04:07:13 Re: Eager page freeze criteria clarification
Previous Message Thomas Munro 2023-09-08 03:48:43 Re: old_snapshot_threshold bottleneck on replica