Re: Row pattern recognition

From: Jacob Champion <jchampion(at)timescale(dot)com>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Vik Fearing <vik(at)postgresfriends(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Row pattern recognition
Date: 2023-09-08 00:00:07
Message-ID: fcf6dcfa-5a61-00ea-4311-d3d003b790ed@timescale.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello!

> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTER against each row, saving matched pattern variables
> in a string per row. For example if the pattern element A and B
> evaluated to true, a string "AB" is created for current row.

If I understand correctly, this strategy assumes that one row's
membership in a pattern variable is independent of the other rows'
membership. But I don't think that's necessarily true:

DEFINE
A AS PREV(CLASSIFIER()) IS DISTINCT FROM 'A',
...

> See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
> in nodeWindowAggs.c for more details. For now I use a naive depth
> first search and probably there is a lot of rooms for optimization
> (for example rewriting it without using
> recursion).

The depth-first match is doing a lot of subtle work here. For example, with

PATTERN ( A+ B+ )
DEFINE A AS TRUE,
B AS TRUE

(i.e. all rows match both variables), and three rows in the partition,
our candidates will be tried in the order

aaa
aab
aba
abb
...
bbb

The only possible matches against our regex `^a+b+` are "aab" and "abb",
and that order matches the preferment order, so it's fine. But it's easy
to come up with a pattern where that's the wrong order, like

PATTERN ( A+ (B|A)+ )

Now "aaa" will be considered before "aab", which isn't correct.

Similarly, the assumption that we want to match the longest string only
works because we don't allow alternation yet.

> Suggestions/patches are welcome.

Cool, I will give this piece some more thought. Do you mind if I try to
add some more complicated pattern quantifiers to stress the
architecture, or would you prefer to tackle that later? Just alternation
by itself will open up a world of corner cases.

> With the new engine, cases above do not fail anymore. See new
> regression test cases. Thanks for providing valuable test cases!

You're very welcome!

On 8/9/23 01:41, Tatsuo Ishii wrote:
> - I found a bug with pattern matching code. It creates a string for
> subsequent regular expression matching. It uses the initial letter
> of each define variable name. For example, if the varname is "foo",
> then "f" is used. Obviously this makes trouble if we have two or
> more variables starting with same "f" (e.g. "food"). To fix this, I
> assign [a-z] to each variable instead of its initial letter. However
> this way limits us not to have more than 26 variables. I hope 26 is
> enough for most use cases.

There are still plenty of alphanumerics left that could be assigned...

But I'm wondering if we might want to just implement the NFA directly?
The current implementation's Cartesian explosion can probably be pruned
aggressively, but replaying the entire regex match once for every
backtracked step will still duplicate a lot of work.

--

I've attached another test case; it looks like last_value() is depending
on some sort of side effect from either first_value() or nth_value(). I
know the window frame itself is still under construction, so apologies
if this is an expected failure.

Thanks!
--Jacob

Attachment Content-Type Size
test-last_value.diff.txt text/plain 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-09-08 00:07:13 Re: Impact of checkpointer during pg_upgrade
Previous Message Michael Paquier 2023-09-07 23:48:12 Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?