Re: Row pattern recognition

From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: jchampion(at)timescale(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, vik(at)postgresfriends(dot)org
Subject: Re: Row pattern recognition
Date: 2023-07-21 06:16:48
Message-ID: 20230721.151648.412762379013769790.t-ishii@sranhm.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

> Hi Ishii-san,
>
> On 7/19/23 22:15, Tatsuo Ishii wrote:
>> Currently my patch has a limitation for the sake of simple
>> implementation: a pattern like "A+" is parsed and analyzed in the raw
>> parser. This makes subsequent process much easier because the pattern
>> element variable (in this case "A") and the quantifier (in this case
>> "+") is already identified by the raw parser. However there are much
>> more cases are allowed in the standard as you already pointed out. For
>> those cases probably we should give up to parse PATTERN items in the
>> raw parser, and instead the raw parser just accepts the elements as
>> Sconst?
>
> Is there a concern that the PATTERN grammar can't be represented in
> Bison? I thought it was all context-free...

I don't know at this point. I think context-free is not enough to be
repsented in Bison. The grammer also needs to be LALR(1). Moreover,
adding the grammer to existing parser may generate shift/reduce
errors.

> Or is the concern that the
> parse tree of the pattern is hard to feed into a regex engine?

One small concern is how to convert pattern variables to regex
expression which our regex enginer understands. Suppose,

PATTERN UP+

Currently I convert "UP+" to "U+" so that it can be fed to the regexp
engine. In order to do that, we need to know which part of the pattern
(UP+) is the pattern variable ("UP"). For "UP+" it's quite easy. But
for more complex regular expressions it would be not, unless PATTERN
grammer can be analyzed by our parser to know which part is the
pattern variable.

>> Any comments, especially on the PREV/NEXT implementation part is
>> welcome. Currently the DEFINE expression like "price > PREV(price)" is
>> prepared in ExecInitWindowAgg using ExecInitExpr,tweaking var->varno
>> in Var node so that PREV uses OUTER_VAR, NEXT uses INNER_VAR. Then
>> evaluate the expression in ExecWindowAgg using ExecEvalExpr, setting
>> previous row TupleSlot to ExprContext->ecxt_outertuple, and next row
>> TupleSlot to ExprContext->ecxt_innertuple. I think this is temporary
>> hack and should be gotten ride of before v1 is committed. Better idea?
>
> I'm not familiar enough with this code yet to offer very concrete
> suggestions, sorry... But at some point in the future, we need to be
> able to skip forward and backward from arbitrary points, like
>
> DEFINE B AS B.price > PREV(FIRST(A.price), 3)
>
> so there won't be just one pair of "previous and next" tuples.

Yes, I know.

> Maybe
> that can help clarify the design? It feels like it'll need to eventually
> be a "real" function that operates on the window state, even if it
> doesn't support all of the possible complexities in v1.

Unfortunately an window function can not call other window functions.

> Taking a closer look at the regex engine:
>
> It looks like the + qualifier has trouble when it meets the end of the
> frame. For instance, try removing the last row of the 'stock' table in
> rpr.sql; some of the final matches will disappear unexpectedly. Or try a
> pattern like
>
> PATTERN ( a+ )
> DEFINE a AS TRUE
>
> which doesn't seem to match anything in my testing.
>
> There's also the issue of backtracking in the face of reclassification,
> as I think Vik was alluding to upthread. The pattern
>
> PATTERN ( a+ b+ )
> DEFINE a AS col = 2,
> b AS col = 2
>
> doesn't match a sequence of values (2 2 ...) with the current
> implementation, even with a dummy row at the end to avoid the
> end-of-frame bug.
>
> (I've attached two failing tests against v2, to hopefully better
> illustrate, along with what I _think_ should be the correct results.)

Thanks. I will look into this.

> I'm not quite understanding the match loop in evaluate_pattern(). It
> looks like we're building up a string to pass to the regex engine, but
> by the we call regexp_instr, don't we already know whether or not the
> pattern will match based on the expression evaluation we've done?

For "+" yes. But for more complex regular expression like '{n}', we
need to call our regexp engine to check if the pattern matches.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Gurjeet Singh 2023-07-21 06:19:54 Re: MERGE ... RETURNING
Previous Message Junwang Zhao 2023-07-21 06:05:56 table_open/table_close with different lock mode