Re: Row pattern recognition

From: Henson Choi <assam258(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: jacob(dot)champion(at)enterprisedb(dot)com, david(dot)g(dot)johnston(at)gmail(dot)com, vik(at)postgresfriends(dot)org, er(at)xs4all(dot)nl, peter(at)eisentraut(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Row pattern recognition
Date: 2026-01-21 03:27:15
Message-ID: CAAAe_zAEg7sVM=WDwXMyE-odGmQyXSVi5ZzWgye6SupSjdMKpg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I'd like to propose an extension to the Row Pattern Recognition (RPR)
absorption mechanism to handle anchored patterns (those starting with
START or similar prefix elements).

== Background ==

The current RPR implementation includes an absorption optimization that
reduces time complexity from O(n²) to O(n) for unbounded quantifiers
like A+. When multiple match contexts are at the same ABSORBABLE element,
the context that started earlier (having matched more rows) absorbs
later-starting contexts, eliminating redundant processing.

For example, with pattern "A+ B":
- Ctx1 starts at row 0, matching A
- Ctx2 starts at row 1, matching A
- Ctx1 always has more A's → Ctx2 is absorbed (removed)
- Result: Only 1 context maintained, O(n) complexity

== The Problem ==

This absorption mechanism breaks down for anchored patterns like
"START SECOND A+ B". The PREFIX elements (START, SECOND) block
absorption because contexts at different PREFIX positions cannot
be meaningfully compared.

Without absorption, we regress to O(n²) behavior as contexts
accumulate without being eliminated.

== Proposed Solution: Original/Alternate Simultaneous Progression ==

The key insight is to track two logical "paths" within each context:

1. Original: Starts at Element 0 (PREFIX + BODY)
- can_absorb = true
- Can be used as matchedState (actual match result)

2. Alternate: Starts at Element 2 (BODY only, skipping PREFIX)
- can_be_absorbed = true
- Used only for absorption eligibility checking
- Cannot be used as matchedState

Both paths process the same row data simultaneously. The alternate
path serves as a "shadow" that indicates whether the context is
eligible for absorption.

== Absorption Conditions ==

For Ctx1 to absorb Ctx2, all conditions must be met:

1. Ctx1.can_absorb = true (original not outside pattern region)
2. Ctx2.can_be_absorbed = true (alternate still alive in BODY)
3. Ctx2's original has reached ABSORBABLE (PREFIX matching confirmed)
4. Ctx1 started before Ctx2

Note: No count comparison is needed. The alternate being alive
confirms absorption eligibility; the earlier-starting context
always wins.

== Why Wait for Ctx2's Original to Reach ABSORBABLE? ==

We cannot absorb Ctx2 immediately when it starts. We must wait
until Ctx2's original successfully passes through PREFIX elements
(START, SECOND) and reaches the ABSORBABLE point (A+). This
confirms that Ctx2's PREFIX matching succeeded before we remove it.

== Why Absorption is Safe (Common Fate) ==

Once both contexts reach the same ABSORBABLE element (A+), they
share a "common fate":

- Same element position, same future data stream
- Same choices ahead (continue A+ or transition to B)
- If Ctx1 succeeds → Ctx2 would have succeeded too (but shorter match)
- If Ctx1 fails → Ctx2 would have failed too (nothing lost)
- The only difference is in the past (rows already matched),
which doesn't affect future matching decisions

This property guarantees that absorption is safe: we never lose
a potential match by absorbing a later-starting context into an
earlier one.

== Flag Design ==

Compile-time (Element flags):
- RPR_ELEM_ABSORBABLE: Unchanged (marks WHERE comparison happens)
- RPR_ELEM_ABSORBABLE_BRANCH_PREFIX: New (PREFIX region, blocks
absorption)
- RPR_ELEM_ABSORBABLE_BRANCH_BODY: New (BODY region, allows absorption)

Runtime (State variables):
- can_absorb: "Can absorb others" (original=true, alternate=false)
- can_be_absorbed: "Can be absorbed" (PREFIX→false, BODY→true)

Both flags are monotonic: once false, they cannot become true again.

== Complexity Analysis ==

With this extension:
- Maximum concurrent contexts: PREFIX_length + 1
- For "START SECOND A+ B": max 3 contexts at any time
- Overall complexity: O(n) maintained

Example flow for "START SECOND A+ B" with data r0-r4(A,A,A,A,B):
Row 0: Ctx1 starts [1]
Row 1: Ctx2 starts (cannot absorb: PREFIX not confirmed) [2]
Row 2: Ctx3 starts [3]
Row 3: Ctx4 starts, Ctx2 absorbed (PREFIX confirmed) [3]
Row 4: Ctx1 matches B, Ctx3-5 discarded [1]

Note: This is not an immediate priority. Stabilizing the current
RPR patch comes first. I'm sharing this design early to get feedback
and ensure the approach is sound before implementation.

I have a working design document with detailed state transition
rules and examples. Happy to share more details.

Thoughts?

--
Best regards,
Henson

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2026-01-21 03:50:07 Re: Simplify code building the LR conflict messages
Previous Message David G. Johnston 2026-01-21 03:08:36 Re: docs: clarify ALTER TABLE behavior on partitioned tables