| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
| Cc: | jacob(dot)champion(at)enterprisedb(dot)com, david(dot)g(dot)johnston(at)gmail(dot)com, vik(at)postgresfriends(dot)org, er(at)xs4all(dot)nl, peter(at)eisentraut(dot)org, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Row pattern recognition |
| Date: | 2026-01-21 03:27:15 |
| Message-ID: | CAAAe_zAEg7sVM=WDwXMyE-odGmQyXSVi5ZzWgye6SupSjdMKpg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi hackers,
I'd like to propose an extension to the Row Pattern Recognition (RPR)
absorption mechanism to handle anchored patterns (those starting with
START or similar prefix elements).
== Background ==
The current RPR implementation includes an absorption optimization that
reduces time complexity from O(n²) to O(n) for unbounded quantifiers
like A+. When multiple match contexts are at the same ABSORBABLE element,
the context that started earlier (having matched more rows) absorbs
later-starting contexts, eliminating redundant processing.
For example, with pattern "A+ B":
- Ctx1 starts at row 0, matching A
- Ctx2 starts at row 1, matching A
- Ctx1 always has more A's → Ctx2 is absorbed (removed)
- Result: Only 1 context maintained, O(n) complexity
== The Problem ==
This absorption mechanism breaks down for anchored patterns like
"START SECOND A+ B". The PREFIX elements (START, SECOND) block
absorption because contexts at different PREFIX positions cannot
be meaningfully compared.
Without absorption, we regress to O(n²) behavior as contexts
accumulate without being eliminated.
== Proposed Solution: Original/Alternate Simultaneous Progression ==
The key insight is to track two logical "paths" within each context:
1. Original: Starts at Element 0 (PREFIX + BODY)
- can_absorb = true
- Can be used as matchedState (actual match result)
2. Alternate: Starts at Element 2 (BODY only, skipping PREFIX)
- can_be_absorbed = true
- Used only for absorption eligibility checking
- Cannot be used as matchedState
Both paths process the same row data simultaneously. The alternate
path serves as a "shadow" that indicates whether the context is
eligible for absorption.
== Absorption Conditions ==
For Ctx1 to absorb Ctx2, all conditions must be met:
1. Ctx1.can_absorb = true (original not outside pattern region)
2. Ctx2.can_be_absorbed = true (alternate still alive in BODY)
3. Ctx2's original has reached ABSORBABLE (PREFIX matching confirmed)
4. Ctx1 started before Ctx2
Note: No count comparison is needed. The alternate being alive
confirms absorption eligibility; the earlier-starting context
always wins.
== Why Wait for Ctx2's Original to Reach ABSORBABLE? ==
We cannot absorb Ctx2 immediately when it starts. We must wait
until Ctx2's original successfully passes through PREFIX elements
(START, SECOND) and reaches the ABSORBABLE point (A+). This
confirms that Ctx2's PREFIX matching succeeded before we remove it.
== Why Absorption is Safe (Common Fate) ==
Once both contexts reach the same ABSORBABLE element (A+), they
share a "common fate":
- Same element position, same future data stream
- Same choices ahead (continue A+ or transition to B)
- If Ctx1 succeeds → Ctx2 would have succeeded too (but shorter match)
- If Ctx1 fails → Ctx2 would have failed too (nothing lost)
- The only difference is in the past (rows already matched),
which doesn't affect future matching decisions
This property guarantees that absorption is safe: we never lose
a potential match by absorbing a later-starting context into an
earlier one.
== Flag Design ==
Compile-time (Element flags):
- RPR_ELEM_ABSORBABLE: Unchanged (marks WHERE comparison happens)
- RPR_ELEM_ABSORBABLE_BRANCH_PREFIX: New (PREFIX region, blocks
absorption)
- RPR_ELEM_ABSORBABLE_BRANCH_BODY: New (BODY region, allows absorption)
Runtime (State variables):
- can_absorb: "Can absorb others" (original=true, alternate=false)
- can_be_absorbed: "Can be absorbed" (PREFIX→false, BODY→true)
Both flags are monotonic: once false, they cannot become true again.
== Complexity Analysis ==
With this extension:
- Maximum concurrent contexts: PREFIX_length + 1
- For "START SECOND A+ B": max 3 contexts at any time
- Overall complexity: O(n) maintained
Example flow for "START SECOND A+ B" with data r0-r4(A,A,A,A,B):
Row 0: Ctx1 starts [1]
Row 1: Ctx2 starts (cannot absorb: PREFIX not confirmed) [2]
Row 2: Ctx3 starts [3]
Row 3: Ctx4 starts, Ctx2 absorbed (PREFIX confirmed) [3]
Row 4: Ctx1 matches B, Ctx3-5 discarded [1]
Note: This is not an immediate priority. Stabilizing the current
RPR patch comes first. I'm sharing this design early to get feedback
and ensure the approach is sound before implementation.
I have a working design document with detailed state transition
rules and examples. Happy to share more details.
Thoughts?
--
Best regards,
Henson
| From | Date | Subject | |
|---|---|---|---|
| Next Message | shveta malik | 2026-01-21 03:50:07 | Re: Simplify code building the LR conflict messages |
| Previous Message | David G. Johnston | 2026-01-21 03:08:36 | Re: docs: clarify ALTER TABLE behavior on partitioned tables |