| From: | Henson Choi <assam258(at)gmail(dot)com> |
|---|---|
| To: | Tatsuo Ishii <ishii(at)postgresql(dot)org>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com> |
| Cc: | david(dot)g(dot)johnston(at)gmail(dot)com, vik(at)postgresfriends(dot)org, er(at)xs4all(dot)nl, peter(at)eisentraut(dot)org, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Row pattern recognition |
| Date: | 2026-01-14 15:31:32 |
| Message-ID: | CAAAe_zAgyfStcRAdm9j6QrA9p0pOk3scxpmkAkxi3goSdgxo7Q@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Ishii-san,
The attached patch implements streaming NFA-based pattern matching.
This is based on your v37 patch and the pattern grammar work I submitted
earlier.
This is a testable draft for review, not production-ready yet. It allows
in-depth verification of pattern optimization, pattern matching/merging,
context absorption, and window behavior.
Summary of changes:
1. NFA-based executor replacing regex approach
- Streaming execution: processes one row at a time
- Eliminates regex compilation overhead and the 26-variable limit
- Supports up to 252 pattern variables
Note on the 252 limit:
Each match path stores a sequence of classifiers (one per matched row).
Using a 1-byte encoding keeps memory usage reasonable for long matches.
Some byte values are reserved for pattern representation, so the limit
may decrease slightly if more reserved values are needed; supporting
200+ variables should be sufficient for practical use.
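To make the encoding concrete, here is a minimal sketch of the idea (not the patch's actual code; the number of reserved values and all names are invented for illustration, assuming 4 reserved byte values which leaves 252 for pattern variables):

```c
/* Illustrative sketch of a 1-byte classifier encoding.
 * Assumption: byte values 0..3 are reserved for internal
 * pattern-representation markers; the rest map to variables. */
#include <assert.h>
#include <stdint.h>

#define CLS_RESERVED_MAX 3                      /* reserved markers */
#define CLS_FIRST_VAR    (CLS_RESERVED_MAX + 1)
#define CLS_MAX_VARS     (256 - CLS_FIRST_VAR)  /* = 252 */

/* Map a 0-based pattern-variable index to its classifier byte. */
static uint8_t
classifier_for_var(int varno)
{
    assert(varno >= 0 && varno < CLS_MAX_VARS);
    return (uint8_t) (CLS_FIRST_VAR + varno);
}
```

The point is simply that one byte per matched row bounds per-path memory, and every reserved value taken from the 256-value space shrinks the variable budget by one.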
2. Pattern optimization moved to planner phase (as you suggested)
- pg_get_viewdef() now shows original pattern
- EXPLAIN shows optimized pattern
- Optimizations: VAR merge (A A A -> A{3}), GROUP merge
  ((A B) (A B)+ -> (A B){2,}), duplicate removal, SEQ/ALT flattening,
  quantifier multiplication
- These optimizations also enable absorption for patterns that wouldn't
  otherwise qualify (e.g., (A B) (A B)+ -> (A B){2,} becomes absorbable)
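As a rough illustration of the VAR-merge rewrite (again a standalone sketch, not the patch's code, which operates on parse nodes rather than strings): adjacent runs of the same variable collapse into one quantified element.

```c
/* Sketch of VAR merge: collapse runs of identical adjacent
 * variables, e.g. "AAAB" (one letter per variable) -> "A{3} B".
 * The real optimization works on pattern parse trees, not strings. */
#include <stdio.h>
#include <string.h>

static void
merge_vars(const char *pat, char *out, size_t outsz)
{
    size_t n = strlen(pat);
    size_t o = 0;

    for (size_t i = 0; i < n; )
    {
        size_t run = 1;

        while (i + run < n && pat[i + run] == pat[i])
            run++;                      /* count the run of pat[i] */
        if (o > 0 && o < outsz - 1)
            out[o++] = ' ';
        if (run == 1)
            o += snprintf(out + o, outsz - o, "%c", pat[i]);
        else
            o += snprintf(out + o, outsz - o, "%c{%zu}", pat[i], run);
        i += run;
    }
    out[o] = '\0';
}
```

For example, merge_vars("AAAB", ...) yields "A{3} B", while "ABAB" is left as "A B A B" (no adjacent duplicates); GROUP merge applies the same idea one level up.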
3. Multi-context support with context absorption
- Multiple concurrent NFA contexts for overlapping matches
- Absorption optimization: earlier context absorbs later ones
when they reach the same endpoint
Note on absorption scope:
I haven't fully worked out the theoretical bounds for safe absorption
yet.
The current implementation covers most practically useful cases, but a
complete solution would require more formal analysis.
Current implementation handles cases where the behavior is clear:
- Only patterns with unbounded first element (A+, A*, (A B)+, etc.)
- Only when there's exactly one unbounded element at top level
- Per-branch analysis for alternation patterns
This may be extended as we better understand the theoretical limits.
Examples:
- A+ B : absorbable (A+ unbounded first, same endpoint guaranteed)
- A* B : absorbable (A* unbounded first)
- A B+ : NOT absorbable (unbounded not first, different start points)
- A+ B+ : partial (absorbable while in A+, not after entering B+)
- A? B : NOT absorbable (? is bounded: {0,1})
- A{2,} B : absorbable (unbounded quantifier at first element)
- A{2,5} B : NOT absorbable (bounded quantifier)
- (A B){2,}: absorbable (GROUP with unbounded quantifier at top level)
- A+ | B+ : absorbable per-branch (each branch analyzed independently)
- A+ | B C : partial per-branch (A+ absorbable, B C not)
- (A+ B)+ : NOT YET (absorbable at first A+ entry; nested analysis not
  implemented)
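The top-level yes/no part of this rule can be sketched as follows (types and names invented for illustration; this deliberately ignores the per-phase "partial" cases like A+ B+ and the nested case, which the patch handles separately or not yet):

```c
/* Sketch of the full-branch absorbability rule described above:
 * the first top-level element must be unbounded, and it must be
 * the only unbounded element in the branch. */
#include <stdbool.h>

typedef struct PatElem
{
    int qmin;   /* quantifier lower bound */
    int qmax;   /* upper bound; -1 means unbounded */
} PatElem;

static bool
branch_is_absorbable(const PatElem *elems, int nelems)
{
    if (nelems == 0 || elems[0].qmax != -1)
        return false;       /* unbounded element must come first */
    for (int i = 1; i < nelems; i++)
    {
        if (elems[i].qmax == -1)
            return false;   /* only one unbounded element allowed */
    }
    return true;
}
```

Under this check, A+ B ({1,-1} {1,1}) and A{2,} B pass, while A B+, A? B, and A{2,5} B fail, matching the examples above; for alternation, the check runs once per branch.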
4. Reluctant quantifiers
- Will be supported in a follow-up patch
- I need to study some edge cases in the spec first
Also updated: parser/README, typedefs.list
Testing:
The rpr regression test has been expanded significantly with tests for:
- Alternation and grouping patterns
- Quantifiers ({n}, {n,}, {n,m}, +, *, ?)
- Pattern optimization verification
- Multi-context matching
- Context absorption behavior
- Lexical order compliance
Below is a reference query designed to test NFA functionality in a clean,
unambiguous way. Each row carries explicit ARRAY flags, and DEFINE tests
membership with ANY(), so every row's pattern membership is explicit, the
matching behavior is completely predictable, and multi-context execution
and alternation handling are easy to verify.
Example (multi-context overlapping match test):
WITH data AS (
  SELECT * FROM (VALUES
    (1, ARRAY['A']), (2, ARRAY['B']), (3, ARRAY['C']),
    (4, ARRAY['D']), (5, ARRAY['E']), (6, ARRAY['F'])
  ) AS t(id, flags)
)
SELECT id, flags,
       first_value(id) OVER w AS start,
       last_value(id) OVER w AS "end"
FROM data
WINDOW w AS (
  ORDER BY id
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  AFTER MATCH SKIP TO NEXT ROW
  PATTERN (A B C D E | B C D | C D E F)
  DEFINE A AS 'A' = ANY(flags), B AS 'B' = ANY(flags), ...
);
-- Result (SKIP TO NEXT ROW enables multi-context):
id | flags | start | end
----+-------+-------+-----
1 | {A} | 1 | 5 <- A B C D E
2 | {B} | 2 | 4 <- B C D
3 | {C} | 3 | 6 <- C D E F
4 | {D} | |
5 | {E} | |
6 | {F} | |
Performance:
Regression test timing (rpr test):
Parallel group (with other tests):
Before (v37 regex): 198 ms
After (NFA): 123 ms (~38% faster)
After (NFA + tests): 127 ms (still ~36% faster)
Single test (isolated):
Before (v37 regex): 38 ms
After (NFA): 26 ms (~32% faster)
After (NFA + tests): 41 ms (+8%, more tests added)
Note: The rpr test was moved out of the parallel group in the schedule
for more accurate timing measurement.
The streaming NFA approach eliminates per-row regex compilation overhead
and reduces redundant DEFINE evaluations, resulting in better performance,
especially for complex patterns.
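To show what "streaming" means here, a toy matcher for the fixed pattern A+ B (an illustration only, not the patch's executor): each incoming row is classified once, and every active state advances by exactly one transition, so no text buffer is built and no regex is compiled per row.

```c
/* Toy streaming matcher for the fixed pattern A+ B.
 * States: need first A -> inside A+ -> accepted (or dead). */
#include <stdbool.h>

typedef enum { S_START, S_IN_A, S_ACCEPT, S_DEAD } State;

/* Advance one state by one row, given that row's DEFINE results. */
static State
step(State s, bool is_a, bool is_b)
{
    switch (s)
    {
        case S_START:
            return is_a ? S_IN_A : S_DEAD;
        case S_IN_A:
            return is_a ? S_IN_A : (is_b ? S_ACCEPT : S_DEAD);
        default:
            return S_DEAD;
    }
}

/* Feed rows one at a time, as the streaming executor would. */
static State
run(const bool *row_is_a, const bool *row_is_b, int nrows)
{
    State s = S_START;

    for (int i = 0; i < nrows; i++)
        s = step(s, row_is_a[i], row_is_b[i]);
    return s;
}
```

Feeding rows A, A, B reaches the accept state on the third row; the real executor keeps one such state set per active context and applies absorption when contexts converge.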
Let me know if you have any questions or concerns.
Best regards,
Henson
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-NFA-based-pattern-matching.txt | text/plain | 312.7 KB |