Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

From: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search
Date: 2026-04-05 09:15:42
Message-ID: CA+v5N432xJwp7MYLU2Gau3XkUFE6cgKDbS7fOFycKvZUdH-9MQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Apr 4, 2026 at 12:38 PM Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
wrote:

> Here's a v5 which is v4 + pgindent
>
>
>
>
> On Fri, Apr 3, 2026 at 12:56 PM Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
> wrote:
>
>>
>>
>>
>> On Mon, Mar 2, 2026 at 5:44 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>
>>>
>>>
>>> > On Feb 27, 2026, at 13:59, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
>>> wrote:
>>> >
>>> >
>>> >
>>> > On Thu, Feb 26, 2026 at 8:48 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
>>> wrote:
>>> >
>>> >
>>> > > On Feb 1, 2026, at 19:02, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
>>> wrote:
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <
>>> florents(dot)tselai(at)gmail(dot)com> wrote:
>>> > > Hi,
>>> > >
>>> > > in real-life I work a lot with json & fts search, here's a feature
>>> I've always wished I had,
>>> > > but never tackle it. Until yesterday that is.
>>> > >
>>> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
>>> @.body tsmatch "performance")');
>>> > >
>>> > > This patch introduces a tsmatch boolean operator to the JSONPath
>>> engine.
>>> > > By integrating FTS natively into path expressions,
>>> > > this operator allows for high-precision filtering of nested JSONB
>>> structures—
>>> > > solving issues with structural ambiguity and query complexity.
>>> > >
>>> > > Currently, users must choose between two suboptimal paths for
>>> FTS-ing nested JSON:
>>> > > - Imprecise Global Indexing
>>> > > jsonb_to_tsvector aggregates text into a flat vector.
>>> > > This ignores JSON boundaries, leading to false positives when the
>>> same key (e.g., "body")
>>> > > appears in different contexts (e.g., a "Product Description" vs. a
>>> "Customer Review").
>>> > >
>>> > > - Complex SQL Workarounds
>>> > > Achieving 100% precision requires unnesting the document via
>>> jsonb_array_elements and LATERAL joins.
>>> > > This leads to verbose SQL and high memory overhead from generating
>>> intermediate heap tuples.
>>> > >
>>> > > One of the most significant advantages of tsmatch is its ability to
>>> participate in multi-condition predicates
>>> > > within the same JSON object - something jsonb_to_tsvector cannot do.
>>> > >
>>> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
>>> @.body tsmatch "performance")');
>>> > >
>>> > > In a flat vector, the association between "Alice" and "performance"
>>> is lost.
>>> > > tsmatch preserves this link by evaluating the FTS predicate in-place
>>> during path traversal.
>>> > >
>>> > > While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly
>>> define an FTS operator,
>>> > > tsmatch is architecturally modeled after the standard-defined
>>> like_regex.
>>> > >
>>> > > The implementation follows the like_regex precedent:
>>> > > it is a non-indexable predicate that relies on GIN path-matching for
>>> pruning and heap re-checks for precision.
>>> > > Caching is scoped to the JsonPathExecContext,
>>> > > ensuring 'compile-once' efficiency per execution without violating
>>> the stability requirements of prepared statements.
>>> > >
>>> > > This initial implementation uses plainto_tsquery.
>>> > > However, the grammar is designed to support a "mode" flag (similar
>>> to like_regex flags)
>>> > > in future iterations to toggle between to_tsquery,
>>> websearch_to_tsquery, and phraseto_tsquery.
>>> > >
>>> > > Here's a v2, that implements the tsqparser clause
>>> > >
>>> > > So this should now work too
>>> > >
>>> > > select jsonb_path_query_array('["fast car", "slow car", "fast and
>>> furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w")
>>> <v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
>>> >
>>> > Hi Florents,
>>> >
>>> > Grant pinged me about this. I can review it in coming days. Can you
>>> please rebase it? I failed to apply to current master. Also, the CF
>>> reported a failure test case, please take a look.
>>> >
>>> > Hi Evan,
>>> > thanks for having a look. The conflict was due to the intro of
>>> pg_fallthrough. Not related to this patch .
>>> >
>>> > I noticed the failure too, but I'm having a hard time reproducing it
>>> tbh.
>>> > This fails for Debian Trixie with Meson. The same with Autoconf
>>> passes...
>>> >
>>> > https://github.com/Florents-Tselai/postgres/runs/65098077968
>>> >
>>> >
>>> >
>>> >
>>> > <v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
>>>
>>> I have reviewed v3 and traced a few test cases. Here comes my review
>>> comments:
>>>
>>> 1
>>> ```
>>> + <replaceable>string</replaceable> <literal>tsmatch</literal>
>>> <replaceable>string</replaceable>
>>> + <optional> <literal>tsconfig</literal>
>>> <replaceable>string</replaceable> </optional>
>>> + <optional> <literal>tsqparser</literal>
>>> <replaceable>string</replaceable> </optional>
>>> ```
>>>
>>> For all “replaceable”, instead of “string”, would it be better to use
>>> something more descriptive? For example:
>>> ```
>>> <replaceable>json_string</replaceable> <literal>tsmatch</literal>
>>> <replaceable>query</replaceable>
>>> <optional> <literal>tsconfig</literal>
>>> <replaceable>config_name</replaceable> </optional>
>>> <optional> <literal>tsqparser</literal>
>>> <replaceable>parser_mode</replaceable> </optional>
>>> ```
>>>
>>> 2 - jsonpath_gram.y
>>> ```
>>> +static bool makeItemTsMatch(JsonPathParseItem *doc,
>>> + JsonPathString
>>> *tsquery,
>>> + JsonPathString
>>> *tsconfig,
>>> + JsonPathString
>>> *tsquery_parser,
>>> +
>>> JsonPathParseItem ** result,
>>> + struct Node
>>> *escontext);
>>> ```
>>>
>>> Format Nit: Looking at the existing code, the J in the second and
>>> following lines, should be placed in the same column as the J in the first
>>> line.
>>>
>>> 3 - jsonpath_gram.y
>>> ```
>>> + | expr TSMATCH_P STRING_P
>>> + {
>>> + JsonPathParseItem *jppitem;
>>> + /* Pass NULL for tsconfig (3rd) and NULL for
>>> tsquery_parser (4th) */
>>> + if (! makeItemTsMatch($1, &$3, NULL, NULL, &jppitem,
>>> escontext))
>>> + YYABORT;
>>> + $$ = jppitem;
>>> + }
>>> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P
>>> + {
>>> + JsonPathParseItem *jppitem;
>>> + /* Pass NULL for tsquery_parser (4th) */
>>> + if (! makeItemTsMatch($1, &$3, &$5, NULL, &jppitem,
>>> escontext))
>>> + YYABORT;
>>> + $$ = jppitem;
>>> + }
>>> + | expr TSMATCH_P STRING_P TSQUERYPARSER_P STRING_P
>>> + {
>>> + JsonPathParseItem *jppitem;
>>> + /* Pass NULL for tsconfig (3rd) */
>>> + if (! makeItemTsMatch($1, &$3, NULL, &$5, &jppitem,
>>> escontext))
>>> + YYABORT;
>>> + $$ = jppitem;
>>> + }
>>> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P TSQUERYPARSER_P
>>> STRING_P
>>> + {
>>> + JsonPathParseItem *jppitem;
>>> + if (! makeItemTsMatch($1, &$3, &$5, &$7, &jppitem,
>>> escontext))
>>> + YYABORT;
>>> + $$ = jppitem;
>>> + }
>>> ```
>>>
>>> Feels a little redundant, repeatedly calls makeItemTsMatch. See the
>>> attached diff for a simplification. But my version is a bit longer in terms
>>> of number of lines. So, up to you.
>>>
>>> 4 - jsonpath_gram.y
>>> ```
>>> +static bool
>>> +makeItemTsMatch(JsonPathParseItem *doc,
>>> + JsonPathString *tsquery,
>>> + JsonPathString *tsconfig,
>>> + JsonPathString *tsquery_parser,
>>> + JsonPathParseItem **result,
>>> + struct Node *escontext)
>>> ```
>>>
>>> makeItemTsMatch doesn’t need to return a bool. Actually, now it never
>>> returns false, instead, it just ereport(ERROR).
>>>
>>> 5 - jsonpath.h
>>> ```
>>> + struct
>>> + {
>>> + int32 doc;
>>> + char *tsquery;
>>> + uint32 tsquerylen;
>>> + int32 tsconfig;
>>> + char *tsqparser;
>>> + uint32 tsqparser_len;
>>> + } tsmatch;
>>>
>>> + struct
>>> + {
>>> + JsonPathParseItem *doc;
>>> + char *tsquery;
>>> + uint32 tsquerylen;
>>> + JsonPathParseItem *tsconfig;
>>> + char *tsqparser;
>>> + uint32 tsqparser_len;
>>> + } tsmatch;
>>> } value;
>>> ```
>>>
>>> tsquerylen doesn’t have _ before len, and tsqparser_len, would it be
>>> better to make naming conventions consistent in the same structure?
>>>
>>> 6 - jsonpath_exec.c
>>> ```
>>> #include "tsearch/ts_utils.h"
>>> #include "tsearch/ts_cache.h"
>>> #include "utils/regproc.h"
>>> #include "catalog/namespace.h"
>>>
>>> static JsonPathBool
>>> executeTsMatch(JsonPathItem *jsp, JsonbValue *str, JsonbValue *rarg,
>>> void *param)
>>> ```
>>>
>>> Why don’t put these includes to the header section together with other
>>> includes?
>>>
>>> 7 - jsonpath_exec.c
>>> ```
>>> + else
>>> + {
>>> + /*
>>> + * Fallback or Error for unknown flags
>>> (should be caught by
>>> + * parser)
>>> + */
>>> + ereport(ERROR,
>>> +
>>> (errcode(ERRCODE_SYNTAX_ERROR),
>>> + errmsg("unrecognized
>>> tsqparser flag")));
>>> + }
>>> ```
>>>
>>> This “else” should never be entered as the same check has been done by
>>> makeItemTsMatch. So, maybe just use an Assert here, or pg_unreachable().
>>>
>>> 8 - jsonpath_exec.c
>>> ```
>>> + /* Setup Context (Run ONLY once per predicate) */
>>> + if (!cxt->initialized)
>>> ```
>>>
>>> While tracing this SQL:
>>> ```
>>> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
>>> evantest-# @@ '$.tags[*] ? (@ tsmatch "run" tsconfig "english")';
>>> ?column?
>>> ----------
>>>
>>> (1 row)
>>> ```
>>>
>>> I noticed that, when process “jogging”, cxt->initialized is still false,
>>> meaning that, the cxt is not reused across array items. Given the same
>>> tsconfig should apply to all array items, I think cxt should be reused.
>>>
>>> 9 - jsonpath_exec.c
>>> ```
>>> + /* Select Parser and Compile Query */
>>> + parser_mode = jsp->content.tsmatch.tsqparser;
>>> + parser_len = jsp->content.tsmatch.tsqparser_len;
>>> +
>>> + if (parser_len > 0)
>>> + {
>>> + /* Dispatch based on flag */
>>> + if (pg_strncasecmp(parser_mode, "pl",
>>> parser_len) == 0)
>>> ```
>>>
>>> Nit: parser_mode is only used inside if (parser_len > 0), it can be
>>> defined inside the “if”.
>>>
>>> 10 - jsonpath_gram.y
>>> ```
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_SYNTAX_ERROR),
>>> + errmsg("invalid tsquery_parser
>>> value: \"%s\"", tsquery_parser->val),
>>> + errhint("Valid values are
>>> \"pl\", \"ph\", and \"w\".")));
>>> ```
>>>
>>> When tested a case with an invalid parser, I got:
>>> ```
>>> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
>>>
>>> @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqparser
>>> "pss")';
>>> ERROR: invalid tsquery_parser value: "pss @"
>>> LINE 2: @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqpar...
>>> ^
>>> HINT: Valid values are "pl", "ph", and "w".
>>> ```
>>>
>>> You can see the it shows a bad looking invalid value. I think that’s
>>> because tsquery_parser->val is not NULL terminated. I fixed this problem
>>> with:
>>> ```
>>> errmsg("invalid tsquery_parser value: \"%.*s\"", (int)
>>> tsquery_parser->len, tsquery_parser->val),
>>> ```
>>>
>>> This change is also included in the attached diff file.
>>>
>>> 11 - jsonpath.c
>>> ```
>>> + if (printBracketes)
>>> + appendStringInfoChar(buf, ')');
>>> + break;
>>> +
>>> if (printBracketes)
>>> appendStringInfoChar(buf, ')');
>>> ```
>>>
>>> Duplicate code. Looks like a copy-pasto.
>>>
>>> 12 - jsonpath.c
>>> ```
>>> + /* Write the Main Query String */
>>> + appendBinaryStringInfo(buf,
>>> +
>>> &item->value.tsmatch.tsquerylen,
>>> +
>>> sizeof(item->value.tsmatch.tsquerylen));
>>> + appendBinaryStringInfo(buf,
>>> +
>>> item->value.tsmatch.tsquery,
>>> +
>>> item->value.tsmatch.tsquerylen);
>>> + appendStringInfoChar(buf, '\0');
>>> ```
>>>
>>> I don’t think we need to manually append ‘\0’ after
>>> appendBinaryStringInfo. Looking at the header comment of
>>> appendBinaryStringInfo, it says that a trailing null will be added.
>>> ```
>>> /*
>>> * appendBinaryStringInfo
>>> *
>>> * Append arbitrary binary data to a StringInfo, allocating more space
>>> * if necessary. Ensures that a trailing null byte is present.
>>> */
>>> void
>>> appendBinaryStringInfo(StringInfo str, const void *data, int datalen)
>>> ```
>>>
>>
>> Here's a v4 which incorporates most of Evan's comments & feedback
>> - shifted tsquery compilation logic to use persistent cache within
>> JsonPathExecContext
>> - Fixed a binary serialization alignment issue which caused Dixie to fail
>> earlier
>> - I've refactor and simplified the grammar per Evan's input by adding a
>> tsmatch_opts rule.
>> - Also updated the docs per Evan's comments
>>
>>
>>
Andrew was kind enough to give some comments off-list.

Here's a v6

- Fixed soft-error path in makeItemTsMatch - ereport replaced with errsave,
return value now checked at the grammar call site with YYABORT
- Fixed pg_strncasecmp prefix match bug in parser-mode validation - "p" was
being accepted as "pl"
- Fixed get_ts_config_oid to throw on unknown tsconfig instead of passing
InvalidOid downstream
- Grammar restructured to make tsconfig and tsqparser order-independent
- Removed unused JsonTsMatchContext struct and some misplaced #includes

Attachment Content-Type Size
v6-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch application/octet-stream 33.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2026-04-05 10:02:01 Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw
Previous Message Matthias van de Meent 2026-04-05 09:06:27 Re: Better shared data structure management and resizable shared data structures