| From: | Florents Tselai <florents(dot)tselai(at)gmail(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search |
| Date: | 2026-04-04 09:38:18 |
| Message-ID: | CA+v5N41hB-HhZe_3=Ntp3nsiJqf7vLX+hj0o4HhPKWK+2qfReA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Here's a v5 which is v4 + pgindent
On Fri, Apr 3, 2026 at 12:56 PM Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
wrote:
>
>
>
> On Mon, Mar 2, 2026 at 5:44 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>
>>
>>
>> > On Feb 27, 2026, at 13:59, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
>> wrote:
>> >
>> >
>> >
>> > On Thu, Feb 26, 2026 at 8:48 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>> >
>> >
>> > > On Feb 1, 2026, at 19:02, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
>> wrote:
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <
>> florents(dot)tselai(at)gmail(dot)com> wrote:
>> > > Hi,
>> > >
>> > > in real-life I work a lot with json & fts search, here's a feature
>> I've always wished I had,
>> > > but never tackle it. Until yesterday that is.
>> > >
>> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
>> @.body tsmatch "performance")');
>> > >
>> > > This patch introduces a tsmatch boolean operator to the JSONPath
>> engine.
>> > > By integrating FTS natively into path expressions,
>> > > this operator allows for high-precision filtering of nested JSONB
>> structures—
>> > > solving issues with structural ambiguity and query complexity.
>> > >
>> > > Currently, users must choose between two suboptimal paths for FTS-ing
>> nested JSON:
>> > > - Imprecise Global Indexing
>> > > jsonb_to_tsvector aggregates text into a flat vector.
>> > > This ignores JSON boundaries, leading to false positives when the
>> same key (e.g., "body")
>> > > appears in different contexts (e.g., a "Product Description" vs. a
>> "Customer Review").
>> > >
>> > > - Complex SQL Workarounds
>> > > Achieving 100% precision requires unnesting the document via
>> jsonb_array_elements and LATERAL joins.
>> > > This leads to verbose SQL and high memory overhead from generating
>> intermediate heap tuples.
>> > >
>> > > One of the most significant advantages of tsmatch is its ability to
>> participate in multi-condition predicates
>> > > within the same JSON object - something jsonb_to_tsvector cannot do.
>> > >
>> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
>> @.body tsmatch "performance")');
>> > >
>> > > In a flat vector, the association between "Alice" and "performance"
>> is lost.
>> > > tsmatch preserves this link by evaluating the FTS predicate in-place
>> during path traversal.
>> > >
>> > > While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly
>> define an FTS operator,
>> > > tsmatch is architecturally modeled after the standard-defined
>> like_regex.
>> > >
>> > > The implementation follows the like_regex precedent:
>> > > it is a non-indexable predicate that relies on GIN path-matching for
>> pruning and heap re-checks for precision.
>> > > Caching is scoped to the JsonPathExecContext,
>> > > ensuring 'compile-once' efficiency per execution without violating
>> the stability requirements of prepared statements.
>> > >
>> > > This initial implementation uses plainto_tsquery.
>> > > However, the grammar is designed to support a "mode" flag (similar to
>> like_regex flags)
>> > > in future iterations to toggle between to_tsquery,
>> websearch_to_tsquery, and phraseto_tsquery.
>> > >
>> > > Here's a v2, that implements the tsqparser clause
>> > >
>> > > So this should now work too
>> > >
>> > > select jsonb_path_query_array('["fast car", "slow car", "fast and
>> furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w")
>> <v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
>> >
>> > Hi Florents,
>> >
>> > Grant pinged me about this. I can review it in coming days. Can you
>> please rebase it? I failed to apply to current master. Also, the CF
>> reported a failure test case, please take a look.
>> >
>> > Hi Evan,
>> > thanks for having a look. The conflict was due to the intro of
>> pg_fallthrough. Not related to this patch .
>> >
>> > I noticed the failure too, but I'm having a hard time reproducing it
>> tbh.
>> > This fails for Debian Trixie with Meson. The same with Autoconf
>> passes...
>> >
>> > https://github.com/Florents-Tselai/postgres/runs/65098077968
>> >
>> >
>> >
>> >
>> > <v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
>>
>> I have reviewed v3 and traced a few test cases. Here comes my review
>> comments:
>>
>> 1
>> ```
>> + <replaceable>string</replaceable> <literal>tsmatch</literal>
>> <replaceable>string</replaceable>
>> + <optional> <literal>tsconfig</literal>
>> <replaceable>string</replaceable> </optional>
>> + <optional> <literal>tsqparser</literal>
>> <replaceable>string</replaceable> </optional>
>> ```
>>
>> For all “replaceable”, instead of “string”, would it be better to use
>> something more descriptive? For example:
>> ```
>> <replaceable>json_string</replaceable> <literal>tsmatch</literal>
>> <replaceable>query</replaceable>
>> <optional> <literal>tsconfig</literal>
>> <replaceable>config_name</replaceable> </optional>
>> <optional> <literal>tsqparser</literal>
>> <replaceable>parser_mode</replaceable> </optional>
>> ```
>>
>> 2 - jsonpath_gram.y
>> ```
>> +static bool makeItemTsMatch(JsonPathParseItem *doc,
>> + JsonPathString
>> *tsquery,
>> + JsonPathString
>> *tsconfig,
>> + JsonPathString
>> *tsquery_parser,
>> +
>> JsonPathParseItem ** result,
>> + struct Node
>> *escontext);
>> ```
>>
>> Format Nit: Looking at the existing code, the J in the second and
>> following lines, should be placed in the same column as the J in the first
>> line.
>>
>> 3 - jsonpath_gram.y
>> ```
>> + | expr TSMATCH_P STRING_P
>> + {
>> + JsonPathParseItem *jppitem;
>> + /* Pass NULL for tsconfig (3rd) and NULL for
>> tsquery_parser (4th) */
>> + if (! makeItemTsMatch($1, &$3, NULL, NULL, &jppitem,
>> escontext))
>> + YYABORT;
>> + $$ = jppitem;
>> + }
>> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P
>> + {
>> + JsonPathParseItem *jppitem;
>> + /* Pass NULL for tsquery_parser (4th) */
>> + if (! makeItemTsMatch($1, &$3, &$5, NULL, &jppitem,
>> escontext))
>> + YYABORT;
>> + $$ = jppitem;
>> + }
>> + | expr TSMATCH_P STRING_P TSQUERYPARSER_P STRING_P
>> + {
>> + JsonPathParseItem *jppitem;
>> + /* Pass NULL for tsconfig (3rd) */
>> + if (! makeItemTsMatch($1, &$3, NULL, &$5, &jppitem,
>> escontext))
>> + YYABORT;
>> + $$ = jppitem;
>> + }
>> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P TSQUERYPARSER_P
>> STRING_P
>> + {
>> + JsonPathParseItem *jppitem;
>> + if (! makeItemTsMatch($1, &$3, &$5, &$7, &jppitem,
>> escontext))
>> + YYABORT;
>> + $$ = jppitem;
>> + }
>> ```
>>
>> Feels a little redundant, repeatedly calls makeItemTsMatch. See the
>> attached diff for a simplification. But my version is a bit longer in terms
>> of number of lines. So, up to you.
>>
>> 4 - jsonpath_gram.y
>> ```
>> +static bool
>> +makeItemTsMatch(JsonPathParseItem *doc,
>> + JsonPathString *tsquery,
>> + JsonPathString *tsconfig,
>> + JsonPathString *tsquery_parser,
>> + JsonPathParseItem **result,
>> + struct Node *escontext)
>> ```
>>
>> makeItemTsMatch doesn’t need to return a bool. Actually, now it never
>> returns false, instead, it just ereport(ERROR).
>>
>> 5 - jsonpath.h
>> ```
>> + struct
>> + {
>> + int32 doc;
>> + char *tsquery;
>> + uint32 tsquerylen;
>> + int32 tsconfig;
>> + char *tsqparser;
>> + uint32 tsqparser_len;
>> + } tsmatch;
>>
>> + struct
>> + {
>> + JsonPathParseItem *doc;
>> + char *tsquery;
>> + uint32 tsquerylen;
>> + JsonPathParseItem *tsconfig;
>> + char *tsqparser;
>> + uint32 tsqparser_len;
>> + } tsmatch;
>> } value;
>> ```
>>
>> tsquerylen doesn’t have _ before len, and tsqparser_len, would it be
>> better to make naming conventions consistent in the same structure?
>>
>> 6 - jsonpath_exec.c
>> ```
>> #include "tsearch/ts_utils.h"
>> #include "tsearch/ts_cache.h"
>> #include "utils/regproc.h"
>> #include "catalog/namespace.h"
>>
>> static JsonPathBool
>> executeTsMatch(JsonPathItem *jsp, JsonbValue *str, JsonbValue *rarg,
>> void *param)
>> ```
>>
>> Why don’t put these includes to the header section together with other
>> includes?
>>
>> 7 - jsonpath_exec.c
>> ```
>> + else
>> + {
>> + /*
>> + * Fallback or Error for unknown flags
>> (should be caught by
>> + * parser)
>> + */
>> + ereport(ERROR,
>> +
>> (errcode(ERRCODE_SYNTAX_ERROR),
>> + errmsg("unrecognized
>> tsqparser flag")));
>> + }
>> ```
>>
>> This “else” should never be entered as the same check has been done by
>> makeItemTsMatch. So, maybe just use an Assert here, or pg_unreachable().
>>
>> 8 - jsonpath_exec.c
>> ```
>> + /* Setup Context (Run ONLY once per predicate) */
>> + if (!cxt->initialized)
>> ```
>>
>> While tracing this SQL:
>> ```
>> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
>> evantest-# @@ '$.tags[*] ? (@ tsmatch "run" tsconfig "english")';
>> ?column?
>> ----------
>>
>> (1 row)
>> ```
>>
>> I noticed that, when process “jogging”, cxt->initialized is still false,
>> meaning that, the cxt is not reused across array items. Given the same
>> tsconfig should apply to all array items, I think cxt should be reused.
>>
>> 9 - jsonpath_exec.c
>> ```
>> + /* Select Parser and Compile Query */
>> + parser_mode = jsp->content.tsmatch.tsqparser;
>> + parser_len = jsp->content.tsmatch.tsqparser_len;
>> +
>> + if (parser_len > 0)
>> + {
>> + /* Dispatch based on flag */
>> + if (pg_strncasecmp(parser_mode, "pl", parser_len)
>> == 0)
>> ```
>>
>> Nit: parser_mode is only used inside if (parser_len > 0), it can be
>> defined inside the “if”.
>>
>> 10 - jsonpath_gram.y
>> ```
>> + ereport(ERROR,
>> + (errcode(ERRCODE_SYNTAX_ERROR),
>> + errmsg("invalid tsquery_parser
>> value: \"%s\"", tsquery_parser->val),
>> + errhint("Valid values are
>> \"pl\", \"ph\", and \"w\".")));
>> ```
>>
>> When tested a case with an invalid parser, I got:
>> ```
>> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
>>
>> @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqparser
>> "pss")';
>> ERROR: invalid tsquery_parser value: "pss @"
>> LINE 2: @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqpar...
>> ^
>> HINT: Valid values are "pl", "ph", and "w".
>> ```
>>
>> You can see the it shows a bad looking invalid value. I think that’s
>> because tsquery_parser->val is not NULL terminated. I fixed this problem
>> with:
>> ```
>> errmsg("invalid tsquery_parser value: \"%.*s\"", (int)
>> tsquery_parser->len, tsquery_parser->val),
>> ```
>>
>> This change is also included in the attached diff file.
>>
>> 11 - jsonpath.c
>> ```
>> + if (printBracketes)
>> + appendStringInfoChar(buf, ')');
>> + break;
>> +
>> if (printBracketes)
>> appendStringInfoChar(buf, ')');
>> ```
>>
>> Duplicate code. Looks like a copy-pasto.
>>
>> 12 - jsonpath.c
>> ```
>> + /* Write the Main Query String */
>> + appendBinaryStringInfo(buf,
>> +
>> &item->value.tsmatch.tsquerylen,
>> +
>> sizeof(item->value.tsmatch.tsquerylen));
>> + appendBinaryStringInfo(buf,
>> +
>> item->value.tsmatch.tsquery,
>> +
>> item->value.tsmatch.tsquerylen);
>> + appendStringInfoChar(buf, '\0');
>> ```
>>
>> I don’t think we need to manually append ‘\0’ after
>> appendBinaryStringInfo. Looking at the header comment of
>> appendBinaryStringInfo, it says that a trailing null will be added.
>> ```
>> /*
>> * appendBinaryStringInfo
>> *
>> * Append arbitrary binary data to a StringInfo, allocating more space
>> * if necessary. Ensures that a trailing null byte is present.
>> */
>> void
>> appendBinaryStringInfo(StringInfo str, const void *data, int datalen)
>> ```
>>
>
> Here's a v4 which incorporates most of Evan's comments & feedback
> - shifted tsquery compilation logic to use persistent cache within
> JsonPathExecContext
> - Fixed a binary serialization alignment issue which caused Dixie to fail
> earlier
> - I've refactor and simplified the grammar per Evan's input by adding a
> tsmatch_opts rule.
> - Also updated the docs per Evan's comments
>
>
>
| Attachment | Content-Type | Size |
|---|---|---|
| v5-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch | application/octet-stream | 33.0 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Lukas Fittl | 2026-04-04 09:43:50 | Re: Stack-based tracking of per-node WAL/buffer usage |
| Previous Message | Andrei Lepikhov | 2026-04-04 09:34:35 | Re: pg_plan_advice |