Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

From: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search
Date: 2026-04-03 09:56:02
Message-ID: CA+v5N41vLidOQzFh1FqninmCRi-2gGvphZU_RVF7qiDyCu4DKg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 2, 2026 at 5:44 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:

>
>
> > On Feb 27, 2026, at 13:59, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
> wrote:
> >
> >
> >
> > On Thu, Feb 26, 2026 at 8:48 AM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
> >
> >
> > > On Feb 1, 2026, at 19:02, Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
> wrote:
> > >
> > >
> > >
> > >
> > > On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <
> florents(dot)tselai(at)gmail(dot)com> wrote:
> > > Hi,
> > >
> > > in real-life I work a lot with json & fts search, here's a feature
> I've always wished I had,
> > > but never tackle it. Until yesterday that is.
> > >
> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
> @.body tsmatch "performance")');
> > >
> > > This patch introduces a tsmatch boolean operator to the JSONPath
> engine.
> > > By integrating FTS natively into path expressions,
> > > this operator allows for high-precision filtering of nested JSONB
> structures—
> > > solving issues with structural ambiguity and query complexity.
> > >
> > > Currently, users must choose between two suboptimal paths for FTS-ing
> nested JSON:
> > > - Imprecise Global Indexing
> > > jsonb_to_tsvector aggregates text into a flat vector.
> > > This ignores JSON boundaries, leading to false positives when the same
> key (e.g., "body")
> > > appears in different contexts (e.g., a "Product Description" vs. a
> "Customer Review").
> > >
> > > - Complex SQL Workarounds
> > > Achieving 100% precision requires unnesting the document via
> jsonb_array_elements and LATERAL joins.
> > > This leads to verbose SQL and high memory overhead from generating
> intermediate heap tuples.
> > >
> > > One of the most significant advantages of tsmatch is its ability to
> participate in multi-condition predicates
> > > within the same JSON object - something jsonb_to_tsvector cannot do.
> > >
> > > SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&
> @.body tsmatch "performance")');
> > >
> > > In a flat vector, the association between "Alice" and "performance" is
> lost.
> > > tsmatch preserves this link by evaluating the FTS predicate in-place
> during path traversal.
> > >
> > > While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly
> define an FTS operator,
> > > tsmatch is architecturally modeled after the standard-defined
> like_regex.
> > >
> > > The implementation follows the like_regex precedent:
> > > it is a non-indexable predicate that relies on GIN path-matching for
> pruning and heap re-checks for precision.
> > > Caching is scoped to the JsonPathExecContext,
> > > ensuring 'compile-once' efficiency per execution without violating the
> stability requirements of prepared statements.
> > >
> > > This initial implementation uses plainto_tsquery.
> > > However, the grammar is designed to support a "mode" flag (similar to
> like_regex flags)
> > > in future iterations to toggle between to_tsquery,
> websearch_to_tsquery, and phraseto_tsquery.
> > >
> > > Here's a v2, that implements the tsqparser clause
> > >
> > > So this should now work too
> > >
> > > select jsonb_path_query_array('["fast car", "slow car", "fast and
> furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w")
> <v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
> >
> > Hi Florents,
> >
> > Grant pinged me about this. I can review it in coming days. Can you
> please rebase it? I failed to apply to current master. Also, the CF
> reported a failure test case, please take a look.
> >
> > Hi Evan,
> > thanks for having a look. The conflict was due to the intro of
> pg_fallthrough. Not related to this patch .
> >
> > I noticed the failure too, but I'm having a hard time reproducing it tbh.
> > This fails for Debian Trixie with Meson. The same with Autoconf passes...
> >
> > https://github.com/Florents-Tselai/postgres/runs/65098077968
> >
> >
> >
> >
> > <v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>
>
> I have reviewed v3 and traced a few test cases. Here comes my review
> comments:
>
> 1
> ```
> + <replaceable>string</replaceable> <literal>tsmatch</literal>
> <replaceable>string</replaceable>
> + <optional> <literal>tsconfig</literal>
> <replaceable>string</replaceable> </optional>
> + <optional> <literal>tsqparser</literal>
> <replaceable>string</replaceable> </optional>
> ```
>
> For all “replaceable”, instead of “string”, would it be better to use
> something more descriptive? For example:
> ```
> <replaceable>json_string</replaceable> <literal>tsmatch</literal>
> <replaceable>query</replaceable>
> <optional> <literal>tsconfig</literal>
> <replaceable>config_name</replaceable> </optional>
> <optional> <literal>tsqparser</literal>
> <replaceable>parser_mode</replaceable> </optional>
> ```
>
> 2 - jsonpath_gram.y
> ```
> +static bool makeItemTsMatch(JsonPathParseItem *doc,
> + JsonPathString
> *tsquery,
> + JsonPathString
> *tsconfig,
> + JsonPathString
> *tsquery_parser,
> +
> JsonPathParseItem ** result,
> + struct Node
> *escontext);
> ```
>
> Format Nit: Looking at the existing code, the J in the second and
> following lines, should be placed in the same column as the J in the first
> line.
>
> 3 - jsonpath_gram.y
> ```
> + | expr TSMATCH_P STRING_P
> + {
> + JsonPathParseItem *jppitem;
> + /* Pass NULL for tsconfig (3rd) and NULL for
> tsquery_parser (4th) */
> + if (! makeItemTsMatch($1, &$3, NULL, NULL, &jppitem,
> escontext))
> + YYABORT;
> + $$ = jppitem;
> + }
> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P
> + {
> + JsonPathParseItem *jppitem;
> + /* Pass NULL for tsquery_parser (4th) */
> + if (! makeItemTsMatch($1, &$3, &$5, NULL, &jppitem,
> escontext))
> + YYABORT;
> + $$ = jppitem;
> + }
> + | expr TSMATCH_P STRING_P TSQUERYPARSER_P STRING_P
> + {
> + JsonPathParseItem *jppitem;
> + /* Pass NULL for tsconfig (3rd) */
> + if (! makeItemTsMatch($1, &$3, NULL, &$5, &jppitem,
> escontext))
> + YYABORT;
> + $$ = jppitem;
> + }
> + | expr TSMATCH_P STRING_P TSCONFIG_P STRING_P TSQUERYPARSER_P
> STRING_P
> + {
> + JsonPathParseItem *jppitem;
> + if (! makeItemTsMatch($1, &$3, &$5, &$7, &jppitem,
> escontext))
> + YYABORT;
> + $$ = jppitem;
> + }
> ```
>
> Feels a little redundant, repeatedly calls makeItemTsMatch. See the
> attached diff for a simplification. But my version is a bit longer in terms
> of number of lines. So, up to you.
>
> 4 - jsonpath_gram.y
> ```
> +static bool
> +makeItemTsMatch(JsonPathParseItem *doc,
> + JsonPathString *tsquery,
> + JsonPathString *tsconfig,
> + JsonPathString *tsquery_parser,
> + JsonPathParseItem **result,
> + struct Node *escontext)
> ```
>
> makeItemTsMatch doesn’t need to return a bool. Actually, now it never
> returns false, instead, it just ereport(ERROR).
>
> 5 - jsonpath.h
> ```
> + struct
> + {
> + int32 doc;
> + char *tsquery;
> + uint32 tsquerylen;
> + int32 tsconfig;
> + char *tsqparser;
> + uint32 tsqparser_len;
> + } tsmatch;
>
> + struct
> + {
> + JsonPathParseItem *doc;
> + char *tsquery;
> + uint32 tsquerylen;
> + JsonPathParseItem *tsconfig;
> + char *tsqparser;
> + uint32 tsqparser_len;
> + } tsmatch;
> } value;
> ```
>
> tsquerylen doesn’t have _ before len, and tsqparser_len, would it be
> better to make naming conventions consistent in the same structure?
>
> 6 - jsonpath_exec.c
> ```
> #include "tsearch/ts_utils.h"
> #include "tsearch/ts_cache.h"
> #include "utils/regproc.h"
> #include "catalog/namespace.h"
>
> static JsonPathBool
> executeTsMatch(JsonPathItem *jsp, JsonbValue *str, JsonbValue *rarg,
> void *param)
> ```
>
> Why don’t put these includes to the header section together with other
> includes?
>
> 7 - jsonpath_exec.c
> ```
> + else
> + {
> + /*
> + * Fallback or Error for unknown flags
> (should be caught by
> + * parser)
> + */
> + ereport(ERROR,
> +
> (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("unrecognized
> tsqparser flag")));
> + }
> ```
>
> This “else” should never be entered as the same check has been done by
> makeItemTsMatch. So, maybe just use an Assert here, or pg_unreachable().
>
> 8 - jsonpath_exec.c
> ```
> + /* Setup Context (Run ONLY once per predicate) */
> + if (!cxt->initialized)
> ```
>
> While tracing this SQL:
> ```
> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
> evantest-# @@ '$.tags[*] ? (@ tsmatch "run" tsconfig "english")';
> ?column?
> ----------
>
> (1 row)
> ```
>
> I noticed that, when process “jogging”, cxt->initialized is still false,
> meaning that, the cxt is not reused across array items. Given the same
> tsconfig should apply to all array items, I think cxt should be reused.
>
> 9 - jsonpath_exec.c
> ```
> + /* Select Parser and Compile Query */
> + parser_mode = jsp->content.tsmatch.tsqparser;
> + parser_len = jsp->content.tsmatch.tsqparser_len;
> +
> + if (parser_len > 0)
> + {
> + /* Dispatch based on flag */
> + if (pg_strncasecmp(parser_mode, "pl", parser_len)
> == 0)
> ```
>
> Nit: parser_mode is only used inside if (parser_len > 0), it can be
> defined inside the “if”.
>
> 10 - jsonpath_gram.y
> ```
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("invalid tsquery_parser
> value: \"%s\"", tsquery_parser->val),
> + errhint("Valid values are
> \"pl\", \"ph\", and \"w\".")));
> ```
>
> When tested a case with an invalid parser, I got:
> ```
> evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
>
> @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqparser
> "pss")';
> ERROR: invalid tsquery_parser value: "pss @"
> LINE 2: @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqpar...
> ^
> HINT: Valid values are "pl", "ph", and "w".
> ```
>
> You can see the it shows a bad looking invalid value. I think that’s
> because tsquery_parser->val is not NULL terminated. I fixed this problem
> with:
> ```
> errmsg("invalid tsquery_parser value: \"%.*s\"", (int)
> tsquery_parser->len, tsquery_parser->val),
> ```
>
> This change is also included in the attached diff file.
>
> 11 - jsonpath.c
> ```
> + if (printBracketes)
> + appendStringInfoChar(buf, ')');
> + break;
> +
> if (printBracketes)
> appendStringInfoChar(buf, ')');
> ```
>
> Duplicate code. Looks like a copy-pasto.
>
> 12 - jsonpath.c
> ```
> + /* Write the Main Query String */
> + appendBinaryStringInfo(buf,
> +
> &item->value.tsmatch.tsquerylen,
> +
> sizeof(item->value.tsmatch.tsquerylen));
> + appendBinaryStringInfo(buf,
> +
> item->value.tsmatch.tsquery,
> +
> item->value.tsmatch.tsquerylen);
> + appendStringInfoChar(buf, '\0');
> ```
>
> I don’t think we need to manually append ‘\0’ after
> appendBinaryStringInfo. Looking at the header comment of
> appendBinaryStringInfo, it says that a trailing null will be added.
> ```
> /*
> * appendBinaryStringInfo
> *
> * Append arbitrary binary data to a StringInfo, allocating more space
> * if necessary. Ensures that a trailing null byte is present.
> */
> void
> appendBinaryStringInfo(StringInfo str, const void *data, int datalen)
> ```
>

Here's a v4 which incorporates most of Evan's comments & feedback
- shifted tsquery compilation logic to use persistent cache within
JsonPathExecContext
- Fixed a binary serialization alignment issue which caused Dixie to fail
earlier
- I've refactor and simplified the grammar per Evan's input by adding a
tsmatch_opts rule.
- Also updated the docs per Evan's comments

Attachment Content-Type Size
v4-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch application/octet-stream 33.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message cca5507 2026-04-03 10:06:46 Re: Buffer locking is special (hints, checksums, AIO writes)
Previous Message Amit Langote 2026-04-03 09:39:48 Re: Eliminating SPI / SQL from some RI triggers - take 3