Re: Empty string in lexeme for tsvector

From: Jean-Christophe Arnu <jcarnu(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Empty string in lexeme for tsvector
Date: 2021-09-30 18:34:15
Message-ID: CAHZmTm3PVmAxAmJ9XNSDyb-mU68m0F3j8DXvq9RygqHA93Jf8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you Tom for your review.

Le mer. 29 sept. 2021 à 21:36, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> a écrit :

> Jean-Christophe Arnu <jcarnu(at)gmail(dot)com> writes:
> > [ empty_string_in_tsvector_v4.patch ]
>
> I looked through this patch a bit. I don't agree with adding
> these new error conditions to tsvector_setweight_by_filter and
> tsvector_delete_arr. Those don't prevent bad lexemes from being
> added to tsvectors, so AFAICS they can have no effect other than
> breaking existing applications. In fact, tsvector_delete_arr is
> one thing you could use to fix existing bad tsvectors, so making
> it throw an error seems actually counterproductive.
>

Agreed.
The new patch included here does not change tsvector_setweight_by_filter()
anymore. Tests and docs are also upgraded.
This patch is not ready yet.

> (By the same token, I think there's a good argument for
> tsvector_delete_arr to just ignore nulls, not throw an error.
> That's a somewhat orthogonal issue, though.)
>

Nulls are now ignored in tsvector_delete_arr().

> What I'm wondering about more than that is whether array_to_tsvector
> is the only place that can inject an empty lexeme ... don't we have
> anything else that can add lexemes without going through the parser?
>

I crawled the docs [1] in order to check each tsvector output from
functions and
operators :

* The only fonctions left that may lead to empty strings seem
both json_to_tsvector() and jsonb_to_tsvector(). Both functions use
parsetext
(in ts_parse.c) that seems to behave correctly and don't create "empty
string".
* concat operator "||' allows to compute a ts_vector containing "empty
string" if
one of its operands contains itself an empty string tsvector. This seems
perfectly
correct from the operator point of view... Should we change this
behaviour to
filter out empty strings ?

I also wonder if we should not also consider changing COPY FROM behaviour
on empty string lexemes.
Current version is just crashing on empty string lexemes. Should
we allow them anyway as COPY FROM input (it seems weird not to be able
to re-import dumped data) or "skipping them" just like array_to_tsvector()
does in the patched version (that may lead to database content changes) or
finally should we not change COPY behaviour ?

I admit this is a tricky bunch of questions I'm too rookie to answer.

[1] https://www.postgresql.org/docs/14/functions-textsearch.html

--
Jean-Christophe Arnu

Attachment Content-Type Size
empty_string_in_tsvector_v5.patch text/x-patch 7.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-09-30 18:36:41 002_types.pl fails on some timezones on windows
Previous Message Jaime Casanova 2021-09-30 18:31:16 Re: FETCH FIRST clause PERCENT option