Empty string in lexeme for tsvector

From: Jean-Christophe Arnu <jcarnu(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Empty string in lexeme for tsvector
Date: 2021-09-24 08:46:49
Message-ID: CAHZmTm1YVndPgUVRoag2WL0w900XcoiivDDj-gTTYBsG25c65A@mail.gmail.com
Lists: pgsql-hackers

Hello Hackers,

This is my second proposal for a patch, so I hope not to make "rookie"
mistakes.

This proposed patch is based on a simple use case:

If one creates a table this way:
CREATE TABLE tst_table AS (SELECT array_to_tsvector(ARRAY['','abc','def']));

the table content is:
array_to_tsvector
-------------------
'' 'abc' 'def'
(1 row)

First, it may seem strange to have an empty string as a tsvector lexeme,
but let's move on to the real point.

Once the table is dumped, the dump contains that empty-string lexeme, which
cannot be restored.
tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.

Thus it is not possible for data to be restored this way.
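
To illustrate why the dumped text cannot be read back, here is a small
standalone C sketch of the quoting rule as I understand it: inside a quoted
lexeme, two consecutive single quotes stand for one literal quote, so ''
followed by a space is an opening quote and a closing quote with nothing in
between. This is not the code from tsvector_parser.c; the function name
parse_quoted_lexeme and its return convention are made up for the example.

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT PostgreSQL code.
 *
 * Reads one single-quoted lexeme starting at in[0] == '\''.
 * A doubled quote ('') inside the lexeme is one literal quote.
 * The lexeme is copied into out (NUL-terminated, at most outsz bytes).
 *
 * Returns the number of input bytes consumed, or -1 on error.
 * An empty lexeme is treated as an error.
 */
int
parse_quoted_lexeme(const char *in, char *out, size_t outsz)
{
    size_t      i = 1;          /* position after the opening quote */
    size_t      len = 0;        /* bytes written to out so far */

    if (in == NULL || out == NULL || outsz == 0 || in[0] != '\'')
        return -1;

    for (;;)
    {
        char        c = in[i];

        if (c == '\0')
            return -1;          /* unterminated quoted lexeme */

        if (c == '\'')
        {
            if (in[i + 1] != '\'')
            {
                i++;            /* closing quote: lexeme ends here */
                break;
            }
            i += 2;             /* doubled quote: keep one literal quote */
        }
        else
            i++;

        if (len + 1 >= outsz)
            return -1;          /* output buffer too small */
        out[len++] = c;
    }

    out[len] = '\0';

    if (len == 0)
        return -1;              /* empty lexeme: rejected */

    return (int) i;
}
```

With this sketch, the input "'' 'abc' 'def'" fails on its first lexeme,
while "'abc' 'def'" yields abc and consumes 5 bytes.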

There are two ways to look at this, depending on whether empty strings are
acceptable as lexemes:
* If they are, empty strings should be correctly parsed by tsvector_parser.
* If they are not, one should prevent empty strings from being stored into
tsvectors.

Since an empty string does not seem to be a valid lexeme, I chose to change
some of the functions dealing with tsvector so that they check whether their
string arguments are empty. This might be the wrong path, as I'm not familiar
with tsvector usage... (OTOH, I can provide a fix patch for tsvector_parser()
if I'm wrong).

This involved changing the way functions like array_to_tsvector(),
ts_delete() and setweight() behave. As with NULL values, empty string values
are now checked for, and an error is raised when one is found. It appears to
me that ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matches this behaviour,
but I may be wrong.
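
As a rough illustration of the kind of check described here, a standalone C
sketch follows. It is not the patch itself: the names LexemeCheck and
check_lexemes are hypothetical, and in the backend the failure would be
reported with ereport() rather than returned as a status code.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch only. */
typedef enum
{
    LEXEME_OK = 0,
    LEXEME_NULL,                /* a NULL entry was found */
    LEXEME_EMPTY                /* a zero-length string was found */
} LexemeCheck;

/*
 * Hypothetical helper, NOT PostgreSQL code.
 *
 * Scans n candidate lexemes and stops at the first one that is NULL or
 * empty. On failure, *bad_index (if not NULL) receives the position of the
 * offending entry. In the real functions this is where an error would be
 * raised (the proposal uses ERRCODE_ZERO_LENGTH_CHARACTER_STRING for the
 * empty-string case).
 */
LexemeCheck
check_lexemes(const char *const *items, size_t n, size_t *bad_index)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (items[i] == NULL || items[i][0] == '\0')
        {
            if (bad_index != NULL)
                *bad_index = i;
            return items[i] == NULL ? LEXEME_NULL : LEXEME_EMPTY;
        }
    }

    return LEXEME_OK;
}
```

Applied to the array from the use case above, {"", "abc", "def"}, the sketch
reports LEXEME_EMPTY at index 0, so the value would be refused before it is
ever stored.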

Since this patch changes the way these functions behave, please consider it
a first attempt at working around a strange situation we noticed in a
specific use case.

The attached patch adds those empty-string checks to the functions mentioned
above. It comes with updated regression tests and applies to head/master and
14_RC1.

Comments are more than welcome!
Thank you,

--
Jean-Christophe Arnu

Attachment Content-Type Size
empty_string_in_tsvector_v0.patch text/x-patch 5.6 KB