Re: Empty string in lexeme for tsvector

From: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To: Jean-Christophe Arnu <jcarnu(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Empty string in lexeme for tsvector
Date: 2021-09-24 11:03:38
Message-ID: CAEudQAo+M-_N6KGHjaOqJcNgfCQPLi_trFb1Qy0ySkbjqzvL3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Em sex., 24 de set. de 2021 às 05:47, Jean-Christophe Arnu <jcarnu(at)gmail(dot)com>
escreveu:

> Hello Hackers,
>
> This is my second proposal for a patch, so I hope not to make "rookie"
> mistakes.
>
> This proposal patch is based on a simple use case :
>
> If one creates a table this way
> CREATE TABLE tst_table AS (SELECT
> array_to_tsvector(ARRAY['','abc','def']));
>
> the table content is :
> array_to_tsvector
> -------------------
> '' 'abc' 'def'
> (1 row)
>
> First it can be strange to have an empty string for tsvector lexeme but
> anyway, keep going to the real point.
>
> Once dumped, this table dump contains that empty string that can't be
> restored.
> tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.
>
> Thus it is not possible for data to be restored this way.
>
> There are two ways to consider this : is it alright to have empty strings
> in lexemes ?
> * If so, empty strings should be correctly parsed by tsvector_parser.
> * If not, one should prevent empty strings from being stored into
> tsvectors.
>
> Since "empty strings" seems not to be a valid lexeme, I undertook to
> change some functions dealing with tsvector to check whether string
> arguments are empty. This might be the wrong path as I'm not familiar with
> tsvector usage... (OTOH, I can provide a fix patch for tsvector_parser() if
> I'm wrong).
>
> This involved changing the way functions like array_to_tsvector(),
> ts_delete() and setweight() behave. As for NULL values, empty string values
> are checked and an error is raised for such a value. It appears to me that
> ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matched this behaviour but I
> may be wrong.
>
> Since this patch changes the way functions behave, consider it as a simple
> try to overcome a strange situation we've noticed on a specific use case.
>
> This included patch manages that checks for empty strings on the pointed
> out functions. It comes with modified regression tests. Patch applies along
> head/master and 14_RC1.
>
> Comments are more than welcome!
>
1. Would be better to add this test-and-error before tsvector_bsearch call.

+ if (lex_len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

If lex_len is equal to zero, better get out soon.

2. The second test-and-error can use lex_len, just like the first test,
I don't see the point in recalculating the size of lex_len if that's
already done.

+ if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

regards,
Ranier Vilela

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Antonin Houska 2021-09-24 11:16:18 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Amit Kapila 2021-09-24 11:01:00 Re: Skipping logical replication transactions on subscriber side