Re: tsvector string representation and parsing

From: Johannes Graën <johannes(at)selfnet(dot)de>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: tsvector string representation and parsing
Date: 2022-02-23 23:41:01
Message-ID: 54593f15-6b3b-086e-ce1e-69f0af9d38a9@selfnet.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Thanks, Tom.

On 23/02/2022 23.30, Tom Lane wrote:
> =?UTF-8?Q?Johannes_Gra=c3=abn?= <johannes(at)selfnet(dot)de> writes:
>> This is a minimal example that goes wrong (but shouldn't IMHO:
>
>>> SELECT format('%L:1', '\:')::tsvector
>
> format(%L) is designed to produce a SQL literal, which does not
> have the same requirements as a tsvector element ... yeah, they're
> close, but not close enough. In this particular example,
> what you get is
>
> =# SELECT format('%L:1', '\:');
> format
> ----------
> E'\\:':1
> (1 row)
>
> because format() adds an E prefix for the avoidance of doubt about
> what to do with the backslashes. tsvector doesn't like that.

I see.

> I don't think we have any prefab function that does what you're
> looking for here, and TBH I'm not sure I see the point of it.
> Pretty much any tsvector you'd be dealing with in practice is
> going to have come from one of the to_tsvector family of
> functions, and those tend to drop punctuation.

Applied to normal texts in a standard language, I believe that's true.
I'd like to use FTS in a setting where I need to control the positional
attributes as I'm specifying more than one lexeme at the same position
(which works great btw). That's why I can't use to_tsvector() but need
to cast it from a string to tsvector.

>> My understanding is that everything inside
>> single quotes is taken as a lexeme. From the documentation:
>
>>> To represent lexemes containing whitespace or punctuation, surround them with quotes
>
> See also the next bit about having to double quotes and backslashes
> within those quotes. So what you'd actually need is
>
> =# select $$'\\:':1$$::tsvector;
> tsvector
> ----------
> '\\:':1
> (1 row)
>
> If you write just one backslash, it has the effect of quoting
> the next character, which in this case doesn't need quoting.

Before using format(), I tried just generating those strings by doubling
any single quote or backslash and enclosing the whole string in single
quotes, but that didn't seem a safe way, though it works in principle:

> SELECT format($$'%s':%s$$, replace(replace(s, $$'$$, $$''$$), '\', '\\'), i)::tsvector
> FROM (SELECT $$\:$$ s, 1 i) x;

Would that be the way to go if to_tsvector is not an option?

Regards
Johannes

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Zheng Li 2022-02-24 03:10:32 Re: Support logical replication of DDLs
Previous Message Tom Lane 2022-02-23 22:30:11 Re: tsvector string representation and parsing