inconsistency in full-text search tokenization

From: Valentin Gatien-Baron <valentin(dot)gatienbaron(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: inconsistency in full-text search tokenization
Date: 2021-05-10 02:22:13
Message-ID: CA+0DEqhdmhie8MMmodE3qNogu0mbrTA+i-vTdjznEZ5fX2CbbQ@mail.gmail.com
Lists: pgsql-bugs

Hello,

I observe the following:

select to_tsvector('simple', 'bla bla ./aaa bla bla'),
       phraseto_tsquery('simple', './aaa'),
       to_tsvector('simple', 'bla bla ./aaa bla bla') @@
         phraseto_tsquery('simple', './aaa') as matches;
      to_tsvector       | phraseto_tsquery | matches
------------------------+------------------+---------
 '/aaa':3 'bla':1,2,4,5 | './aaa'          | f
(1 row)

I expected that any space-separated piece of the input text could be
selected, turned into a query, and would then match the original text.
That is not the case here: as you can see, './aaa' is tokenized as
'./aaa' at the start of the text but as '/aaa' after a space.
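
That expectation can be written as a query. This is only a minimal
sketch, assuming words are separated by runs of spaces; the names doc
and word are chosen for illustration:

```sql
-- Split the text on spaces and check that every word, turned into a
-- phrase query on its own, matches the tsvector of the whole text.
select word,
       to_tsvector('simple', doc) @@ phraseto_tsquery('simple', word) as matches
from (values ('bla bla ./aaa bla bla')) as d(doc),
     regexp_split_to_table(doc, ' +') as word;
```

If the expectation held, every row would report a match; './aaa' is the
word reported above as not matching.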

I looked for more such cases, and my limited testing found this
problem only for text that begins with '.' or '~':

select
  quote_literal(text1) as qtext1,
  quote_literal(text2) as qtext2,
  ts_vector1,
  ts_vector2,
  array(select alias || ':' || quote_literal(token)
        from ts_debug('simple', text1)) as ts_debug1,
  array(select alias || ':' || quote_literal(token)
        from ts_debug('simple', text2)) as ts_debug2,
  ts_vector1 @@ phraseto_tsquery('simple', text2) as phraseto_match
from
  unnest(array['', ')']) as zz0(prefix),
  -- char1..char3: code points 1 to 128, excluding digits 1-9, uppercase
  -- letters and letters b-z, so '0' and 'a' stand in for digits and letters
  (select chr(a) as char1 from generate_series(1,128) as s1(a)
   where (a not between 49 and 57) and (a not between 65 and 90)
     and (a not between 98 and 122)) as zz1,
  (select chr(a) as char2 from generate_series(1,128) as s1(a)
   where (a not between 49 and 57) and (a not between 65 and 90)
     and (a not between 98 and 122)) as zz2,
  (select chr(a) as char3 from generate_series(1,128) as s1(a)
   where (a not between 49 and 57) and (a not between 65 and 90)
     and (a not between 98 and 122)) as zz3,
  -- text2 is text1 with a space inserted before the three characters;
  -- text11/text22 are the same pair built from the first two characters
  lateral (select prefix || char1 || char2 || char3 as text1,
                  prefix || ' ' || char1 || char2 || char3 as text2,
                  prefix || char1 || char2 || ' ' as text11,
                  prefix || ' ' || char1 || char2 || ' ' as text22) zz4,
  lateral (select to_tsvector('simple', text1) as ts_vector1,
                  to_tsvector('simple', text2) as ts_vector2,
                  to_tsvector('simple', text11) as ts_vector11,
                  to_tsvector('simple', text22) as ts_vector22) as zz8
where
  -- keep texts whose result changes when the space is added, skipping
  -- those already explained by their first two characters
  ts_vector1 != ts_vector2
  and (ts_vector11 = ts_vector22 or char3 = ' ')
;
 qtext1 | qtext2 | ts_vector1 | ts_vector2 |        ts_debug1        |                ts_debug2                 | phraseto_match
--------+--------+------------+------------+-------------------------+------------------------------------------+----------------
 '.. '  | ' .. ' | '..':1     |            | {file:'..',"blank:' '"} | {"blank:' .. '"}                         | f
 '~0 '  | ' ~0 ' | '~0':1     | '0':1      | {file:'~0',"blank:' '"} | {"blank:' ~'",uint:'0',"blank:' '"}      | f
 '~_ '  | ' ~_ ' | '~_':1     |            | {file:'~_',"blank:' '"} | {"blank:' ~_ '"}                         | f
 '~a '  | ' ~a ' | '~a':1     | 'a':1      | {file:'~a',"blank:' '"} | {"blank:' ~'",asciiword:'a',"blank:' '"} | f
 './0'  | ' ./0' | './0':1    | '/0':1     | {file:'./0'}            | {"blank:' .'",file:'/0'}                 | f
 '~/0'  | ' ~/0' | '~/0':1    | '/0':1     | {file:'~/0'}            | {"blank:' ~'",file:'/0'}                 | f
 './_'  | ' ./_' | './_':1    | '/_':1     | {file:'./_'}            | {"blank:' .'",file:'/_'}                 | f
 '~/_'  | ' ~/_' | '~/_':1    | '/_':1     | {file:'~/_'}            | {"blank:' ~'",file:'/_'}                 | f
 './a'  | ' ./a' | './a':1    | '/a':1     | {file:'./a'}            | {"blank:' .'",file:'/a'}                 | f
 '~/a'  | ' ~/a' | '~/a':1    | '/a':1     | {file:'~/a'}            | {"blank:' ~'",file:'/a'}                 | f
(10 rows)
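
The same kind of comparison can be made for the token from the first
example by running ts_debug on it at the start of the text and after a
space. A minimal sketch, assuming the 'simple' configuration used
above; the placement column is only a label added for illustration:

```sql
-- Show how the parser classifies './aaa' in each position.
select 'start of text' as placement, alias, token
from ts_debug('simple', './aaa')
union all
select 'after a space' as placement, alias, token
from ts_debug('simple', 'bla ./aaa');
```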

select version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)
