Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query

From: Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: nikolajs(dot)arhipovs(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15488: Unexpected behaviour of to_tsverctor and ts_query
Date: 2018-11-07 11:20:55
Message-ID: dc4c5f92-0eeb-2d5e-32af-653b80816c31@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello,

On 06.11.2018 15:51, PG Bug reporting form wrote:
> I am trying to do prefix search of emails using full text search. Emails
> sometimes contain dot symbols.
> Now, select to_tsquery('simple', 'thing.a:*'); returns a single term as
> expected but
> select to_tsquery('simple', 'the.thing.a:*') suddenly tokenizes my search
> expression into 'the.thing':* & 'a':*. At the same time select
> to_tsquery('simple', 'the.thing.aa:*'); tokenizes as expected into a single
> term.
> Also select to_tsquery('simple', 'the(dot)thing(at)gmail(dot)com'); returns a single
> term as expected, but select to_tsquery('simple', 'the(dot)thing(at)gma:*'); splits
> on @ char and returns two tokens.
> This behavior is not covered in the official documentation.

Yeah, it seems strange, but full text search parser thinks that each
label of a host should have at least two characters.
In first example it thinks that it is a file name:

=# select * from ts_debug('simple', 'thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+---------+--------------+------------+-----------
file | File or path name | thing.a | {simple} | simple |
{thing.a}
blank | Space symbols | :* | {} | (null) | (null)

Next, it thinks that only first two label are only part of host name:

=# select * from ts_debug('simple', 'the.thing.a:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | . | {} | (null) |
(null)
asciiword | Word, all ASCII | a | {simple} | simple | {a}
blank | Space symbols | :* | {} | (null) |
(null)

And next is host name:

=# select * from ts_debug('simple', 'the.thing.aa:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+--------------+--------------+------------+----------------
host | Host | the.thing.aa | {simple} | simple |
{the.thing.aa}
blank | Space symbols | :* | {} | (null) | (null)

It seems that the RFC doesn't have such restriction, if I'm not
mistaken. See:
https://tools.ietf.org/html/rfc1034#section-3.1

And same for emails:

=# select * from ts_debug('simple', 'the(dot)thing(at)gmail(dot)com');
alias | description | token | dictionaries |
dictionary | lexemes
-------+---------------+---------------------+--------------+------------+-----------------------
email | Email address | the(dot)thing(at)gmail(dot)com | {simple} | simple
| {the(dot)thing(at)gmail(dot)com}

=# select * from ts_debug('simple', 'the(dot)thing(at)gmail(dot)c:*');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) | (null)
file | File or path name | gmail.c | {simple} | simple |
{gmail.c}
blank | Space symbols | :* | {} | (null) | (null)

=# select * from ts_debug('simple', 'the(dot)thing(at)gma:*');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+-----------------+-----------+--------------+------------+-------------
host | Host | the.thing | {simple} | simple |
{the.thing}
blank | Space symbols | @ | {} | (null) |
(null)
asciiword | Word, all ASCII | gma | {simple} | simple |
{gma}
blank | Space symbols | :* | {} | (null) |
(null)

It seems that it can be easily fixed, but I'm not sure that it won't
break something.
If you have such host names only in prefix queries (not in documents
itself) I may suggest the following workaround:

=# select to_tsvector('simple', 'the(dot)thing(at)gmail(dot)com') @@
'the(dot)thing(at)gma:*'::tsquery;
?column?
----------
t

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Arthur Zakirov 2018-11-07 11:35:22 Re: BUG #15491: index on function not being used for full text search when querying through a view
Previous Message Etsuro Fujita 2018-11-07 10:03:32 Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT