Re: unexpected result from to_tsvector

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-14 13:22:04
Message-ID: CACACo5SMkOU3cYhKHiLcOCkKvkeh9MYqQTbA95apZ38iwPL5qQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 7, 2016 at 10:46 PM, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
wrote:

> Hello,
>
> On 07.03.2016 23:55, Dmitrii Golub wrote:
>
>>
>>
>> Hello,
>>
>> Should we add tests for this case?
>>
>
> I think we should. I have added tests for the teodor@123-stack.net and
> 123@stack.net email addresses.
>
>
>> 123_reg.ro is not a valid domain name because of the
>> "_" symbol
>>
>> https://tools.ietf.org/html/rfc1035 page 8.
>>
>> Dmitrii Golub
>>
>
> Thank you for the information. Fixed.

Hm... now that doesn't look all that consistent to me (after applying the
patch):

=# select ts_debug('simple', 'aaa@123-yyy.zzz');
ts_debug
---------------------------------------------------------------------------
(email,"Email address",aaa@123-yyy.zzz,{simple},simple,{aaa@123-yyy.zzz})
(1 row)

But:

=# select ts_debug('simple', 'aaa@123_yyy.zzz');
ts_debug
---------------------------------------------------------
(asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
(blank,"Space symbols",@,{},,)
(uint,"Unsigned integer",123,{simple},simple,{123})
(blank,"Space symbols",_,{},,)
(host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(5 rows)

One can also see that if we only keep the domain name, the result is
similar:

=# select ts_debug('simple', '123-yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
(1 row)

=# select ts_debug('simple', '123_yyy.zzz');
ts_debug
-----------------------------------------------------
(uint,"Unsigned integer",123,{simple},simple,{123})
(blank,"Space symbols",_,{},,)
(host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(3 rows)

But this only has to do with 123 being recognized as a number, not with
the underscore:

=# select ts_debug('simple', 'abc_yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
(1 row)

=# select ts_debug('simple', '1abc_yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
(1 row)

In fact, the 123-yyy.zzz domain is not valid according to RFC 1035 either
(a label must start with a letter, not a digit), but since we already allow
it, shouldn't we also recognize 123_yyy.zzz as a Host? And then why not
recognize aaa@123_yyy.zzz as an email address?

Another option is to prohibit underscore in recognized host names, but this
has more breakage potential IMO.
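
To compare all of the cases above side by side, something along these lines
should work. It is only a sketch: it assumes PostgreSQL 9.4 or later (for
WITH ORDINALITY), relies on the documented output columns of ts_debug, and
the names "input", "ord" and "token_types" are just illustrative.

```sql
-- One row per input string, listing the token types the default parser
-- assigns, in the order the tokens appear.
SELECT v.input,
       string_agg(d.alias, ' ' ORDER BY d.ord) AS token_types
FROM (VALUES ('123-yyy.zzz'),
             ('123_yyy.zzz'),
             ('abc_yyy.zzz'),
             ('1abc_yyy.zzz'),
             ('aaa@123-yyy.zzz'),
             ('aaa@123_yyy.zzz')) AS v(input),
     LATERAL ts_debug('simple', v.input) WITH ORDINALITY
       AS d(alias, description, token, dictionaries, dictionary, lexemes, ord)
GROUP BY v.input
ORDER BY v.input;
```

An input that is recognized as a whole shows a single token type (host or
email); one that gets split shows several.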

--
Alex
