Re: unexpected result from to_tsvector

From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-16 11:32:37
Message-ID: 56E94455.6050508@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I found the discussion about allowing an underscore in emails
http://www.postgresql.org/message-id/200908281359.n7SDxfaf044556@wwwmaster.postgresql.org

That bug report is about recognizing an underscore in the local part of
an email. And is not about recognizing an underscore in a domain name.
But that patch allows an underscore in recognized host names also.

I am not good in RFC, so I put excerpt from Wikipedia
https://en.wikipedia.org/wiki/Email_address:

> The local-part of the email address may use any of these ASCII characters:
>
> Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122)
> Digits 0 to 9 (ASCII: 48–57)
> These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 47, 61, 63, 94–96, 123–126)
> Character . (dot, period, full stop), ASCII 46, provided that it is not the first or last character, and provided also that it does not appear consecutively (e.g. John(dot)(dot)Doe(at)example(dot)com is not allowed).
> Other special characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash). These characters are:
> Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93)
> Comments are allowed with parentheses at either end of the local part; e.g. john.smith(comment)@example.com and (comment)john(dot)smith(at)example(dot)com are both equivalent to john(dot)smith(at)example(dot)com(dot)

and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names

> The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner),the digits '0' through '9', and the hyphen ('-'). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or white space are permitted.

Hence the valid emails is (I might be wrong):

123-s(at)sample(dot)com
123_s(at)sample(dot)com
123(at)123-sample(dot)com
123(at)123sample(dot)com

The attached patch allow them to be recognized as a email. But this
patch does not prohibit underscore in recognized host names.

As a result this patch gives the following results with underscores:

=# select * from ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+-----------------+--------------+------------+-------------------
email | Email address | aaa(at)123_yyy(dot)zzz | {simple} | simple |
{aaa(at)123_yyy(dot)zzz}
(1 row)

=# select * from ts_debug('simple', '123_yyy.zzz');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------+-------------+--------------+------------+---------------
host | Host | 123_yyy.zzz | {simple} | simple |
{123_yyy.zzz}
(1 row)

On 14.03.2016 17:45, Artur Zakirov wrote:
> On 14.03.2016 16:22, Shulgin, Oleksandr wrote:
>>
>> Hm... now that doesn't look all that consistent to me (after applying
>> the patch):
>>
>> =# select ts_debug('simple', 'aaa(at)123-yyy(dot)zzz');
>> ts_debug
>> ---------------------------------------------------------------------------
>>
>> (email,"Email
>> address",aaa(at)123-yyy(dot)zzz,{simple},simple,{aaa(at)123-yyy(dot)zzz})
>> (1 row)
>>
>> But:
>>
>> =# select ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
>> ts_debug
>> ---------------------------------------------------------
>> (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
>> (blank,"Space symbols",@,{},,)
>> (uint,"Unsigned integer",123,{simple},simple,{123})
>> (blank,"Space symbols",_,{},,)
>> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (5 rows)
>>
>> One can also see that if we only keep the domain name, the result is
>> similar:
>>
>> =# select ts_debug('simple', '123-yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '123_yyy.zzz');
>> ts_debug
>> -----------------------------------------------------
>> (uint,"Unsigned integer",123,{simple},simple,{123})
>> (blank,"Space symbols",_,{},,)
>> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (3 rows)
>>
>> But, this only has to do with 123 being recognized as a number, not with
>> the underscore:
>>
>> =# select ts_debug('simple', 'abc_yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '1abc_yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
>> (1 row)
>>
>> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
>> (subdomain can't start with a digit), but since we already allow it,
>> should we not allow 123_yyy.zzz to be recognized as a Host? Then why
>> not recognize aaa(at)123_yyy(dot)zzz as an email address?
>>
>> Another option is to prohibit underscore in recognized host names, but
>> this has more breakage potential IMO.
>>
>> --
>> Alex
>>
>
> It seems reasonable to me. I like more first option. But I am not
> confident that we should allow 123_yyy.zzz to be recognized as a Host.
>
> By the way, in this question http://webmasters.stackexchange.com/a/775
> you can see examples of domain names with numbers (but not subdomains).
>
> If there are not objections from others, I will send a new patch today
> later or tomorrow with 123_yyy.zzz recognizing.
>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
number_in_email_v3.patch text/x-patch 14.8 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Grzegorz Sampolski 2016-03-16 11:46:54 Re: pam auth - add rhost item
Previous Message David Rowley 2016-03-16 11:08:49 Re: Combining Aggregates