Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From: "Dan O'Hara" <danarasoftware(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
Date: 2009-10-22 17:10:07
Message-ID: 557802370910221010k5669e9f0v559213d998e286d3@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Thanks for having a look at this bug.

According to section 12.8.2 of the PostgreSQL manual, ts_parse is
supposed to recognize several token types, one of which (#4) is an
email address.

The token types recognized by the default parser can be listed with this query:

SELECT * FROM ts_token_type('default');

The example in the bug I reported is a valid email address according to
the RFC, but the full text search parser in PostgreSQL doesn't recognize
it as one. This bug has a real impact on anybody using the ts functions
to locate email addresses, because addresses containing underscores are
returned truncated or not at all.
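
As a rough sketch (not part of the original report), joining the ts_parse
output against ts_token_type shows which type the default parser assigns to
each piece of the input, which makes it easier to see where the address gets
split. The sample address is the one from the report; the aliases p and tt
are only illustrative.

```sql
-- Sketch: list every token the default parser produces, with its type name.
-- tokid 4 is the 'email' type in the default parser's token type list.
SELECT p.tokid, tt.alias, p.token
FROM ts_parse('default', ' first_last@yahoo.com ') AS p
JOIN ts_token_type('default') AS tt USING (tokid);
```

Any row with tokid = 4 is what the parser considers an email address.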

Regards
Dan

On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware(at)gmail(dot)com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference:      5021
>> Logged by:          Dan O'Hara
>> Email address:      danarasoftware(at)gmail(dot)com
>> PostgreSQL version: 8.3.7
>> Operating system:   win32
>> Description:        ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com '   )
>> where tokid = 4
>>
>> ts_parse returns last@yahoo.com rather than first_last@yahoo.com.  It seems
>> that any text prior to the underscore is truncated.  If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2000@yahoo.com '   )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> Section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses.  If you do:
>
> select token from ts_parse('default', 'a_b');
>
> ...you get three tokens.  In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>

--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware(at)gmail(dot)com
613 288-8733
