From: | "Dan O'Hara" <danarasoftware(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores |
Date: | 2009-10-22 17:10:07 |
Message-ID: | 557802370910221010k5669e9f0v559213d998e286d3@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
Thanks for having a look at this bug.
According to section 12.8.2 of the postgres manual, ts_parse is
supposed to recognize different types of data, one of which (#4) is an
email address.
The list of token types recognized by the default parser can be selected via this query:
SELECT * FROM ts_token_type('default');
The example in the bug I reported is a valid email address according to
the RFC, but the full text search in postgres doesn't recognize it as
such. This bug will have a real impact on anybody using the ts
functions to locate email addresses, as only some of them are found by
the query.
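To see which token type the parser assigns to each piece of the input, the two functions can be joined on tokid. This is only a minimal sketch: the column aliases and the sample address are for illustration, and the output will depend on the server version.

```sql
-- Show every token the default parser produces, with its type name.
-- tokid 4 corresponds to the "email" token type in ts_token_type('default').
SELECT p.tokid,
       t.alias AS token_type,
       p.token
FROM ts_parse('default', 'first_last@yahoo.com') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

Filtering on tokid = 4, as in my original report, selects by token type, not by the position of the token in the output.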
Regards
Dan
On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware(at)gmail(dot)com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference: 5021
>> Logged by: Dan O'Hara
>> Email address: danarasoftware(at)gmail(dot)com
>> PostgreSQL version: 8.3.7
>> Operating system: win32
>> Description: ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com ' )
>> where tokid = 4
>>
>> ts_parse returns last@yahoo.com rather than first_last@yahoo.com. It seems
>> that any text prior to the underscore is truncated. If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2000@yahoo.com ' )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> Section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses. If you do:
>
> select token from ts_parse('default', 'a_b');
>
> ...you get three tokens. In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>
--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware(at)gmail(dot)com
613 288-8733