From: | Euler Taveira de Oliveira <euler(at)timbira(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Dan O'Hara <danarasoftware(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores |
Date: | 2009-10-22 19:39:36 |
Message-ID: | 4AE0B4F8.1010604@timbira.com |
Lists: | pgsql-bugs pgsql-hackers |
Robert Haas wrote:
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
It is a bug. tsearch claims to identify token types, but it does not
correctly identify every valid e-mail address. As Dan stated, ts_parse() fails
to recognize such an address. For example, foo+bar@baz.com is a valid e-mail
address, but the function fails to report it as one.
It is not that simple to identify an e-mail address in a way that conforms to
the RFC. Because that code is a state machine, IMHO it decides too early (when
it finds _) that the string is not an e-mail address. AFAIR, that's not a
one-line fix.
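To make the "decides too early" point concrete, here is a toy sketch in Python. It is not the tsearch parser and does not reflect its actual states; the function names and the character sets are assumptions chosen only for illustration. The first function commits to a word boundary as soon as it sees _ or +, and the second defers the decision until it knows whether an @ and a domain follow.

```python
import string

WORD = set(string.ascii_letters + string.digits)
# Assumption: a simplified subset of the characters an RFC allows in a local part.
LOCAL_EXTRA = set("._+-")


def eager_tokens(text):
    """Toy model: ends the current token at the first character that is
    neither alphanumeric nor '.'/'@' following a started token."""
    tokens, cur = [], ""
    for ch in text:
        if ch in WORD or (cur and ch in ".@"):
            cur += ch
        else:
            if cur:
                tokens.append(cur)
                cur = ""
            tokens.append(ch)
    if cur:
        tokens.append(cur)
    return tokens


def deferred_tokens(text):
    """Toy model: scans the whole local-part candidate first and only then
    decides whether it is the start of an e-mail address."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        j = i
        while j < n and (text[j] in WORD or text[j] in LOCAL_EXTRA):
            j += 1
        # With the full candidate in hand, check for '@' plus a domain.
        if i < j < n and text[j] == "@":
            k = j + 1
            while k < n and (text[k] in WORD or text[k] in ".-"):
                k += 1
            if k > j + 1:
                tokens.append(text[i:k])
                i = k
                continue
        # Not an e-mail address: fall back to eager splitting for this run.
        end = max(j, i + 1)
        tokens.extend(eager_tokens(text[i:end]))
        i = end
    return tokens


print(eager_tokens("foo_bar@baz.com"))
print(deferred_tokens("foo_bar@baz.com"))
```

The deferred version costs a look-ahead and a fallback path, which hints at why a change like this touches more than one transition in a real state machine.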
euler=# select distinct token as email from ts_parse('default',
'foo.bar@baz.com');
email
─────────────────
foo.bar@baz.com
(1 row)
euler=# select distinct token as email from ts_parse('default',
'foo+bar@baz.com');
email
─────────────
foo
+
bar@baz.com
(3 rows)
euler=# select distinct token as email from ts_parse('default',
'foo_bar@baz.com');
email
─────────────
foo
bar@baz.com
_
(3 rows)
--
Euler Taveira de Oliveira
http://www.timbira.com/