Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Steve Atkins <steve(at)blighty(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores
Date: 2010-03-13 03:08:30
Message-ID: 201003130308.o2D38U204192@momjian.us
Lists: pgsql-bugs pgsql-hackers

Steve Atkins wrote:
>
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
>
> > Bruce Momjian <bruce(at)momjian(dot)us> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> >> all valid characters:
> >
> >> http://en.wikipedia.org/wiki/E-mail_address
> >
> >> * Uppercase and lowercase English letters (a-z, A-Z)
> >> * Digits 0 to 9
> >> * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >> * Character . (dot, period, full stop) provided that it is not the
> >> first or last character, and provided also that it does not appear two
> >> or more times consecutively.
> >
> > That's an awful lot of special characters. For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address. I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> >
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
>
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
>
> I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without being
> too prone to grab things that just look vaguely like email addresses. Obviously
> there are some things it'll match that aren't email addresses, and some
> email addresses it won't match, but for indexing it's been really pretty
> good when combined with a good regex for domain parts[1].
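
As a rough illustration of the pattern quoted above, here is a minimal
Python sketch. Several things in it are assumptions rather than anything
stated in the thread: the unmatched ")" is taken to close a capture group,
the \" and \\s escapes are treated as string-literal escaping, matching is
made case-insensitive, and DOMAIN is a hypothetical stand-in because the
domain regex from footnote [1] is not shown in this message.

```python
import re

# Local-part pattern from the quoted message, with the unmatched ")" assumed
# to close a capture group and the string-literal escapes (\" and \\s) undone.
LOCAL_PART = r'([a-z0-9_][^<"@\s]{0,80})@'

# Hypothetical domain pattern: footnote [1] is not shown in this message,
# so this simple dotted-label form is only a stand-in.
DOMAIN = r'([a-z0-9-]+(?:\.[a-z0-9-]+)+)'

# Case-insensitivity is an assumption; the message does not state any flags.
ADDRESS = re.compile(LOCAL_PART + DOMAIN, re.IGNORECASE)


def find_addresses(text):
    """Return (local_part, domain) pairs for address-like strings in text."""
    return [(m.group(1), m.group(2)) for m in ADDRESS.finditer(text)]


if __name__ == "__main__":
    sample = 'mail a-b-_c@yahoo.com or "Steve" <steve+list@example.org> today'
    for local, domain in find_addresses(sample):
        print(local, domain)
```

Because the first character must be a letter, digit, or underscore and the
rest may not contain <, ", @, or whitespace, the match stops at the
surrounding quoting and angle brackets instead of swallowing adjacent text.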

OK, based on your experience, I think we have gone far enough by
allowing underscores. I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate. Look what happens when an underscore follows a dash:

test=> select ts_parse('default', ' a-b_c@yahoo.com ' );
      ts_parse
---------------------
 (12," ")
 (4,a-b_c@yahoo.com)
 (12," ")
(3 rows)

test=> select ts_parse('default', ' a-b-_c@yahoo.com ' );
    ts_parse
-----------------
 (12," ")
 (16,a-b)
 (11,a)
 (12,-)
 (11,b)
 (12,-_)
 (4,c@yahoo.com)
 (12," ")
(8 rows)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

PG East: http://www.enterprisedb.com/community/nav-pg-east-2010.do

Attachment   Content-Type   Size
/rtmp/diff   text/x-diff    795 bytes
