Text search parser's treatment of URLs and emails

From: Thom Brown <thom(at)linux(dot)com>
To: PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Text search parser's treatment of URLs and emails
Date: 2010-09-08 20:48:23
Message-ID: AANLkTikf=K=pen6M4bWKkt1QOzh8mbrEXKOYJ=H0qCMh@mail.gmail.com
Lists: pgsql-general

Hi,

I noticed that if I run this:

SELECT alias, description, token FROM
ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');

I get:

  alias   |  description  |                                    token
----------+---------------+------------------------------------------------------------------------------
 protocol | Protocol head | http://
 url      | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host     | Host          | www.postgresql.org:2345
 url_path | URL path      | /directory/page.html?version=9.1&build=alpha1#summary
(4 rows)

It could be me being picky, but I don't regard parameters or page
fragments as part of the URL path. Ideally, I'd sort of expect:

    alias     |  description  |                                    token
--------------+---------------+------------------------------------------------------------------------------
 protocol     | Protocol head | http://
 url          | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host         | Host          | www.postgresql.org
 port         | Port          | 2345
 url_path     | URL path      | /directory/page.html
 query_string | Query string  | version=9.1&build=alpha1
 fragment     | Page fragment | summary
(7 rows)

... of course that's if there was support for query strings and page
fragments, which there isn't. But if changes were made to support my
definition of a URL path, they'd have to be considered breaking
changes.
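
For comparison, here is a minimal Python sketch of the breakdown I have in mind. It uses the standard library's urllib.parse.urlsplit, not the PostgreSQL parser, and the alias names on the left are only my suggested labels (port, query_string and fragment do not exist as token types today).

```python
from urllib.parse import urlsplit

url = "http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary"
parts = urlsplit(url)

# Map the suggested (hypothetical) token aliases to the pieces urlsplit returns.
components = {
    "protocol": parts.scheme,       # scheme only, without the "://" separator
    "host": parts.hostname,         # host name without the port
    "port": parts.port,             # integer port, or None if absent
    "url_path": parts.path,         # path only, no query string or fragment
    "query_string": parts.query,    # text after "?" and before "#"
    "fragment": parts.fragment,     # text after "#"
}

for alias, token in components.items():
    print(f"{alias:<12} | {token}")
```

Here the path stops at the "?", which is the sense of "URL path" I would expect from a token called url_path.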

But my main gripe is with the name "url_path".

Also:

SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');

Yields:

   alias   |   description   |       token
-----------+-----------------+--------------------
 asciiword | Word, all ASCII | myname
 blank     | Space symbols   | +
 email     | Email address   | priority@gmail.com
(3 rows)

The entire string I entered is a valid email address, and the style isn't
totally uncommon. Shouldn't such email address styles be taken into
account? The example above misidentifies the email address, since the
real destination address would most likely be myname@gmail.com.
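
To illustrate what I mean by the real destination, here is a small Python sketch that splits such an address into mailbox, tag and domain. The helper name split_plus_address is made up for this example, and treating "+" as a tag separator is a convention some mail providers follow rather than something every mail server guarantees.

```python
def split_plus_address(address: str) -> tuple[str, str, str]:
    """Split 'mailbox+tag@domain' into (mailbox, tag, domain).

    Hypothetical helper: assumes the provider treats '+' in the local
    part as a tag separator, which is a convention and not universal.
    """
    # Split on the last "@" so the domain is everything after it.
    local, _, domain = address.rpartition("@")
    # Split the local part on the first "+"; tag is "" when there is none.
    mailbox, _, tag = local.partition("+")
    return mailbox, tag, domain


mailbox, tag, domain = split_plus_address("myname+priority@gmail.com")
print(f"full address: {mailbox}+{tag}@{domain}")
print(f"likely destination: {mailbox}@{domain}")
```

The parser currently reports only the part after the "+" as the email token, which matches neither the full address nor the likely destination.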

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
