full-text search question

From: Sabbiolina <sabbiolina(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: full-text search question
Date: 2008-06-18 12:49:48
Message-ID: 269b27950806180549k323833c7n38a0d9f434542bf2@mail.gmail.com
Lists: pgsql-admin

Hello,

I've seen that the default parser for the full-text search can identify
e-mail addresses, hosts, URLs… but I have a serious problem with it:

Suppose I index the following sentence: "the search engine I use the most is
www.google.com"

If I then search for "google", no result is found.

If I search for "www.google.com" instead, the record is found correctly.

I guess the reason is that the parser treats www.google.com as a single
token (of type 'host'), but as anyone can easily see, the result is a major
problem. The word "google" really does appear in the above sentence, and the
end user of the database understandably asks me, "why does your FTS not find
that record when I can clearly see that my search term is there?"

Reading the docs, I've seen that the parser can produce multiple tokens for
the same word (for example, the word "make-up" produces 4 tokens: make-up,
make, -, up)… so why not do the same with URLs and e-mail addresses? Why is
www.google.com treated only as a single word? Why not produce multiple
tokens such as www.google.com, www, ., google, ., com? (Obviously www and .
can be nulled or stopworded.)
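
To make the proposal concrete, here is a minimal sketch in Python of the
decomposition I have in mind. It is only an illustration of the idea, not the
PostgreSQL parser or any existing API: the names host_subtokens, index_terms,
HOST_RE and STOP_PARTS are hypothetical, and the host pattern is deliberately
simplistic.

```python
import re

# Hypothetical stop list: host parts that would be "nulled or stopworded".
STOP_PARTS = {"www"}

# Simplistic host pattern (an assumption): labels separated by dots.
HOST_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


def host_subtokens(token, stop_parts=STOP_PARTS):
    """Return the whole host followed by its dot-separated parts.

    Non-host tokens are returned unchanged as a one-element list.
    The dots themselves are dropped by the split.
    """
    token = token.lower()
    if not HOST_RE.match(token):
        return [token]
    parts = [p for p in token.split(".") if p not in stop_parts]
    return [token] + parts


def index_terms(text):
    """Build the set of searchable terms for a piece of text."""
    terms = set()
    for word in text.split():
        # Trim surrounding quotes and punctuation before classifying the word.
        terms.update(host_subtokens(word.strip('",;:!?()')))
    return terms


terms = index_terms("the search engine I use the most is www.google.com")
print("google" in terms)          # True
print("www.google.com" in terms)  # True
```

With this kind of decomposition both the full host and the word "google"
would match the record, which is the behaviour the end user expects.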

Does anybody know of a better parser for Postgres? Or at least a trick to
make its FTS find the record above by searching only a part of the URL?
