Re: Bug with Tsearch and tsvector

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Donald Fraser" <postgres(at)kiwi-fraser(dot)net>
Cc:	"[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: Bug with Tsearch and tsvector
Date:	2010-04-26 14:55:16
Message-ID:	18042.1272293716@sss.pgh.pa.us
Lists:	pgsql-bugs

"Donald Fraser" <postgres(at)kiwi-fraser(dot)net> writes:
> Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type.

ts_debug shows that it's being parsed like this:

alias     | description     | token                                  | dictionaries   | dictionary   | lexemes
----------+-----------------+----------------------------------------+----------------+--------------+-----------------------------------------
tag       | XML tag         | <span lang="EN-GB">                    | {}             |              |
protocol  | Protocol head   | http://                                | {}             |              |
url       | URL             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple       | {www.harewoodsolutions.co.uk/press.aspx}
host      | Host            | www.harewoodsolutions.co.uk            | {simple}       | simple       | {www.harewoodsolutions.co.uk}
url_path  | URL path        | /press.aspx</span><span                | {simple}       | simple       | {/press.aspx</span><span}
blank     | Space symbols   |                                        | {}             |              |
asciiword | Word, all ASCII | lang                                   | {english_stem} | english_stem | {lang}
... etc ...

i.e., the critical point seems to be that url_path is willing to soak up a
string containing "<" and ">", so the span tags don't get recognized as
separate lexemes. While that's "obviously" the wrong thing in this
particular example, I'm not sure if it's the wrong thing in general.
Can anyone comment on the frequency of usage of those two symbols in
URLs?

In any case it's weird that the URL lexeme doesn't span the same text
as the url_path one, but I'm not sure which one we should consider
wrong.

regards, tom lane

In response to

Bug with Tsearch and tsvector at 2010-04-26 13:51:35 from Donald Fraser

Responses

Re: Bug with Tsearch and tsvector at 2010-04-26 18:19:52 from Kevin Grittner
Re: Bug with Tsearch and tsvector at 2010-04-26 19:23:53 from Tom Lane