Re: Bug with Tsearch and tsvector

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: "Donald Fraser" <postgres(at)kiwi-fraser(dot)net>, "[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug with Tsearch and tsvector
Date: 2010-04-26 19:23:53
Message-ID: 11841.1272309833@sss.pgh.pa.us
Lists: pgsql-bugs

I wrote:
> "Donald Fraser" <postgres(at)kiwi-fraser(dot)net> writes:
>> Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type.

> ts_debug shows that it's being parsed like this:

>       alias      |           description           |                 token                  |  dictionaries  |  dictionary  |                 lexemes
> -----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
>  tag             | XML tag                         | <span lang="EN-GB">                    | {}             |              |
>  protocol        | Protocol head                   | http://                                | {}             |              |
>  url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple       | {www.harewoodsolutions.co.uk/press.aspx}
>  host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple       | {www.harewoodsolutions.co.uk}
>  url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple       | {/press.aspx</span><span}
>  blank           | Space symbols                   |                                        | {}             |              |
>  asciiword       | Word, all ASCII                 | lang                                   | {english_stem} | english_stem | {lang}
> ... etc ...

> i.e., the critical point seems to be that url_path is willing to soak up a
> string containing "<" and ">", so the span tags don't get recognized as
> separate lexemes. While that's "obviously" the wrong thing in this
> particular example, I'm not sure if it's the wrong thing in general.
> Can anyone comment on the frequency of usage of those two symbols in
> URLs?

> In any case it's weird that the URL lexeme doesn't span the same text
> as the url_path one, but I'm not sure which one we should consider
> wrong.

I poked at this a bit. The reason for the inconsistency between the url
and url_path lexemes is that the InURLPathStart state transitions
directly to InURLPath, which is *not* consistent with what happens while
parsing the URL as a whole: p_isURLPath() starts the sub-parser in
InFileFirst state. The attached proposed patch rectifies that by
transitioning to InFileFirst state instead. A possible objection to
this fix is that you may get either a "file" or a "url_path" component
lexeme, where before you always got "url_path". I'm not sure if that's
something to worry about or not; I'd tend to think there's nothing much
wrong with it.
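
To make the start-state mismatch concrete, here is a toy model in Python. It is only a sketch, not the parser's real state tables: the FILE_CHARS character class, the scan() helper, and the reduction to three states are all assumptions made for illustration. It only shows how the same text yields different tokens depending on which state the scan starts in.

```python
import string

# Assumed character class for the file-name states; the real parser's
# tables are more detailed than this.
FILE_CHARS = set(string.ascii_letters + string.digits + "._-/")


def scan(text, start_state):
    """Return the prefix of text consumed when scanning from start_state."""
    state = start_state
    consumed = 0
    for ch in text:
        if state in ("InFileFirst", "InFile"):
            # File states stop at anything outside the assumed class.
            if ch not in FILE_CHARS:
                break
            state = "InFile"
        elif state == "InURLPath":
            # The URL-path state in this model stops only at whitespace.
            if ch.isspace():
                break
        else:
            raise ValueError("unknown state: " + state)
        consumed += 1
    return text[:consumed]


tail = '/press.aspx</span><span lang="EN-GB">'

# Path as seen while parsing the URL as a whole (sub-parser starts in InFileFirst).
whole_url_path = scan(tail, "InFileFirst")
# Standalone path component before the patch (jumps straight to InURLPath).
old_component = scan(tail, "InURLPath")
# Standalone path component after the patch (also starts in InFileFirst).
new_component = scan(tail, "InFileFirst")
```

In this model the old component scan runs on through the tags until the first space, while the whole-URL scan stops at "<". Starting both in InFileFirst makes them agree.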

The other change in the attached patch is to make InURLPath parsing
stop at "<" or ">", as per discussion.
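
As a rough illustration of that stopping rule alone, here is a Python sketch. The url_path_token() helper and the "?id=1" input are hypothetical, and the model assumes the path state otherwise stops only at whitespace, which is a simplification of the real parser.

```python
def url_path_token(text, stop_at_angle_brackets):
    """Return the leading URL-path token of text under a simplified rule."""
    # With the flag set, "<" and ">" end the token in addition to whitespace.
    stop_chars = "<>" if stop_at_angle_brackets else ""
    end = 0
    for ch in text:
        if ch.isspace() or ch in stop_chars:
            break
        end += 1
    return text[:end]


sample = "/press.aspx?id=1</span> more text"  # hypothetical input
before = url_path_token(sample, stop_at_angle_brackets=False)
after = url_path_token(sample, stop_at_angle_brackets=True)
```

Here the old rule swallows the closing tag into the token, and the new rule ends the token just before "<", leaving the tag for the parser to recognize separately.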

With these changes I get

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |    description    |                 token                  | dictionaries | dictionary |                 lexemes
----------+-------------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head     | http://                                | {}           |            |
 url      | URL               | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host              | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 file     | File or path name | /press.aspx                            | {simple}     | simple     | {/press.aspx}
 tag      | XML tag           | </span>                                | {}           |            |
(5 rows)

as compared to the prior behavior

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |  description  |                 token                  | dictionaries | dictionary |                 lexemes
----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head | http://                                | {}           |            |
 url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 url_path | URL path      | /press.aspx</span>                     | {simple}     | simple     | {/press.aspx</span>}
(4 rows)

Neither change affects the current set of regression tests, but
nonetheless there's a potential compatibility issue here, so my thought
is to apply this only in HEAD.

Comments?

regards, tom lane

Attachment: url_path_fix.patch (text/x-patch, 2.0 KB)