Quick Links

Re: Bug with Tsearch and tsvector

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, "Donald Fraser" <postgres(at)kiwi-fraser(dot)net>, "[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: Bug with Tsearch and tsvector
Date:	2010-04-27 03:32:19
Message-ID:	22254.1272339139@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Hmm. Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL. Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction. The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example. It only goes to the URLPath state if,
while in FilePath state, it sees '?'. This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'. This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead. We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.

I find that this patch does create a couple of changes in the regression
test outputs. The reason is that it parses this case differently:
select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

alias | description | token | dictionaries | dictionary | lexemes
----------+---------------+--------------------+--------------+------------+----------------------
protocol | Protocol head | http:// | {} | |
host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
blank | Space symbols | /? | {} | |
(3 rows)

while with this patch we get

alias | description | token | dictionaries | dictionary | lexemes
----------+---------------+----------------------+--------------+------------+------------------------
protocol | Protocol head | http:// | {} | |
url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?}
host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
url_path | URL path | /? | {simple} | simple | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

regards, tom lane

Attachment	Content-Type	Size
url-path-fix-v2.patch	text/x-patch	2.6 KB

In response to

Re: Bug with Tsearch and tsvector at 2010-04-26 20:54:56 from Kevin Grittner

Responses

Re: Bug with Tsearch and tsvector at 2010-04-27 14:22:24 from Kevin Grittner

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Kevin Grittner	2010-04-27 14:22:24	Re: Bug with Tsearch and tsvector
Previous Message	Kevin Grittner	2010-04-26 20:58:11	Re: Bug with Tsearch and tsvector