Re: Bug with Tsearch and tsvector

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, "Donald Fraser" <postgres(at)kiwi-fraser(dot)net>, "[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug with Tsearch and tsvector
Date: 2010-04-27 03:32:19
Message-ID: 22254.1272339139@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Hmm. Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL. Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction. The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example. It only goes to the URLPath state if,
while in FilePath state, it sees '?'. This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'. This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead. We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.

I find that this patch does create a couple of changes in the regression
test outputs. The reason is that it parses this case differently:
select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

alias | description | token | dictionaries | dictionary | lexemes
----------+---------------+--------------------+--------------+------------+----------------------
protocol | Protocol head | http:// | {} | |
host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
blank | Space symbols | /? | {} | |
(3 rows)

while with this patch we get

alias | description | token | dictionaries | dictionary | lexemes
----------+---------------+----------------------+--------------+------------+------------------------
protocol | Protocol head | http:// | {} | |
url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?}
host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100}
url_path | URL path | /? | {simple} | simple | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

regards, tom lane

Attachment Content-Type Size
url-path-fix-v2.patch text/x-patch 2.6 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kevin Grittner 2010-04-27 14:22:24 Re: Bug with Tsearch and tsvector
Previous Message Kevin Grittner 2010-04-26 20:58:11 Re: Bug with Tsearch and tsvector