Skip site navigation (1) Skip section navigation (2)

Re: Bug with Tsearch and tsvector

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, "Donald Fraser" <postgres(at)kiwi-fraser(dot)net>, "[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug with Tsearch and tsvector
Date: 2010-04-27 03:32:19
Message-ID: 22254.1272339139@sss.pgh.pa.us (view raw or flat)
Thread:
Lists: pgsql-bugs
"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Hmm.  Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL.  Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction.  The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example.  It only goes to the URLPath state if,
while in FilePath state, it sees '?'.  This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'.  This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead.  We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.

I find that this patch does create a couple of changes in the regression
test outputs.  The reason is that it parses this case differently:
	select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

  alias   |  description  |       token        | dictionaries | dictionary |       lexemes        
----------+---------------+--------------------+--------------+------------+----------------------
 protocol | Protocol head | http://            | {}           |            | 
 host     | Host          | 5aew.werc.ewr:8100 | {simple}     | simple     | {5aew.werc.ewr:8100}
 blank    | Space symbols | /?                 | {}           |            | 
(3 rows)

while with this patch we get

  alias   |  description  |        token         | dictionaries | dictionary |        lexemes         
----------+---------------+----------------------+--------------+------------+------------------------
 protocol | Protocol head | http://              | {}           |            | 
 url      | URL           | 5aew.werc.ewr:8100/? | {simple}     | simple     | {5aew.werc.ewr:8100/?}
 host     | Host          | 5aew.werc.ewr:8100   | {simple}     | simple     | {5aew.werc.ewr:8100}
 url_path | URL path      | /?                   | {simple}     | simple     | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

			regards, tom lane


Attachment: url-path-fix-v2.patch
Description: text/x-patch (2.6 KB)

In response to

Responses

pgsql-bugs by date

Next:From: Kevin GrittnerDate: 2010-04-27 14:22:24
Subject: Re: Bug with Tsearch and tsvector
Previous:From: Kevin GrittnerDate: 2010-04-26 20:58:11
Subject: Re: Bug with Tsearch and tsvector

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group