Re: Bug with Tsearch and tsvector

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Donald Fraser" <postgres(at)kiwi-fraser(dot)net>, "[BUGS]" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Bug with Tsearch and tsvector
Date: 2010-04-26 20:54:56
Message-ID: 4BD5B7500200002500030E22@gw.wicourts.gov
Lists: pgsql-bugs

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at. In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character to allow hexadecimal specification of
any disallowed character. So, being the escape character itself, we
would need to allow it.

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form. If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics. Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.
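
To make the quoted paragraph concrete, here is a minimal sketch using Python's standard urllib.parse. The URI and variable names are made up for illustration; nothing here comes from the tsearch parser. It shows how unescaping a completed URI before splitting it changes the number of path segments:

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URI whose first path segment contains an escaped "/".
uri = "http://example.com/a%2Fb/c"

# Unsafe order: decoding the whole URI first turns %2F into a real
# path separator, so the path appears to have three segments.
decoded_first = urlsplit(unquote(uri)).path.split("/")[1:]

# Safe order: split into components and path segments first, then
# decode each segment on its own.
split_first = [unquote(seg) for seg in urlsplit(uri).path.split("/")[1:]]

print(decoded_first)  # ['a', 'b', 'c']
print(split_first)    # ['a/b', 'c']
```

The two results differ, which is the semantic change the RFC text warns about.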

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too. Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark. We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space = <US-ASCII coded character 20 hexadecimal>
| delims = "<" | ">" | "#" | "%" | <">
| unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that since % is the escape character, it is OK.
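
As a rough sketch of what stopping at these characters could look like (written in Python for brevity, not the parser's actual C state machine; url_path_end and the set names are hypothetical), the scan below ends a URL at the first excluded character while still allowing %:

```python
# Characters the RFC list above excludes, minus "%", which has to be
# allowed because it introduces an escape sequence.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#"')       # "%" deliberately left out
UNWISE = set("{}|\\^[]`")
EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE


def url_path_end(text, start=0):
    """Return the index of the first excluded character at or after start,
    or len(text) if there is none."""
    i = start
    while i < len(text) and text[i] not in EXCLUDED:
        i += 1
    return i


sample = "see <http://example.com/a%20b> now"
begin = sample.index("http")
print(sample[begin:url_path_end(sample, begin)])  # http://example.com/a%20b
```

This takes the RFC list literally, so it also stops at "#"; whether that is the right behavior is a separate question.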

Hmm. Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL. Should we consider the # and the anchor after it to be part of
a URL? Any other questionable characters?
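
For comparison only, and not as a proposal for the parser: Python's standard urllib.parse treats the # part as a fragment that is split off from the rest of the URL. The URL below is made up for illustration.

```python
from urllib.parse import urldefrag

# Hypothetical URL with an anchor.
url = "http://example.com/doc.html#sec2"

# urldefrag returns the URL without the fragment, plus the fragment.
base, fragment = urldefrag(url)

print(base)      # http://example.com/doc.html
print(fragment)  # sec2
```

If the # and anchor were kept as part of the token, a search on the bare URL would not match text that cites the same page with an anchor, so the choice affects what users can find.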

-Kevin
