Re: indexing longish string

From: Isaac Dover <isaacdover(at)gmail(dot)com>
To: Rob Sargent <robjsargent(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: indexing longish string
Date: 2010-11-30 19:21:40
Message-ID: AANLkTi=u6WiqozCu3ghLjSQQ_S8vCxejEnbAo8xhh5rF@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Hi,

While hashing is certainly a good idea, you really should consider some
issues well before you get to that point. Trust me, this could save you some
headaches. First, though you're probably already aware, two XML documents
can be the same document, but with very different literal representations.
By the XML spec, these two DOMs would be equal (readers would treat them
somewhat differently):

<aaa b="123" c="55" />
and
<aaa c="55" b="123" ></aaa>

If you're looking to compare the strings literally, then you have to make
sure that you canonicalize both the persisted string as well as the
comparison in your queries. I first encountered this problem when storing
XML as a vendor's data type of XML rather than varchar. Upon inserting, the
database engine parsed the string into DOM for the XML data type so that
XPath and XQuery statements could be processed. This is all fine and dandy
until I retrieved the literal representation. The document <xxx></xxx> was
transformed, as in the above example, to the literal <xxx />, but my select
was comparing using the former string. Adding to the problem was that pretty
much every tool that renders XML tends to hide these sort of differences in
an effort to be "friendly" (save Vim, of course :). It took a bit of time
for me to notice these small differences. Also keep in mind that document
order, namespaces, and space normalization are other important
considerations.

If you can store as DOM, you might experiment a bit to see if the process of
canonicalizing is more/less efficient than just XPath'ing. Though, indexing
a hash would probably be most efficient but at the risk of all that
jibberish I just typed.

More info here:
http://www.w3.org/TR/xml-c14n11/
http://www.w3.org/TR/2010/WD-xml-c14n2-20100831/

Thanks, and good luck!
Isaac

On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm(at)rice(dot)edu> wrote:

> You can use a hash index for this. It's drawback is that it is not
> yet WAL enabled and if your DB crashes you will need to rebuild the
> index to fix the corruption. It works well(only) with equality
> searches. If it is a scenario where you must have WAL, use a
> function index based on the hash of the string.
>
> Cheers,
> Ken

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Peter Eisentraut 2010-11-30 19:31:02 Re: How strings are sorted by LC_COLLATE specifically?
Previous Message Kenneth Marshall 2010-11-30 18:36:27 Re: indexing longish string