Re: indexing longish string

From: Isaac Dover <isaacdover(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org, Rob Sargent <robjsargent(at)gmail(dot)com>
Subject: Re: indexing longish string
Date: 2010-11-30 20:03:04
Message-ID: AANLkTikEpYW3icgh+qy6e+QnJjWOnrYfs4WW0zo_Xxjj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

No problem, sir, hopefully I could help. I failed to mention that I've
discovered some bugs in the PostgreSQL 8.4 XML implementation that forced me
to take pause and ultimately forego XML with Postgres. I haven't looked at 9
yet, but considering the current lack of interest and/or disdain so many
have for XML, I'm not anticipating great improvements. Of course, this is
not to knock the db - Pg is awesome. But I'm aware that we all have limited
time and there are other more important features that far more people
desire. So, you are probably better off using string literals with all of
the known risks. I've used some lame hashing techniques that simply stripped
out everything except [:alnum:] and then simply alphabetized the characters
and sometimes sent them through SHA/MD5 to get a unique representation. It
seemed to work fine for my purposes.

Or take the old, proven approach of shredding documents into normalized
tables and serializing them back out when you need them. ... ick !! ...

Thanks,
Isaac

On Tue, Nov 30, 2010 at 2:33 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:

> Hello Issac,
>
> Thanks a ton.
>
> In this bizarre little corner, I'm pretty sure I would use a text field
> rather than any xml-friendly datatype. The xml content itself will be
> trivial yet dynamic :). We need just a little more rigour than a hacky
> delimited string implementation, but not much. And we guarantee the
> repeatability of the string for a given parameter set.
>
> We could also do a manual tag-value implementation, and likely would if
> we thought we were ever going to ask questions like "everything with tag
> foo value less than ten", but that's not on the table. (Today...)
>
> On 11/30/2010 12:21 PM, Isaac Dover wrote:
> > Hi,
> >
> > While hashing is certainly a good idea, you really should consider some
> > issues well before you get to that point. Trust me, this could save you
> > some headaches. First, though you're probably already aware, two XML
> > documents can be the same document, but with very different literal
> > representations. By the XML spec, these two DOMs would be equal (readers
> > would treat them somewhat differently):
> >
> > <aaa b="123" c="55" />
> > and
> > <aaa c="55" b="123" ></aaa>
> >
> > If you're looking to compare the strings literally, then you have to
> > make sure that you canonicalize both the persisted string as well as the
> > comparison in your queries. I first encountered this problem when
> > storing XML as a vendor's data type of XML rather than varchar. Upon
> > inserting, the database engine parsed the string into DOM for the XML
> > data type so that XPath and XQuery statements could be processed. This
> > is all fine and dandy until I retrieved the literal representation. The
> > document <xxx></xxx> was transformed, as in the above example, to the
> > literal <xxx />, but my select was comparing using the former
> > string. Adding to the problem was that pretty much every tool that
> > renders XML tends to hide these sort of differences in an effort to be
> > "friendly" (save Vim, of course :). It took a bit of time for me to
> > notice these small differences. Also keep in mind that document order,
> > namespaces, and space normalization are other important considerations.
> >
> > If you can store as DOM, you might experiment a bit to see if the
> > process of canonicalizing is more/less efficient than just XPath'ing.
> > Though, indexing a hash would probably be most efficient but at the risk
> > of all that jibberish I just typed.
> >
> > More info here:
> > http://www.w3.org/TR/xml-c14n11/
> > http://www.w3.org/TR/2010/WD-xml-c14n2-20100831/
> >
> > Thanks, and good luck!
> > Isaac
> >
> > On Tue, Nov 30, 2010 at 1:36 PM, Kenneth Marshall <ktm(at)rice(dot)edu
> > <mailto:ktm(at)rice(dot)edu>> wrote:
> >
> > You can use a hash index for this. It's drawback is that it is not
> > yet WAL enabled and if your DB crashes you will need to rebuild the
> > index to fix the corruption. It works well(only) with equality
> > searches. If it is a scenario where you must have WAL, use a
> > function index based on the hash of the string.
> >
> > Cheers,
> > Ken
> >
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Jeff Bland 2010-11-30 22:16:53 DELETE WHERE EXISTS unexpected results
Previous Message Peter Eisentraut 2010-11-30 19:31:02 Re: How strings are sorted by LC_COLLATE specifically?