Re: Native XML

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: "Anton" <antonin(dot)houska(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-03-01 19:15:29
Message-ID: 4D6CF171020000250003B20E@gw.wicourts.gov
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> On 02/28/2011 05:28 PM, Kevin Grittner wrote:
>> Anton<antonin(dot)houska(at)gmail(dot)com> wrote:
>>
>>> it was actually the focal point of my considerations: whether to
>>> store plain text or 'something else'.
>
> There seems to be an almost universal assumption that storing XML
> in its native form (i.e. a text stream) is going to produce
> inefficient results.

Well, certainly not in all cases. Finding those rows which satisfy
an XPath search among a few million rows with 20KB XML fields might
benefit from some sort of indexing, though.

> unless we implemented our own XPath processor to work with our own
> XML format (do we really want to do that?), to evaluate an XPath
> expression for a piece of XML we'd actually need to produce the
> text format from our internal format before passing it to some
> external library to parse into its internal format and then
> process the XPath expression.

My suggestion was that you would store the text format, and allow
the developer to create a sharded format in a different column with
a different type if desired, not the other way around. As I said,
similar to what a developer would do for tsvector to allow text
searches. I agree that creating the text from an internal format
doesn't sound good.

>> Given that there were similar issues for other hierarchical data
>> types, perhaps we need something similar to tsvector, but for
>> hierarchical data. The extra layer of abstraction might not cost
>> much when used for XML compared to the possible benefit with
>> other data. It seems likely to be a very nice fit with GiST
>> indexes.
>>
>> So under this idea, you would always have the text (or maybe byte
>> array?) version of the XML, and you could "shard" it to a
>> separate column for fast searches.

> Tsearch should be able to handle XML now. It certainly knows how
> to recognize XML tags.

I apparently didn't express myself very well, since you seem to have
*completely* missed my point. I know we can do tsearch2 searches
against XML, or JSON, or YAML, or (insert next week's new favorite
format here). What we can't currently do efficiently is search for
particular values in some particular place in the hierarchy of a
document. I've had loads of fun approximating it with regular
expressions, but some days I'd like life to be easier.
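
To make the distinction concrete, here is a minimal Python sketch. The
sample document and its element names are invented for illustration,
and the two helper functions are hypothetical, not anything that exists
in PostgreSQL. A word search finds a token anywhere in the document; a
path query matches it only at one place in the hierarchy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are made up for illustration.
DOC = """
<case>
  <plaintiff><name>Smith</name></plaintiff>
  <defendant><name>Jones</name></defendant>
</case>
"""

def word_match(text, word):
    """Full-text style check: does the word occur anywhere in the document?"""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def path_match(text, path, value):
    """Hierarchy-aware check: does any element at `path` hold `value`?"""
    root = ET.fromstring(text)
    return any((el.text or "").strip() == value for el in root.findall(path))

# "Smith" occurs in the document, but not as the defendant's name.
word_hit = word_match(DOC, "Smith")
right_place = path_match(DOC, "./defendant/name", "Jones")
wrong_place = path_match(DOC, "./defendant/name", "Smith")
```

Note that path_match parses the whole document on every call, which is
exactly the per-row cost that becomes painful over a few million rows.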

What I was arguing for is a new type which would represent the
structure in a fashion which was independent of the particular text
format and was efficient to traverse hierarchically. Done right,
that would map well to GiST. Although, thinking about that some
more, perhaps there would be a way to create a GiST index suitable
for that straight from the XML text, and avoid the sharded column.
A GiST index actually seems pretty close to what such a structure
would look like anyway....
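
As a rough sketch of what "independent of the particular text format"
could mean, the Python below flattens a document into a set of
(path, value) pairs. This is only an assumption about one possible
representation, not a design for the type: the function names and
sample documents are hypothetical, and it ignores attributes, mixed
content, and sibling order. It also does not show how such pairs would
be fed to GiST.

```python
import json
import xml.etree.ElementTree as ET

def paths_from_xml(text):
    """Flatten XML leaf elements into a set of (path, value) pairs."""
    def walk(el, prefix):
        path = prefix + "/" + el.tag
        children = list(el)
        if not children:
            # Leaf element: record its text under the full path.
            yield (path, (el.text or "").strip())
        for child in children:
            yield from walk(child, path)
    return set(walk(ET.fromstring(text), ""))

def paths_from_json(text):
    """Flatten JSON scalars into the same (path, value) shape."""
    def walk(node, path):
        if isinstance(node, dict):
            for key, val in node.items():
                yield from walk(val, path + "/" + key)
        elif isinstance(node, list):
            # List items share the parent's path in this simplified sketch.
            for item in node:
                yield from walk(item, path)
        else:
            yield (path, str(node))
    return set(walk(json.loads(text), ""))

# The same hypothetical document in two text formats.
XML_DOC = "<case><defendant><name>Jones</name></defendant></case>"
JSON_DOC = '{"case": {"defendant": {"name": "Jones"}}}'

xml_paths = paths_from_xml(XML_DOC)
json_paths = paths_from_json(JSON_DOC)
```

Both inputs reduce to the same pairs, so a search for a value at a
given path would not need to know which text format the row was
stored in.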

-Kevin
