Re: patch: function xmltable

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch: function xmltable
Date: 2016-09-12 06:46:26
Message-ID: CAMsr+YF4nv+oT6Pk38CUiD5eEoJNLY=JrAs_xRp3CPqbXO2kjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12 September 2016 at 13:07, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:

>> Out of interest, should the syntax allow room for future expansion to
>> permit reading from file rather than just string literal / column
>> reference? It'd be ideal to avoid reading big documents wholly into
>> memory when using INSERT INTO ... SELECT XMLTABLE (...) . I don't
>> suggest adding that to this patch, just making sure adding it later
>> would not cause problems.
>
>
> this is little bit different question - it is server side function, so first
> question is - how to push usually client side content to server? Next
> question is how to get this content to a executor. Now only COPY statement
> is able to do.

Probably start with support for server-side files. When people are
dealing with really big files they'll be more willing to copy files to
the server or bind them into the server file system over the network.

The v3 protocol doesn't really allow any way for client-to-server
streaming during a query, I think that's hopeless until we have a
protocol bump.

> updated patch attached - with your documentation.

>> - Can you send the sample data used to generate the example output?
>> I'd like to include at least a cut down part of it in the docs to make
>> it clear how the input correlates with output, and preferably put the
>> whole thing in an appendix.
>
>
> it is in regress tests.

Makes sense.

It's not that verbose (for XML) and I wonder if it's just worth
including it in-line in the docs along with the XMLTABLE example. It'd
be much easier to understand how XMLTABLE works and what it does then.

>> - What effect does xmlnamespaces clause have? Does supplying it allow
>> you to reference qualified names in xpath? What happens if you don't
>> specify it for a document that has namespaces or don't define all the
>> namespaces? What if you reference an undefined namespace in xpath?
>> What about if an undefined namespace isn't referenced by xpath, but is
>> inside a node selected by an xpath expression?
>
>
> All this is under libxml2 control - when you use undefined namespace, then
> libxml2 raises a error. The namespaces in document and in XPath queries are
> absolutely independent - the relation is a URI. When you use bad URI
> (referenced by name), then the result will be empty set. When you use
> undefined name, then you will get a error.

OK, makes sense.

>> - What are the rules for converting the matched XML node into a value?
>> If the matched node is not a simple text node or lacks a text node as
>> its single child, what happens?
>
> This process is described and controlled by "XML SQL mapping". The Postgres
> has minimalistic implementation without possibility of external control and
> without schema support. The my implementation is simple. When user doesn't
> specify result target like explicit using of text() function, then the
> text() function is used implicitly when target type is not XML. Then I dump
> result to string and I enforce related input functions for target types.

OK, so a subset of the full spec functionality is provided because of
limitations in Pg and libxml2. Makes sense.

My only big concern here is that use of text() is a common mistake in
XSLT, and I think the same thing will happen here. Users expect
comments to be ignored, but in fact a comment inserts a comment node
into the XML DOM, so a comment between two pieces of text produces a
tree of

element
text()
comment()
text()

If you match element/text(), you get a 2-node result and will presumably ERROR.

There is no good way to tell this from

element
text()
element
text()

when you use an xpath expression like element/text() . So you can't
safely solve it just by concatenating all resulting text() nodes
without surprising behaviour.

>> - What happens if the matched has multiple text node children? This
>> can happen if, for example, you have something like
>>
>> <matchedNode>
>> some text <!-- comment splits up text node --> other text
>> </matchedNode>
>
>
> depends on target type - it is allowed in XML, and it is disallowed for
> other types. I though about support of a arrays - but the patch will be much
> more complex - there can be recursion - so I disallowed it. When the user
> have to solve this issue, then he can use nested XMLTABLE functions and
> nested function is working with XML type.

I don't really understand how that'd work.

Do you know how other implementations handle this?

I think users are going to be VERY surprised when comments in text
break their XML.

>> - Is there a way to get an attribute as a value? If so, an example
>> should show this because it's going to be a common need. Presumably
>> you want node/@attrname ?
>
>
> you can use reference to current node "." - so "./@attname" should to work -
> a example is in regress tests

cool, just needs mention in docs then.

>> - PASSING clause isn't really defined. You can specify one PASSING
>> entry as a literal/colref/expression, and it's the argument xml
>> document, right? The external docs you referred to say that PASSING
>> may have a BY VALUE keyword, alias its argument with AS, and may have
>> expressions, e.g.
>>
>> PASSING BY VALUE '<x/>' AS a, '<y/>' AS b
>>
>> Neither aliases nor multiple entries are supported by the code or
>> grammar. Should this be documented as a restriction?
>
>
> The ANSI allows to pass more documents - and then do complex queries with
> XQuery. Passing more than one document has not sense in libxml2 based
> implementation, so I didn't supported it. The referenced names can be
> implemented later - but it needs to changes in XPATH function too.

OK, so my docs addition that just says they're not supported should be fine.

>> - What does BY REF mean? Should this just be mentioned with a "see
>> xmlexists(...)" since it seems to be compatibility noise? Is there a
>> corresponding BY VALUE or similar?
>
> When the XML document is stored as serialized DOM, then by ref means link on
> this DOM. It has not sense in Postgres - because we store XML documents by
> value only.

Right. And since there's already precent for xmlexists there's no
point worrying about whether we lose opportunities to implement it
later.

>> - Why do the expression arguments take c_expr (anything allowed in
>> a_expr or b_expr), not b_expr (restricted expression) ?
>
>
> I don't know - I expect the problems with parser - because PASSING is
> restricted keyword in ANSI/SQL and unreserved keyword in Postgres.

I mean for the rowpath argument, not the parts within
xmlexists_argument. If the rowpath doesn't need to be c_expr
presumably it should be a b_expr or even, if it doesn't cause parsing
ambiguities, an a_expr ? There doesn't seem to be the same issue here
as we have with BETWEEN etc.

>> - Column definitions are underdocumented. The grammar says they can be
>> NOT NULL, for example, but I don't see that in any of the references
>> you mailed me nor in the docs. What behaviour is expected for a NOT
>> NULL column? I've documented my best guess (not checked closely
>> against the code, please verify).
>>
>
> yes - some other databases allows it - I am thinking so it is useful.

Sure. Sounds like my docs additions are probably right then, except
for incorrect description of DEFAULT.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Prabhat Sahu 2016-09-12 06:50:45 Re: Aggregate Push Down - Performing aggregation on foreign server
Previous Message Simon Riggs 2016-09-12 06:30:03 Re: Re: [COMMITTERS] pgsql: Use LEFT JOINs in some system views in case referenced row doesn