Re: patch: function xmltable

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch: function xmltable
Date: 2016-09-24 06:01:33
Message-ID: CAFj8pRBZyamLA=BvAtXGq5pSBEG_N9we+uKy9rtFXedYq4rZQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

2016-09-23 10:05 GMT+02:00 Craig Ringer <craig(at)2ndquadrant(dot)com>:

> On 22 September 2016 at 02:31, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
>
> > another small update - fix XMLPath parser - support multibytes characters
>
> I'm returning for another round of review.
>
> The code doesn't handle the 5 XML built-in entities correctly in
> text-typed output. It processes &apos; and &quot; but not &amp, &lt or
> &gt; . See added test. I have not fixed this, but I think it's clearly
> broken:
>
>
> + -- XML builtin entities
> + SELECT * FROM xmltable('/x/a' PASSING
> '<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><
> ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
> COLUMNS ent text);
> + ent
> + -------
> + '
> + "
> + &amp;
> + &lt;
> + &gt;
> + (5 rows)
>
> so I've adjusted the docs to claim that they're expanded. The code
> needs fixing to avoid entity-escaping when the output column type is
> not 'xml'.
>
>
fixed

>
> &apos; and &quot; entities in xml-typed output are expanded, not
> preserved. I don't know if this is intended but I suspect it is:
>
> + SELECT * FROM xmltable('/x/a' PASSING
> '<x><a><ent>&apos;</ent></a><a><ent>&quot;</ent></a><a><
> ent>&amp;</ent></a><a><ent>&lt;</ent></a><a><ent>&gt;</ent></a></x>'
> COLUMNS ent xml);
> + ent
> + ------------------
> + <ent>'</ent>
> + <ent>"</ent>
> + <ent>&amp;</ent>
> + <ent>&lt;</ent>
> + <ent>&gt;</ent>
> + (5 rows)
>
>
> For the docs changes relevant to the above search for "The five
> predefined XML entities". Adjust that bit of docs if I guessed wrong
> about the intended behaviour.
>
> The tests don't cover CDATA or PCDATA . I didn't try to add that, but
> they should.
>
>
appended

> Did some docs copy-editing and integrated some examples. Explained how
> nested elements work, that multiple top level elements is an error,
> etc. Explained the time-of-evaluation stuff. Pointed out that you can
> refer to prior output columns in PATH and DEFAULT, since that's weird
> and unusual compared to normal SQL. Documented handling of multiple
> node matches, including the surprising results of somepath/text() on
> <somepath>x<!--blah-->y</somepath>. Documented handling of nested
> elements. Documented that xmltable works only on XML documents, not
> fragments/forests.
>

I don't understand to this sentence: "It is possible for a PATH expression
to reference output columns that appear before it in the column-list, so
paths may be dynamically constructed based on other parts of the XML
document:"

> Regarding evaluation time, it struck me that evaluating path
> expressions once per row means the xpath must be parsed and processed
> once per row. Isn't it desirable to store and re-use the preparsed
> xpath? I don't think this is a major problem, since we can later
> detect stable/immutable expressions including constants, evaluate only
> once in that case, and cache. It's just worth thinking about.
>

Probably it is possible - it is exactly how you wrote - it needs to check
the change. We can try do some possible performance optimizations later -
without compatibility issues. Now, I prefer the most simple code.

a note: PATH expression is evaluated for any **input** row. In same moment
is processed row path expression and man XML document DOM parsing. So
overhead of PATH expression and PATH parsing should not be dominant.

>
> The docs and tests don't seem to cover XML entities. What's the
> behaviour there? Core XML only defines one entity, but if a schema
> defines more how are they processed? The tests need to cover the
> predefined entities &quot; &amp; &apos; &lt; and &gt; at least.
>

I don't understand, what you are propose here. ?? Please, can you send some
examples.

>
> I have no idea whether the current code can fetch a DTD and use any
> <!ENTITY > declarations to expand entities, but I'm guessing not? If
> not, external DTDs, and internal DTDs with external entities should be
> documented as unsupported.
>
> It doesn't seem to cope with internal DTDs at all (libxml2 limitation?):
>
> SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0"
> standalone="yes" ?>
> <!DOCTYPE foo [
> <!ELEMENT foo (#PCDATA)>
> <!ENTITY pg "PostgreSQL">
> ]>
> <foo>Hello &pg;.</foo>
> $XML$ COLUMNS foo text);
>
> + ERROR: invalid XML content
> + LINE 1: SELECT * FROM xmltable('/' PASSING $XML$<?xml version="1.0" ...
> + ^
> + DETAIL: line 2: StartTag: invalid element name
> + <!DOCTYPE foo [
> + ^
> + line 3: StartTag: invalid element name
> + <!ELEMENT foo (#PCDATA)>
> + ^
> + line 4: StartTag: invalid element name
> + <!ENTITY pg "PostgreSQL">
> + ^
> + line 6: Entity 'pg' not defined
> + <foo>Hello &pg;.</foo>
> + ^
>
>
It is rejected before XMLTABLE function call

postgres=# select $XML$<?xml version="1.0" standalone="yes" ?>
postgres$# <!DOCTYPE foo [
postgres$# <!ELEMENT foo (#PCDATA)>
postgres$# <!ENTITY pg "PostgreSQL">
postgres$# ]>
postgres$# <foo>Hello &pg;.</foo>
postgres$# $XML$::xml;
ERROR: invalid XML content
LINE 1: select $XML$<?xml version="1.0" standalone="yes" ?>
^
DETAIL: line 2: StartTag: invalid element name
<!DOCTYPE foo [
^
line 3: StartTag: invalid element name
<!ELEMENT foo (#PCDATA)>
^
line 4: StartTag: invalid element name
<!ENTITY pg "PostgreSQL">
^
line 6: Entity 'pg' not defined
<foo>Hello &pg;.</foo>
^
It is disabled by default in libxml2. I found a function
xmlSubstituteEntitiesDefault http://www.xmlsoft.org/entities.html
http://www.xmlsoft.org/html/libxml-parser.html#xmlSubstituteEntitiesDefault

The default behave should be common for all PostgreSQL's libxml2 based
function - and then it is different topic - maybe part for PostgreSQL ToDo?
But I don't remember any user requests related to this issue.

>
> libxml seems to support documents with internal DTDs:
>
> $ xmllint --valid /tmp/x
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE foo [
> <!ELEMENT foo (#PCDATA)>
> <!ENTITY pg "PostgreSQL">
> ]>
> <foo>Hello &pg;.</foo>
>

I removed this tests - it is not related to XMLTABLE function, but to
generic XML processing/validation.

>
>
> so presumably the issue lies in the xpath stuff? Note that it's not
> even ignoring the DTD and choking on the undefined entity, it's
> choking on the DTD its self.
>
>
> OK, code comments:
>
>
> In +ExecEvalTableExpr, shouldn't you be using PG_ENSURE_ERROR_CLEANUP
> instead of a PG_TRY() / PG_CATCH() block?
>

If I understand to doc, the PG_ENSURE_ERROR_CLEANUP should be used, when
you want to catch FATAL errors (and when you want to clean shared memory).
XMLTABLE doesn't use shared memory, and doesn't need to catch fatal errors.

>
> I think the new way you handle the type stuff is much, much better,
> and with comments to explain too. Thanks very much.
>
>
> There's an oversight in tableexpr vs xmltable separation here:
>
> + case T_TableExpr:
> + *name = "xmltable";
> + return 2;
>
> presumably you need to look at the node and decide what kind of table
> expression it is or just use a generic "tableexpr".
>
> Same problem here:
>
> + case T_TableExpr:
> + {
> + TableExpr *te = (TableExpr *) node;
> +
> + /* c_expr shoud be closed in brackets */
> + appendStringInfoString(buf, "XMLTABLE(");
>
>
commented

>
>
> I don't have the libxml knowledge or remaining brain to usefully
> evaluate the xpath and xml specifics in xpath.c today. It does strike
> me that the new xpath parser should probably live in its own file,
> though.
>

moved

>
> I think this is all a big improvement. Barring the notes above and my
> lack of review of the guts of the xml.c parts of it, I'm pretty happy
> with what I see now.
>

new version is attached

Regards

Pavel

>
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

Attachment Content-Type Size
xmltable-10.patch.gz application/gzip 28.2 KB
xmltable-diff10.diff text/x-patch 34.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-09-24 06:42:16 raised checkpoint limit & manual checkpoint
Previous Message Amit Kapila 2016-09-24 05:03:01 Re: store narrow values in hash indexes?