Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

From: Markus Winand <markus(dot)winand(at)winand(dot)at>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
Date: 2018-04-30 12:03:46
Message-ID: 84F5DB35-DDE8-4EED-BD2A-85F45D8666FC@winand.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi and thanks for your efforts.

> On 2018-04-26, at 21:18 , Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
> Hello Markus,
>
> Markus Winand wrote:
>
>> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.
>>
>>> Column_expressions that match TEXT or CDATA nodes must return
>>> the contents of the node itself, not the content of the
>>> non-existing childs (i.e. the empty string).
>>
>> The following query returns a wrong result in master:
>>
>> SELECT *
>> FROM (VALUES ('<xml>text</xml>'::xml)
>> , ('<xml><![CDATA[<cdata>]]></xml>'::xml)
>> ) d(x)
>> CROSS JOIN LATERAL
>> XMLTABLE('/xml'
>> PASSING x
>> COLUMNS "node()" TEXT PATH 'node()'
>> ) t
>
>> The correct result can be obtained with patch 0001 applied:
>>
>> x | node()
>> --------------------------------+---------
>> <xml>text</xml> | text
>> <xml><![CDATA[<cdata>]]></xml> | <cdata>
>>
>> The patch adopts existing tests to guard against future regressions.
>
> I remembered while reading this that Craig Ringer had commented that XML
> text sections can have multiple children: just put XML comments amidst
> the text.

As per my understanding of XML, this is a sibling—not a child—of two text nodes.

> To test this, I added a comment in one of the text values, in
> the regression database:
>
> update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml;
>
> With the modified data, the new query in the regression tests fails:
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data
> COLUMNS id int PATH '@id',
> _id FOR ORDINALITY,
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> country_id text PATH 'COUNTRY_ID',
> region_id int PATH 'REGION_ID',
> size float PATH 'SIZE',
> unit text PATH 'SIZE/@unit',
> premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');
>
> ERROR: more than one value returned by column Path expression

I’ve tried matching “node()” against XML that includes a mixture of text and
comment nodes in other database products. Two out of two failed with a
similar error message.

- ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence - got multi-item sequence
- SQLCODE=-16003, SQLSTATE=10507, SQLERRMC=( item(), item()+ )
See: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.messages.sql.doc/doc/msql16003n.html

So I digged through the standard to figure out what the standard-mandated
behaviour is.

The bottom line is that an error is the standard mandated behavior because
SQL uses a XPath/XQuery “cast" in the procedure. The XPath/XQuery spec
says “If the result of atomization is a sequence of more than one atomic value,
a type error is raised [err:XPTY0004]” (compare that to the ORA- above).
https://www.w3.org/TR/xpath-31/#id-cast

For reference, how I came there:
The XPath/XQuery cast is triggered for non XML types in
XMLCAST (SQL-14:2011 6.6 GR 4 h iv.) which is in turn triggered by
XMLTABLE (SQL-14:2011 7.1 SR 4 e ii 8).
[Note: I only have SQL-14:2011, thus no references to :2016]

> This seems really undesirable, so I looked into getting it fixed. Turns
> out that this error is being thrown from inside the same function we're
> modifying, line 4559. I didn't have a terribly high opinion of that
> ereport() when working on this feature to begin with, so now that it's
> proven to provide a bad experience, let's try removing it. With that
> change (patch attached), the feature seems to work; I tried with this
> query, which seems to give the expected output (notice we have some
> columns of type xml, others of type text, with and without the text()
> function in the path):
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data COLUMNS
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL,
> country_name_n text PATH 'COUNTRY_NAME' NOT NULL,
> country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL);
> country_name │ country_name_xml │ country_name_n │ country_name_xml_n
> ────────────────┼──────────────────┼────────────────┼───────────────────────────────────────────────────────
> Australia │ Australia │ Australia │ <COUNTRY_NAME>Australia</COUNTRY_NAME>
> China │ China │ China │ <COUNTRY_NAME>China</COUNTRY_NAME>
> HongKong │ HongKong │ HongKong │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME>
> India │ India │ India │ <COUNTRY_NAME>India</COUNTRY_NAME>
> Japan │ Japan │ Japan │ <COUNTRY_NAME>Japan</COUNTRY_NAME>
> Singapore │ Singapore │ Singapore │ <COUNTRY_NAME>Singapore</COUNTRY_NAME>
> Czech Republic │ Czech Republic │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME>
> Germany │ Germany │ Germany │ <COUNTRY_NAME>Germany</COUNTRY_NAME>
> France │ France │ France │ <COUNTRY_NAME>France</COUNTRY_NAME>
> Egypt │ Egypt │ Egypt │ <COUNTRY_NAME>Egypt</COUNTRY_NAME>
> Sudan │ Sudan │ Sudan │ <COUNTRY_NAME>Sudan</COUNTRY_NAME>
> (11 filas)
>
>
> This means that the concatenation now works for all types, not just xml, so I
> can do this also:
>
> update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml;
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data
> COLUMNS
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> size_text float PATH 'SIZE/text()',
> "SIZE" float, size_xml xml PATH 'SIZE');
> country_name │ size_text │ SIZE
> ────────────────┼───────────┼──────
> Australia │ │
> China │ │
> HongKong │ │
> India │ │
> Japan │ │
> Singapore │ 791 │ 791
> Czech Republic │ │
> Germany │ │
> France │ │
> Egypt │ │
> Sudan │ │
> (11 filas)
>
> I'm not sure if this concatenation of things that are not text or XML is
> undesirable for whatever reason. Any clues?

As per above the reason against it is standard conformance.

> Also, array indexes behave funny. First let's add more XML comments
> inside that number, and query for the subscripts:
>
> update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml;
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data
> COLUMNS
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> size_text float PATH 'SIZE/text()',
> size_text_1 float PATH 'SIZE/text()[1]',
> size_text_2 float PATH 'SIZE/text()[2]',
> "SIZE" float, size_xml xml PATH 'SIZE')
> where size_text is not null;
>
> country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml
> ──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
> Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
> (1 fila)
>
> I don't know what to make of that. Seems pretty broken.

Absolutely.

Also, node() matching comments or processing instructions
seems to be broken too:

SELECT *
FROM (VALUES ('<xml><!--comment--></xml>'::xml)
, ('<xml><?pi content?></xml>'::xml)
) d(x)
CROSS JOIN LATERAL
XMLTABLE('/xml'
PASSING x
COLUMNS "node()" TEXT PATH 'node()'
) t

x | node()
---------------------------+--------
<xml><!--comment--></xml> |
<xml><?pi content?></xml> |
(2 rows)

I can look into this, but it may take a while.

As per my understanding explained above, I’d base future work on my original patch.

> After this, I looked for some examples of XPath syntax in the interwebs.
> I came across the | operator (which apparently is a "union" of some
> sort). Behold:
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data
> COLUMNS
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> size_text text PATH 'SIZE/text() | SIZE/@unit')
> where size_text is not null ;
>
> country_name │ size_text
> ──────────────┼───────────
> Singapore │ km791
> (1 fila)
>
> The "units" property is ahead of the text(), which is pretty odd.

As per above, I think this is outside the spec (as it matches several nodes).

Other than that it is related to the “document order” in which XPath
generally returns matched nodes (except a few axis that return their matched
in reverse order—e.g. ancestor::)

> But if I
> remove the text() call, it puts the units after the text:
>
> SELECT xmltable.*
> FROM (SELECT data FROM xmldata) x,
> LATERAL XMLTABLE('/ROWS/ROW'
> PASSING data
> COLUMNS
> country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
> size_text text PATH 'SIZE | SIZE/@unit')
> where size_text is not null ;
> country_name │ size_text
> ──────────────┼─────────────────────────────────────────────────────────
> Singapore │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km
> (1 fila)
>
> Again, I don't know what to make of this.

As per above, the document order is:

1. <size> opening tag
2. the @unit attribute for size
3. the text() inside the size.

Now that you match (1) instead of (3), it is returned first.

> Anyway, this seems firmly in
> the libxml side of things; the only conclusion I have is that nobody
> ever uses libxml terribly much for complex XPath processing.
>
>
> Basing another test on your original test case, look at the first row
> here. Is this result correct?
>
> SELECT *
> FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
> , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
> ) d(x)
> CROSS JOIN LATERAL
> XMLTABLE('/xml'
> PASSING x
> COLUMNS "node()" TEXT PATH 'node()'
> ) t;
> x │ node()
> ───────────────────────────────────────────────────────────┼────────────────────────────────────
> <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt
> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>
>
> Why was the comment contents expanded inside node()?

Because node() matches also comments and your patches concatenates all matches.

> Note what happens
> if I change the type from text to xml in that column:
>
> SELECT *
> FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
> , ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
> ) d(x)
> CROSS JOIN LATERAL
> XMLTABLE('/xml'
> PASSING x
> COLUMNS "node()" xml PATH 'node()'
> ) t;
>
> x │ node()
> ───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
> <xml>te<!-- ahoy -->xt</xml> │ te ahoy xt
> <xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some &lt;!-- really --&gt; weird &lt;stuff&gt;
> (2 filas)

The comment seems to be wrong.

I guess it’s fine if the CDATA gets transformed in to an equivalent
string using the XML entities. Yet, it might be better avoiding it.

Will look into it, but may take a while.

> Further note that, per the standard, you can omit the PATH clause in
> which case the column name is used as the path, which seems to work
> correctly. Phew!
>
>> * Patch 0002: Set correct context for XPath evaluation.
>>
>>> According to the ISO standard, the context of XMLTABLE's XPath
>>> row_expression is the document node of the XML[1] - not the root node.
>>>
>>> The respective code is shared with non-ISO functions such as xpath
>>> and xpath_exists so that the change also affects these functions.
>>
>> It’s an incompatible change - even the regression tests need to be adjusted because they
>> test for the “wrong” behaviour.
>
> I'm really afraid to change the non-ISO functions in PG10, since
> it's already released for quite some time with this long-standing
> behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10
> and leave the non-iso functions unpatched. I think the easiest route is
> to patch all functions only for PostgreSQL 11. Maybe if XMLTABLE is
> considered unacceptably broken in PostgreSQL 10 we can patch only that
> one.

I’m in favour of doing all at once in PG11.

As the XMLTABLE examples in the docs consistently use absolute XPath
expressions, there is hope that people do it this way so that they won’t be
affected by the patch.

When I stumbled upon this issue, I first though that the context is not set
at all. I did not even come to my mind that the context could be there,
but wrongly set. I only found out that it actually is set when I made the patch.

In other words, I would not have made code relying on the wrong behaviour.
Unfortunately, this argument doesn’t hold for the non-ISO functions, which
are also longer in PostgreSQL.

-markus

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Adrien Nayrat 2018-04-30 14:32:26 Re: Explain buffers wrong counter with parallel plans
Previous Message Pierre Ducroquet 2018-04-30 11:05:36 Re: [RFC] Add an until-0 loop in psql