Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Markus Winand <markus(dot)winand(at)winand(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context
Date: 2018-04-26 19:18:28
Message-ID: 20180426191828.4wqpki2slu2ec6n7@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Markus,

Markus Winand wrote:

> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.
>
> > Column_expressions that match TEXT or CDATA nodes must return
> > the contents of the node itself, not the content of the
> > non-existing childs (i.e. the empty string).
>
> The following query returns a wrong result in master:
>
> SELECT *
> FROM (VALUES ('<xml>text</xml>'::xml)
> , ('<xml><![CDATA[<cdata>]]></xml>'::xml)
> ) d(x)
> CROSS JOIN LATERAL
> XMLTABLE('/xml'
> PASSING x
> COLUMNS "node()" TEXT PATH 'node()'
> ) t

> The correct result can be obtained with patch 0001 applied:
>
> x | node()
> --------------------------------+---------
> <xml>text</xml> | text
> <xml><![CDATA[<cdata>]]></xml> | <cdata>
>
> The patch adopts existing tests to guard against future regressions.

I remembered while reading this that Craig Ringer had commented that XML
text sections can have multiple children: just put XML comments amidst
the text. To test this, I added a comment in one of the text values, in
the regression database:

update xmldata set data = regexp_replace(data::text, 'HongKong', 'Hong<!-- a space -->Kong')::xml;

With the modified data, the new query in the regression tests fails:

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data
COLUMNS id int PATH '@id',
_id FOR ORDINALITY,
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
country_id text PATH 'COUNTRY_ID',
region_id int PATH 'REGION_ID',
size float PATH 'SIZE',
unit text PATH 'SIZE/@unit',
premier_name text PATH 'PREMIER_NAME' DEFAULT 'not specified');

ERROR: more than one value returned by column XPath expression

This seems really undesirable, so I looked into getting it fixed. Turns
out that this error is being thrown from inside the same function we're
modifying, line 4559. I didn't have a terribly high opinion of that
ereport() when working on this feature to begin with, so now that it's
proven to provide a bad experience, let's try removing it. With that
change (patch attached), the feature seems to work; I tried with this
query, which seems to give the expected output (notice we have some
columns of type xml, others of type text, with and without the text()
function in the path):

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data COLUMNS
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
country_name_xml xml PATH 'COUNTRY_NAME/text()' NOT NULL,
country_name_n text PATH 'COUNTRY_NAME' NOT NULL,
country_name_xml_n xml PATH 'COUNTRY_NAME' NOT NULL);
country_name │ country_name_xml │ country_name_n │ country_name_xml_n
────────────────┼──────────────────┼────────────────┼───────────────────────────────────────────────────────
Australia │ Australia │ Australia │ <COUNTRY_NAME>Australia</COUNTRY_NAME>
China │ China │ China │ <COUNTRY_NAME>China</COUNTRY_NAME>
HongKong │ HongKong │ HongKong │ <COUNTRY_NAME>Hong<!-- a space -->Kong</COUNTRY_NAME>
India │ India │ India │ <COUNTRY_NAME>India</COUNTRY_NAME>
Japan │ Japan │ Japan │ <COUNTRY_NAME>Japan</COUNTRY_NAME>
Singapore │ Singapore │ Singapore │ <COUNTRY_NAME>Singapore</COUNTRY_NAME>
Czech Republic │ Czech Republic │ Czech Republic │ <COUNTRY_NAME>Czech Republic</COUNTRY_NAME>
Germany │ Germany │ Germany │ <COUNTRY_NAME>Germany</COUNTRY_NAME>
France │ France │ France │ <COUNTRY_NAME>France</COUNTRY_NAME>
Egypt │ Egypt │ Egypt │ <COUNTRY_NAME>Egypt</COUNTRY_NAME>
Sudan │ Sudan │ Sudan │ <COUNTRY_NAME>Sudan</COUNTRY_NAME>
(11 filas)

This means that the concatenation now works for all types, not just xml, so I
can do this also:

update xmldata set data = regexp_replace(data::text, '791', '7<!--small country-->91')::xml;

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data
COLUMNS
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
size_text float PATH 'SIZE/text()',
"SIZE" float, size_xml xml PATH 'SIZE');
country_name │ size_text │ SIZE
────────────────┼───────────┼──────
Australia │ │
China │ │
HongKong │ │
India │ │
Japan │ │
Singapore │ 791 │ 791
Czech Republic │ │
Germany │ │
France │ │
Egypt │ │
Sudan │ │
(11 filas)

I'm not sure if this concatenation of things that are not text or XML is
undesirable for whatever reason. Any clues?

Also, array indexes behave funny. First let's add more XML comments
inside that number, and query for the subscripts:

update xmldata set data = regexp_replace(data::text, '7<!--small country-->91', '<!--ah-->7<!--oh-->9<!--uh-->1')::xml;

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data
COLUMNS
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
size_text float PATH 'SIZE/text()',
size_text_1 float PATH 'SIZE/text()[1]',
size_text_2 float PATH 'SIZE/text()[2]',
"SIZE" float, size_xml xml PATH 'SIZE')
where size_text is not null;

country_name │ size_text │ size_text_1 │ size_text_2 │ size_text_3 │ SIZE │ size_xml
──────────────┼───────────┼─────────────┼─────────────┼─────────────┼──────┼───────────────────────────────────────────────────────
Singapore │ 791 │ 791 │ 91 │ 1 │ 791 │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>
(1 fila)

I don't know what to make of that. Seems pretty broken.

After this, I looked for some examples of XPath syntax in the interwebs.
I came across the | operator (which apparently is a "union" of some
sort). Behold:

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data
COLUMNS
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
size_text text PATH 'SIZE/text() | SIZE/@unit')
where size_text is not null ;

country_name │ size_text
──────────────┼───────────
Singapore │ km791
(1 fila)

The "units" property is ahead of the text(), which is pretty odd. But if I
remove the text() call, it puts the units after the text:

SELECT xmltable.*
FROM (SELECT data FROM xmldata) x,
LATERAL XMLTABLE('/ROWS/ROW'
PASSING data
COLUMNS
country_name text PATH 'COUNTRY_NAME/text()' NOT NULL,
size_text text PATH 'SIZE | SIZE/@unit')
where size_text is not null ;
country_name │ size_text
──────────────┼─────────────────────────────────────────────────────────
Singapore │ <SIZE unit="km"><!--ah-->7<!--oh-->9<!--uh-->1</SIZE>km
(1 fila)

Again, I don't know what to make of this. Anyway, this seems firmly in
the libxml side of things; the only conclusion I have is that nobody
ever uses libxml terribly much for complex XPath processing.

Basing another test on your original test case, look at the first row
here. Is this result correct?

SELECT *
FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
, ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
) d(x)
CROSS JOIN LATERAL
XMLTABLE('/xml'
PASSING x
COLUMNS "node()" TEXT PATH 'node()'
) t;
x │ node()
───────────────────────────────────────────────────────────┼────────────────────────────────────
<xml>te<!-- ahoy -->xt</xml> │ te ahoy xt
<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some <!-- really --> weird <stuff>

Why was the comment contents expanded inside node()? Note what happens
if I change the type from text to xml in that column:

SELECT *
FROM (VALUES ('<xml>te<!-- ahoy -->xt</xml>'::xml)
, ('<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml>'::xml)
) d(x)
CROSS JOIN LATERAL
XMLTABLE('/xml'
PASSING x
COLUMNS "node()" xml PATH 'node()'
) t;

x │ node()
───────────────────────────────────────────────────────────┼────────────────────────────────────────────────
<xml>te<!-- ahoy -->xt</xml> │ te ahoy xt
<xml><![CDATA[some <!-- really --> weird <stuff>]]></xml> │ some &lt;!-- really --&gt; weird &lt;stuff&gt;
(2 filas)

Further note that, per the standard, you can omit the PATH clause in
which case the column name is used as the path, which seems to work
correctly. Phew!

> * Patch 0002: Set correct context for XPath evaluation.
>
> > According to the ISO standard, the context of XMLTABLE's XPath
> > row_expression is the document node of the XML[1] - not the root node.
> >
> > The respective code is shared with non-ISO functions such as xpath
> > and xpath_exists so that the change also affects these functions.
>
> It’s an incompatible change - even the regression tests need to be adjusted because they
> test for the “wrong” behaviour.

I'm really afraid to change the non-ISO functions in PG10, since
it's already released for quite some time with this long-standing
behavior; and I'm not sure it'll do much good to change XMLTABLE in PG10
and leave the non-iso functions unpatched. I think the easiest route is
to patch all functions only for PostgreSQL 11. Maybe if XMLTABLE is
considered unacceptably broken in PostgreSQL 10 we can patch only that
one.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v2-0001-Accept-TEXT-and-CDATA-nodes-in-XMLTABLE-s-column_exp.patch text/plain 14.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-04-26 19:25:22 Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Previous Message Jason Petersen 2018-04-26 19:16:09 Re: Setting rpath on llvmjit.so?