Strange output of XML attribute values

From: Andrew Marynchuk (Андрей Маринчук) <radist(dot)nt(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Strange output of XML attribute values
Date: 2020-09-16 12:09:52
Message-ID: CAJt8d+D3xe6bPJz7W7Acrwtrturpm+VygrnBQe4r0VQGTeYCpQ@mail.gmail.com
Lists: pgsql-bugs

This problem is quite old, but it makes the XML generation functions in
PostgreSQL unusable in some cases, or at least requires the generated XML
to be parsed and regenerated by an external utility. It reproduces in
PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (Windows 10),
and I have seen the same problem in a 9.6 build from the CentOS yum
package.

*How to reproduce*:
Just execute the following query (the xmlelement call alone is enough to
reproduce the problem):
select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице",
xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"),
'ТекстВКириллице'), version '1.0', standalone yes) as text);

*Expected result*:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице
АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>

*Actual result*:
<?xml version="1.0" standalone="yes"?><ЭлементВКириллице
АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>

This example uses Cyrillic letters, but the same happens with any
non-ASCII character. According to the discussion
<https://www.sql.ru/forum/775061/russkiy-yazyk-v-xml?hl=libxml>, this
problem arises because PostgreSQL does not provide libxml2 with the
document encoding, due to the missing xmlTextWriterStartDocument call, so
libxml2 does not know that the encoding is UTF-8 and that non-ASCII
characters could be written without converting them to &#x...; sequences.

In the modern world, UTF-8 encoding is used everywhere, and such
unnecessary character conversion looks strange. My current workaround is to
pass the generated content to a PL/Python function that parses the XML and
writes it back (xml.dom.minidom.parseString(...).toxml()).
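
For illustration, here is a minimal standalone sketch of that workaround in
plain Python, outside the database. The function name unescape_char_refs is
hypothetical and chosen only for this example; the body would presumably be
wrapped in a PL/Python function in practice. The input string imitates the
"Actual result" above by building the &#x...; sequences from the expected
attribute value.

```python
from xml.dom.minidom import parseString


def unescape_char_refs(xml_text: str) -> str:
    """Parse the XML and serialize it again (hypothetical helper name).

    Parsing resolves numeric character references such as &#x417; into the
    characters they stand for, and minidom writes non-ASCII characters
    literally when serializing to a str.
    """
    return parseString(xml_text).toxml()


# Rebuild the escaped attribute value as shown in the "Actual result".
expected_value = "ЗначениеВКириллице"
escaped_value = "".join(f"&#x{ord(ch):X};" for ch in expected_value)

actual_result = (
    '<?xml version="1.0" standalone="yes"?>'
    f'<ЭлементВКириллице АтрибутВКириллице="{escaped_value}">'
    "ТекстВКириллице</ЭлементВКириллице>"
)

normalized = unescape_char_refs(actual_result)
print(normalized)
```

Note that minidom writes its own XML declaration when serializing, so the
declaration in the output may differ from the one PostgreSQL produced (for
example, the standalone attribute is not necessarily kept).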
