Re: Strange output of XML attribute values

From: Andrew Marynchuk (Андрей Маринчук) <radist(dot)nt(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: radist-hack(at)yandex(dot)ru, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Strange output of XML attribute values
Date: 2020-09-16 20:23:03
Message-ID: CAJt8d+DyCPe6RKU+tc0_Yb2B6MxCbE2F1vgrBvyMhRvNhfnREg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

ср, 16 сент. 2020 г. в 15:51, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:

>
>
> st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <
> radist(dot)nt(at)gmail(dot)com> napsal:
>
>> This problem is quite old, but it leads to the inability to use XML
>> generation functions in PostgreSQL database for some cases, or at least
>> requires to perform subsequent parsing and regenerating XML by an external
>> utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build
>> 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from
>> CentOS yum package.
>>
>> *How to reproduce*:
>> Just execute the query (actually the xmlelement call is enough to
>> reproduce the proble):
>> select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице",
>> xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"),
>> 'ТекстВКириллице'), version '1.0', standalone yes) as text);
>>
>> *Expected result*:
>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>> АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>
>>
>> *Actual result*:
>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>> АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>
>>
>> This example uses cyrillic letters, but it could be any non-ASCII
>> character.
>> According to the discussion
>> <https://www.sql.ru/forum/775061/russkiy-yazyk-v-xml?hl=libxml>, this
>> problem arises because PostgreSQL does not provides libxml2 an information
>> of document encoding due to the lack of xmlTextWriterStartDocument call, so
>> libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could
>> be written without converting to &#x...;-sequences.
>>
>
> I don't think it is true. The url encoding is done only in attributes, and
> only when the output encoding will be utf8. When you try to use 8bit
> encoding with Azbuka support, it will be ok.
>

I've made some investigation on libxml2 sources
<https://gitlab.gnome.org/GNOME/libxml2>.
Seems like urlencoding is done in xmlBufAttrSerializeTxtContent function
from xmlsave.c after the condition
<https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlsave.c#L2035>
is met. The key part is (doc == NULL) || (doc->encoding == NULL).
The xmlTextWriterWriteString is the only function calls
xmlBufAttrSerializeTxtContent in xmlwriter.c and only for
XML_TEXTWRITER_ATTRIBUTE state
<https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlwriter.c#L1504>,
so urlencoding is really done only for attributes, but it could be turned
off by providing the xmlwriter with encoding.
But there is no api function except xmlTextWriterStartDocument to modify
the (xmlTextWriterPtr argument) -> doc -> encoding field.

In other words, the libxml2 library does not know the codepage and assumes
it as ASCII, so it turns on the urlencoding of all attributes. I don't have
an idea why all other textual document parts are passed as is.Seems like
PostgreSQL never specifies the codepage (I haven't seen a encoding
declaration in document prolog for documents generated in PostgreSQL). It's
reasonable because charset is not defined for varchar type and xml
converted to varchar is some abstract text which could be converted to any
codepage.

>
>>
>> In the modern world, UTF-8 encoding is used everywhere and such
>> unnecessary character converting looks strange. Current workaround is
>> passing generated content to the pl/python function which parses and writes
>> back the xml (xml.dom.minidom.parseString(...).toxml()).
>>
>
> If I remember some discussion about this topic, the problem is XML
> standard, that requires url encoding in attribute values.
>
> I reported this issue 10 (maybe 15) years ago to libxml2 developers, and
> It was rejected. Maybe libxml2 supports too old XML standards. I don't know
> - this library is years in frozen state, but there is no replacement.
>

I traced the description of attributes from the current specification
through previous versions till the draft version published in 1996, but
didn't find such a requirement. Attribute value defined as '"' ([^<&"] |
Reference <https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)* '"'
| "'" ([^<&'] | Reference
<https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)* "'", so only
three characters are disallowed in attributes.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-09-16 20:35:33 BUG #16621: Unexpected Foreign Key Constraint Violation when Creating New Child Partition
Previous Message Tom Lane 2020-09-16 18:51:38 Re: table partition with inheritance having current_timestamp issue if we miss range table