Re: possible encoding issues with libxml2 functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: possible encoding issues with libxml2 functions
Date: 2017-04-05 20:53:39
Message-ID: CAFj8pRAbR5bjBD6WakseDt265WT7E+G=YXndGbk9CTYZvX9peQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2017-03-17 4:23 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:

> On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> > 2017-03-12 21:57 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > Please add a test case.
>

I am sending test case.

I am afraid so this cannot be part of regress tests due strong dependency
on locale win1250.

> >
> > It needs a application - currently there is not possibility to import XML
> > document via recv API :(
>
> I think xml_in() can create every value that xml_recv() can create;
> xml_recv()
> is just more convenient given diverse source encodings. If you make your
> application store the value into a table, does "pg_dump --inserts" emit
> code
> that reproduces the same value? If so, you can use that in your test case.
> If not, please provide precise instructions (code, SQL commands) for
> reproducing the bug manually.
>

I was wrong - both methods produces broken internal XML document

<?xml version="1.0" encoding="windows-1250"?>
<enprimeur>
<vino>
<id>1</id>
<nazev>Alter Ego de Palmer</nazev>
<vyrobce>63</vyrobce>
<rocnik>2012</rocnik>
<cena0375>0</cena0375>
<cena1500>0</cena1500>
<cena3000>0</cena3000>
<cena6000>0</cena6000>
<cena0750>1425</cena0750>
<cenastart>1085</cenastart>
<min0375>0</min0375>
<min0750>0</min0750>
<odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 %
Petit Verdot</odrudy>
<bestin>2017 - 2026</bestin>
<klas>2</klas>
<sklad0375>0</sklad0375>
<sklad0750>0</sklad0750>
<sklad1500>0</sklad1500>
<sklad3000>0</sklad3000>
<sklad6000>0</sklad6000>
<alk>13,4 %</alk>
<remark>Premiant oblasti Margaux Ch. Palmer
tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do
svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de
Palmer 2012 se může p
ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z
vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno,
nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá,
pevná a svěží, taniny
vykazují fenomenální strukturu, takto krásné spojení všech aromatických a
chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve
sklepě.</remark>
<rating>Robert Parker: /100
TOPVINO SCORE: 92-94/100
James Suckling: 92-93/100
Wine Spectator: 90-93/100</rating>
<zalozeno></zalozeno>
<rozloha></rozloha>
<stari></stari>
<puda></puda>
<produkce></produkce>
<zrani></zrani>
<active>1</active>
<stitky>
<stitek>8</stitek>
<stitek>1</stitek>
</stitky>
</vino>
</enprimeur>

Document is well encoded from win1250 to UTF8 - it is well readable, but
the header is wrong - it is not in win1250 now (after encoding). This xml
is attached (in original form (win1250 encoding)).

Import with

create table target(x xml);
\set xml `cat ~/Downloads/e6.xml`
postgres=# set client_encoding to win1250;
postgres=# insert into target values(:'xml');

disconnect or reset client encoding

Document should be correctly inserted.

Then run a query

postgres=# select * from target, xpath('/enprimeur/vino', x);
ERROR: could not parse XML document
DETAIL: input conversion failed due to input error, bytes 0x88 0x73 0x6B
0x75
input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
encoder error
line 25: Premature end of data in tag remark line 25
line 25: Premature end of data in tag vino line 3
line 25: Premature end of data in tag enprimeur line 2

select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns
id int, nazev text, remark text);

XMLTABLE is not vulnerable against this bug and result should be correct.

when you do

select x from target

you can see a correct xml without header

but select x::text from target; shows wrong header

>
> > > Why not use xml_parse() instead of calling xmlCtxtReadMemory()
> directly?
> > > The
> > > answer is probably in the archives, because someone understood the
> problem
> > > enough to document "Some XML-related functions may not work at all on
> > > non-ASCII data when the server encoding is not UTF-8. This is known to
> be
> > > an
> > > issue for xpath() in particular."
> >
> >
> > Probably there are two possible issues
>
> Would you research in the archives to confirm?
>
> > 1. what I touched - recv function does encoding to database encoding -
> but
> > document encoding is not updated.
>
> Using xml_parse() would fix that, right?
>

It should to help, because it cuts a header - but it does little bit more
work, than we would - it check if xml is doc or not too.

In mostly functions there the important work is done in parse_xml_decl
function

One possible fix - and similar technique is used more times in xml.c code
.. xmlroot

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 2f87151..45faf83 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
Datum *ns_names_uris;
bool *ns_names_uris_nulls;
int ns_count;
+ size_t header_len;

/*
* Namespace mappings are passed as text[]. If an empty array is passed
@@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
errmsg("empty XPath expression")));

string = pg_xmlCharStrndup(datastr, len);
+
+ /* remove header */
+ parse_xml_decl(string, &header_len, NULL, NULL, NULL);
+
xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);

xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
header_len, NULL, NULL, 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");

another possibility is using xml_out_internal - that is used in XMLTABLE
function - what was initial fix.

xml_out_internal uses parse_xml_decl too, but it is little bit more
expensive due call print_xml_decl that has not any effect in this case
(where only encoding is interesting) - but it can has sense, when server
encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 -
so now, I am think the correct solution should be using xml_out_internal -
because it appends a header with server encoding to XML doc

Regards

Pavel

>
> > 2. there are not possibility to encode from document encoding to database
> > encoding.
>
> Both xml_in() and xml_recv() require the value to be representable in the
> database encoding, so I don't think this particular problem can remain by
> the
> time we reach an xpath_internal() call.
>

Attachment Content-Type Size
e6.xml text/xml 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-04-05 21:22:34 Re: PoC plpgsql - possibility to force custom or generic plan
Previous Message Andres Freund 2017-04-05 20:51:51 Re: pg_stat_wal_write statistics view