Re: possible encoding issues with libxml2 functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: possible encoding issues with libxml2 functions
Date: 2017-03-13 06:15:06
Message-ID: CAFj8pRCA0GWCLdwsPMFS2tgw+Fy3WQkSFzC7T-5xj+5hcyQs_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2017-03-12 22:26 GMT+01:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:

>
>
> 2017-03-12 21:57 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
>
>> On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
>> > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
>> > > On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
>> > > > There are possible two fixes
>> > > >
>> > > > a) clean decl on input - the encoding info can be removed from decl
>> part
>> > > >
>> > > > b) use xml_out_internal everywhere before transformation to
>> > > > xmlChar. pg_xmlCharStrndup can be good candidate.
>> > >
>> > > I'd prefer (a) if the xml type were a new feature, because no good can
>> > > come of
>> > > storing an encoding in each xml field when we know the actual
>> encoding is
>> > > the
>> > > database encoding. However, if you implemented (a), we'd still see
>> > > untreated
>> > > values brought over via pg_upgrade. Therefore, I would try (b)
>> first. I
>> > > suspect the intent of xml_parse() was to implement (b); it will be
>> > > interesting
>> > > to see your test case that malfunctions.
>> > >
>> >
>> > I looked there again and I found so this issue is related to xpath
>> function
>> > only
>> >
>> > Functions based on xml_parse are working without problems.
>> xpath_internal
>> > uses own direct xmlCtxtReadMemory without correct encoding sanitation.
>> >
>> > so fix is pretty simple
>>
>> Please add a test case.
>>
>
> It needs a application - currently there is not possibility to import XML
> document via recv API :(
>
> I wrote a pgimportdoc utility, but it is not part of core
>
>
>>
>> > --- a/src/backend/utils/adt/xml.c
>> > +++ b/src/backend/utils/adt/xml.c
>> > @@ -3874,9 +3874,11 @@ xpath_internal(text *xpath_expr_text, xmltype
>> *data,
>> > ArrayType *namespaces,
>> > ns_count = 0;
>> > }
>> >
>> > - datastr = VARDATA(data);
>> > - len = VARSIZE(data) - VARHDRSZ;
>> > + datastr = xml_out_internal(data, 0);
>>
>> Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
>> The
>> answer is probably in the archives, because someone understood the problem
>> enough to document "Some XML-related functions may not work at all on
>> non-ASCII data when the server encoding is not UTF-8. This is known to be
>> an
>> issue for xpath() in particular."
>
>
> Probably there are two possible issues
>
> 1. what I touched - recv function does encoding to database encoding - but
> document encoding is not updated.
>
> 2. there are not possibility to encode from document encoding to database
> encoding.
>
>
>> > + len = strlen(datastr);
>> > +
>> > xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
>> > +
>>
>> The two lines of functional change don't create a cause for more
>> newlines, so
>> don't add these two newlines.
>>
>
> ok
>
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 1908b13db5..2786d5b1cb 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3874,8 +3874,8 @@ xpath_internal(text *xpath_expr_text, xmltype
*data, ArrayType *namespaces,
ns_count = 0;
}

- datastr = VARDATA(data);
- len = VARSIZE(data) - VARHDRSZ;
+ datastr = xml_out_internal(data, 0);
+ len = strlen(datastr);
xpath_len = VARSIZE_ANY_EXHDR(xpath_expr_text);
if (xpath_len == 0)
ereport(ERROR,

Attachment Content-Type Size
xpath-bugfix.patch text/x-patch 511 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nico Williams 2017-03-13 06:16:51 Re: SQL/JSON in PostgreSQL
Previous Message Nico Williams 2017-03-13 06:14:03 Re: SQL/JSON in PostgreSQL