Re: possible encoding issues with libxml2 functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: possible encoding issues with libxml2 functions
Date: 2017-08-20 06:46:03
Message-ID: CAFj8pRBzncL7_khAwfEQzNqOJp9hjma3-EtVZxVU=VgKSXLFzg@mail.gmail.com
Lists: pgsql-hackers

2017-08-20 4:17 GMT+02:00 Noah Misch <noah(at)leadboat(dot)com>:

> On Fri, Aug 18, 2017 at 11:43:19PM +0200, Pavel Stehule wrote:
> > yes, probably libXML2 try to do check from utf8 encoding to header
> > specified encoding.
>
> Yes. That has been the topic of this thread.
>
> > a) all values created by xml_in interface are in database encoding -
> > input string is stored without any change. xml_parse is called only due
> > validation.
> >
> > b) inside xml_parse, the input is converted to UTF8, and document is read
> > by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
> > by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
> > and removed decl section.
> >
> > So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
> > wellformated_xml) the database encoding is not important
> >
> > c) xml_recv function does validation by xml_parse and translation to
> > database encoding.
> >
> > Now I don't see a difference between @b and @c - so my hypotheses about
> > necessity to use recv interface was wrong.
>
> Yes. You posted, on 2017-04-05, a test case not requiring the recv
> interface.
>
> On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
> > I didn't find any info how to enable libXML2 XPath functions for other
> > encoding than UTF8 :( ??
>
> http://xmlsoft.org/encoding.html is the relevant authority. To summarize,
> we should send only UTF8 to libxml2.
>

libxml2 converts the XML document to UTF8 by itself. Everything else passed
to it should already be in UTF8. I found some references to the
xmlSwitchEncoding function, but I didn't find any examples of its usage, so
probably nobody uses it. The result is always in UTF8.
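
To illustrate the "UTF8 in, UTF8 out" rule, here is a minimal sketch of the
round trip a non-UTF8 server would need. It uses Python's standard-library
ElementTree (expat) parser purely as a stand-in for libxml2, and the names
SERVER_ENCODING and xpath_text_in_server_encoding are hypothetical, chosen
only for this example. It is not the proposed patch.

```python
import xml.etree.ElementTree as ET

SERVER_ENCODING = "iso-8859-2"  # hypothetical non-UTF8 database encoding


def xpath_text_in_server_encoding(doc, path):
    """Evaluate `path` on `doc`; inputs and outputs are bytes in SERVER_ENCODING."""
    # Convert the document from the server encoding to UTF-8 before parsing.
    doc_utf8 = doc.decode(SERVER_ENCODING).encode("utf-8")
    # The path expression has to be converted as well.
    path_str = path.decode(SERVER_ENCODING)
    # Tell the parser explicitly that the bytes are UTF-8.
    root = ET.fromstring(doc_utf8, parser=ET.XMLParser(encoding="utf-8"))
    # Convert every result back to the server encoding.
    return [(el.text or "").encode(SERVER_ENCODING) for el in root.findall(path_str)]


if __name__ == "__main__":
    sample = "<r><a>žluť</a><a>kůň</a></r>".encode(SERVER_ENCODING)
    print(xpath_text_in_server_encoding(sample, b"a"))
```

The point of the sketch is only the ordering: convert inputs first, parse and
evaluate in UTF8, convert results last.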

>
> On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
> > I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
> > server encoding.
> >
> > In this case, all strings should be converted to UTF8 before call libXML2
> > functions, and result should be converted back from UTF8.
>
> Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
> Please start a new thread for this, and add it to the open CommitFest.
>
> In this thread, would you provide the version of your patch that I
> described in my 2017-08-08 post to this thread? That's a back-patchable
> bug fix.

There are three issues:

1. Processing XML documents in single-byte encodings that carry an encoding
declaration. This should be fixed by ecoding_for_xmlCtxtReadMemory.patch.
The patch is very short and safe, and can be applied immediately (it
includes regression tests).

2. Encoding issues in XPath expressions (and namespaces). Because multibyte
characters are not usually used in tag names, this issue affects very few
users.

3. Encoding issues in XPath and XMLTABLE results. This is a bad issue: the
XMLTABLE function will not work on non-UTF8 databases. Fortunately, there
are fewer users with such encodings, but the fix should probably be applied
to Postgres 10/11.
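
The effect behind issue 1 can be reproduced outside PostgreSQL. The sketch
below uses Python's standard-library ElementTree (expat) parser as a
stand-in for libxml2; that substitution is an assumption made only to keep
the example self-contained, and it is neither the patch nor libxml2 API. It
shows what happens when the bytes are already UTF8 but the declaration still
names a single-byte encoding, and how stating the encoding explicitly
avoids the problem.

```python
import xml.etree.ElementTree as ET

# The document still declares a single-byte encoding, but its bytes
# have already been converted to UTF-8.
doc_utf8 = '<?xml version="1.0" encoding="ISO-8859-1"?><a>é</a>'.encode("utf-8")

# Trusting the declaration: each UTF-8 byte is read as one Latin-1 character.
naive = ET.fromstring(doc_utf8).text

# Overriding the declaration with the encoding the bytes really use.
fixed = ET.fromstring(doc_utf8, parser=ET.XMLParser(encoding="utf-8")).text

print(repr(naive), repr(fixed))
```

With the declaration trusted, the two UTF8 bytes of the character come back
as two separate characters; with the explicit encoding, the original
character is preserved.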

> I found some previous experiments
> https://marc.info/?l=pgsql-bugs&m=123407176408688
>
> https://wiki.postgresql.org/wiki/Todo#XML links to other background on this
> feature proposal. See Tom Lane's review of a previous patch. Ensure your
> patch does not have the problems he found during that review. Do that
> before starting a thread for this feature.
>

Good information, thank you. I'll start a new thread for issues @2 and @3.
I'm not sure whether I can prepare a good enough patch for the next
commitfest; later, a committer can decide whether to backpatch it.

Regards

Pavel

>
> Thanks,
> nm
>
