Re: possible encoding issues with libxml2 functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: possible encoding issues with libxml2 functions
Date: 2017-08-20 08:54:57
Message-ID: CAFj8pRBpUONGfVh-ixD-tje_tfjr0gSi6gjcAYNBcbyTgQttaQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2017-08-20 9:21 GMT+02:00 Noah Misch <noah(at)leadboat(dot)com>:

> On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
> > 2017-08-20 4:17 GMT+02:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
> > > > I am sending some POC - it does support XPATH and XMLTABLE for not
> UTF8
> > > > server encoding.
> > > >
> > > > In this case, all strings should be converted to UTF8 before call
> libXML2
> > > > functions, and result should be converted back from UTF8.
> > >
> > > Adding support for xpath in non-UTF8 databases is a v11 feature
> proposal.
> > > Please start a new thread for this, and add it to the open CommitFest.
> > >
> > > In this thread, would you provide the version of your patch that I
> > > described
> > > in my 2017-08-08 post to this thread? That's a back-patchable bug fix.
> >
> >
> > There are three issues:
> >
> > 1. processing 1byte encoding XMLs documents with encoding declaration -
> > should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is
> very
> > short and safe - can be apply immediately (there is regress tests)
>
> We should fix that problem, yes. encoding_for_xmlCtxtReadMemory.patch is
> not
> the right fix; see below.
>
> > 2 encoding issues in XPath specification (and namespaces) - because
> > multibytes chars are not usually used in tag names, this issue hit
> minimum
> > users.
> >
> > 3. encoding issues in XPath and XMLTABLE results - this is bad issue -
> the
> > function XMLTABLE will not be functional on non UTF8 databases.
> Fortunately
> > - there are less users with this encoding, but probably should be apply
> as
> > fix in 10/11 Postgres.
>
> (2) and (3) are long-documented (since commit 14180f9, 2009-06)
> limitations in
> xpath functions. That's why I would treat an improvement as a new feature
> and
> not back-patch it. It is tempting to fix v10 so XMLTABLE is born without
> this
> limitation, but it is too late in the release cycle.
>

I agree

>
>
> Reviewing encoding_for_xmlCtxtReadMemory.patch:
>
> On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
> > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> NULL, 0);
> > + doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> > +
> pg_encoding_to_char(GetDatabaseEncoding()), 0);
>
> This assumes libxml2 understands every encoding name that
> pg_encoding_to_char() can return, but it does not. xpath-bugfix.patch was
> better. Here are the relevant parts of my review of that patch.
>

I understand to this argument.

>
> On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
> > On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > > One possible fix - and similar technique is used more times in xml.c
> code
> > > .. xmlroot
> >
> > > + /* remove header */
> > > + parse_xml_decl(string, &header_len, NULL, NULL, NULL);
> >
> > > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
> NULL, 0);
> > > + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len,
> len -
>
> > I like this parse_xml_decl() approach better. Would you update
> > your patch to use it and to add the test case I give above?
>
> Again, would you make those two edits?
>

Now, I am not sure so it is correct fix. We will fix case, when server is
in UTF8, but maybe we will break some cases when server is not in UTF8
(although it is broken already).

I am think so correct solution is encoding to UTF8 and passing encoding
parameter. It will works safely in all cases - and we don't need cut
header. We should not to cut header if server encoding is not in UTF8 and
we don't pass encoding as parameter. When we pass encoding as parameter,
then cutting header is not necessary.

What do you think about last patch?

Regards

Pavel

>
> Thanks,
> nm
>

Attachment Content-Type Size
xpath-parsing-error-fix.patch text/x-patch 7.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikolay Shaplov 2017-08-20 12:53:28 Re: pgbench tap tests & minor fixes
Previous Message Noah Misch 2017-08-20 07:21:26 Re: possible encoding issues with libxml2 functions