Re: XPATH vs. server_encoding != UTF-8

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Subject: Re: XPATH vs. server_encoding != UTF-8
Date: 2011-07-23 16:46:32
Message-ID: 351EED15-E764-4BC1-AC8D-76FFF7E0EC27@phlo.org
Lists: pgsql-hackers

[Resent with pgsql-hackers re-added to the recipient list.
I presume you didn't remove it on purpose]

On Jul23, 2011, at 18:11 , Joey Adams wrote:
> On Sat, Jul 23, 2011 at 11:49 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> So what I think we should do is tell libxml that the encoding is ASCII
>> if the server encoding isn't UTF-8. With that change, the query above
>> produces
>
> I haven't had time to digest this situation, but there is a function
> called pg_encoding_to_char for getting a string representation of the
> encoding. However, it might not produce a string that libxml
> understands in all cases.
>
> Would it be better to tell libxml the server encoding, whatever it may be?

Ultimately, yes. However, I figured if it was as easy as translating our
encoding names to those of libxml, the current code would probably do that
instead of converting the XML to UTF-8 before validating it.
(Validation and XPATH processing use a different code path there!)
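
To illustrate why a plain name translation is not quite trivial, here is a
minimal sketch in Python. The table is an assumption made up for this example
(it is not taken from the PostgreSQL or libxml sources), and Python's
codecs.lookup() merely stands in for whatever lookup libxml would perform.
The point is only that some of our encodings, such as SQL_ASCII, have no
counterpart to translate to.

```python
import codecs

# Hypothetical translation table from PostgreSQL encoding names to IANA-style
# names. Illustrative only; a real table would have to cover every encoding.
PG_TO_IANA = {
    "UTF8": "UTF-8",
    "LATIN1": "ISO-8859-1",
    "WIN1252": "windows-1252",
    "EUC_JP": "EUC-JP",
    "KOI8R": "KOI8-R",
    "SQL_ASCII": None,  # "no particular encoding", so nothing to translate to
}

def to_iana(pg_name):
    """Return the mapped name, or None if there is no usable equivalent."""
    return PG_TO_IANA.get(pg_name)

def resolvable(pg_name):
    """True if the mapped name is known to the XML-side stand-in (codecs)."""
    name = to_iana(pg_name)
    if name is None:
        return False
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False
```

Any encoding for which resolvable() is false would still need a fallback,
for example the convert-to-UTF-8 path that validation already uses.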

I'm also not aware of any actual complaints about XPATH's restriction
to UTF-8, and it's not a case that I personally care for, so I'm
a bit hesitant to put in the time and energy required to extend it to
other encodings.

But once I had stumbled over this, I didn't want to ignore it altogether,
so I looked for a simple way to make the current behaviour more bullet-proof.
The patch accomplishes that, I think, and without any major change in
behaviour. You only observe the difference if you indeed have non-UTF-8
XMLs which look like valid UTF-8.
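
A small sketch of such a case, in Python rather than C for brevity. The
helper looks_like_utf8 is a name invented for this example. It shows an
ISO-8859-1 byte sequence that also happens to be valid UTF-8, and one that
does not.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The two ISO-8859-1 characters U+00C3 U+00A4 encode to the bytes C3 A4,
# which are also the UTF-8 encoding of the single character U+00E4.
raw = "\u00c3\u00a4".encode("iso-8859-1")
ambiguous = looks_like_utf8(raw)

# U+00E4 on its own is the single byte E4 in ISO-8859-1, which is not a
# complete UTF-8 sequence, so it cannot be mistaken for UTF-8.
unambiguous = looks_like_utf8("\u00e4".encode("iso-8859-1"))

# Neither byte string is pure ASCII, so a parser told "this is ASCII"
# would reject both instead of silently misreading the first one.
both_non_ascii = not raw.isascii() and not b"\xe4".isascii()
```

Only inputs like the first one behave differently once the parser is told
the data is ASCII instead of being left to assume UTF-8.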

> In the JSON encoding discussion, the last idea (the one I was planning
> to go with) was to allow non-ASCII characters in any server encoding
> (like ä in ISO-8859-1), but not allow non-ASCII escapes (like \u00E4)
> unless the server encoding is UTF-8.

Yeah, that's how I understood your proposal, and it seems sensible.
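
As I read it, the rule could be sketched roughly like this (Python, purely
illustrative; the function names are made up, and the input is assumed to
be the already well-formed body of a JSON string literal):

```python
def has_non_ascii_escape(body: str) -> bool:
    """True if body contains a \\uXXXX escape for a code point above 0x7F.

    Assumes body is the text between the quotes of a syntactically valid
    JSON string, so every \\u is followed by four hex digits.
    """
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            if body[i + 1] == "u":
                if int(body[i + 2:i + 6], 16) > 0x7F:
                    return True
                i += 6  # skip the backslash, 'u' and four hex digits
            else:
                i += 2  # skip other escapes, e.g. an escaped backslash
        else:
            i += 1
    return False

def json_string_allowed(body: str, server_encoding: str) -> bool:
    """Literal characters are always fine; non-ASCII escapes need UTF-8."""
    if server_encoding.upper() in ("UTF8", "UTF-8"):
        return True
    return not has_non_ascii_escape(body)
```

So a literal ä passes under LATIN1, \u00E4 passes only under UTF8, and an
ASCII escape such as \u0041 passes everywhere.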

> I think your patch would more
> closely match the opposite: allow any escapes, but only allow ASCII
> text if the server encoding is not UTF-8.

Yeah, but only for XPATH(). XML input validation uses a different
code path, and seems to convert the XML to UTF-8 before verifying
its well-formedness with libxml (as you discovered previously).

The difference between JSON and XML here is that the XML type has to
live with libxml's idiosyncrasies and restrictions. If we could make
libxml use our encoding and text handling infrastructure, the UTF-8
restrictions would probably not exist. But as it stands, libxml has
its own machinery for dealing with encodings...

I wonder, BTW, what happens if you attempt to store an XML containing a
character not representable in Unicode. If the conversion to UTF-8 simply
replaces it with a placeholder, we'd be fine, since just a replacement
cannot affect the well-formedness of an XML. If OTOH it raised an error,
that'd be a bit unfortunate...
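
The two outcomes can be illustrated with a rough analogue in Python. This
says nothing about what our conversion routines actually do. It uses a byte
(0x81) that Python's cp1252 codec leaves unmapped as a stand-in for "a
character with no Unicode equivalent".

```python
import xml.etree.ElementTree as ET

# An XML fragment in a single-byte encoding, containing a byte that
# Python's cp1252 codec has no Unicode mapping for.
raw = b"<a>\x81</a>"

# Outcome 1: the conversion raises an error, so the value cannot be stored.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Outcome 2: the conversion substitutes a placeholder (U+FFFD here).
replaced = raw.decode("cp1252", errors="replace")

# The placeholder is an ordinary character, so the document still parses.
root = ET.fromstring(replaced)
```

In the replacement case the markup is untouched and only the character data
changes, which is why well-formedness is preserved.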

best regards,
Florian Pflug
