Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution

From: Anders Hermansen <anders(at)yoyo(dot)no>
To: Guillaume Cottenceau <gc(at)mnc(dot)ch>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 possiblesolution
Date: 2005-04-27 14:05:34
Message-ID: 20050427140534.GC582@online.no
Lists: pgsql-jdbc

* Guillaume Cottenceau (gc(at)mnc(dot)ch) wrote:
> Anders Hermansen <anders 'at' yoyo.no> writes:
> > UTF-8 is a byte sequence, so it's not about the first byte in the whole
> > sequence, but about the first byte in a three-byte sequence.
>
> Yes. I forgot that you assumed the machine was big-endian. So the
> UTF-8 character is here probably first byte 0xEF, second byte
> 0x00?
>
> I did my test with first byte 0x00 and second byte 0xEF, hence
> confusion with your initial comment.
>
> My reasoning was that if the first byte of this two-byte
> sequence is 0x00 then the rule that 0xEF is first byte of a
> three-byte sequence doesn't apply, since 0xEF is second byte in
> the sequence.

Endianness is not a problem when working with a sequence of 8-bit bytes,
as in UTF-8. It only becomes a problem when more than one byte represents
a single value. So it is an issue in UTF-16, which is big-endian by
default, I think.
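
To illustrate the point, here is a minimal Java sketch (the class name
EncodingBytesDemo and the hex helper are made up for this example). It
encodes U+00EF with the standard JDK charsets and prints the bytes in
order. The UTF-8 result has a single byte order, while UTF-16 comes in a
big-endian and a little-endian form.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {
    // Render a byte array as upper-case hex, in the order the bytes occur.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00EF"; // LATIN SMALL LETTER I WITH DIAERESIS

        // UTF-8 works on single bytes, so there is only one possible order.
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
        // UTF-16 works on 16-bit units, so the two bytes can go either way.
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

This prints C3AF for UTF-8, 00EF for UTF-16BE and EF00 for UTF-16LE.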

So I interpreted the message "ERROR: could not convert UTF-8 character 0x00ef
to ISO8859-1" as a byte sequence with 0x00 first, and then 0xef. Maybe that's
a wrong assumption.

> > There should be no nul (0) bytes when encoding UTF-8. I believe
> > this is in the specification to allow it to be compatible with
> > C nul-terminated strings.
> >
> > I believe that the byte sequence 0x00EF is illegal UTF-8 because:
> > 1) It contains nul (0x00) byte
> > 2) 0xEF is not followed by two more bytes
> >
> > On the other hand U+00EF is a valid unicode code point. Which points to:
>
> I think this is assumed little-endian, e.g. first byte 0x00 and
> second byte 0xEF (especially because UTF-8 is just a series of
> bytes without any endianness aspects, so it makes good sense to
> actually read this left-to-right, e.g. byte 0x00 first).

As I said above, endianness is not an issue for UTF-8. The byte _sequence_ is
always read from start to end.
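
Reading from start to end works because each lead byte announces how
long its sequence is. A rough Java sketch of that rule follows (the class
and method names are invented for illustration, and it only checks bit
patterns, not full UTF-8 validity):

```java
class Utf8LeadByteDemo {
    // Length of the UTF-8 sequence announced by a lead byte, or -1 if the
    // byte cannot start a sequence. Bit-pattern check only: it does not
    // reject overlong leads (0xC0, 0xC1) or leads above 0xF4.
    static int sequenceLength(int leadByte) {
        int b = leadByte & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: single byte (ASCII)
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx continuation, or 11111xxx
    }

    public static void main(String[] args) {
        int[] samples = {0x00, 0xEF, 0xC3, 0xAF};
        for (int b : samples) {
            System.out.printf("0x%02X -> %d%n", b, sequenceLength(b));
        }
    }
}
```

With these inputs, 0x00 gives 1, 0xEF gives 3, 0xC3 gives 2, and 0xAF
gives -1 because it is a continuation byte. So in the byte sequence
0x00 0xEF, the 0x00 is complete on its own and the 0xEF starts a
three-byte sequence that is never finished.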

> > LATIN SMALL LETTER I WITH DIAERESIS
> > It is encoded as 0xC3AF in UTF-8
> > As 0x00EF in UTF-16 (and UCS-2 ?)
>
> Yes to "and UCS-2". Two-byte sequences in UCS-2 and UTF-16 are
> the same[1].

Yes.

> > As 0xEF in ISO-8859-1
>
> Hum I think I may understand what's going on here. It's possible
> that in the message:
>
> ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1
>
> when they say "0x00ef" they don't talk about UTF-8 per se but
> they use the unicode representation (which is error prone).

If 0x00ef refers to a Unicode code point, it should not have been a problem
to convert it to ISO-8859-1 (0xef).

If 0x00ef refers to a byte sequence, then the error message is a bit
misleading, because it's not a character but a byte sequence. And the error
is in decoding the UTF-8, not in encoding the ISO-8859-1.
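
The two readings can be tried out on the Java side. This is only a
sketch using the JDK's own charset classes; it does not reproduce the
server's conversion code, and the class and method names are made up for
the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class TwoReadingsDemo {
    // Strictly decode bytes as UTF-8. Returns null when the input is not
    // well-formed, instead of silently substituting replacement characters.
    static String decodeUtf8Strict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Reading 1: 0x00ef is the code point U+00EF. Encoding it as
        // ISO-8859-1 gives the single byte 0xEF.
        byte[] latin1 = "\u00EF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.printf("U+00EF as ISO-8859-1: %02X%n", latin1[0] & 0xFF);

        // Reading 2: 0x00ef is the byte sequence 0x00 0xEF. Decoding fails
        // because 0xEF announces a three-byte sequence that never arrives.
        byte[] raw = {0x00, (byte) 0xEF};
        System.out.println("0x00 0xEF decodes as UTF-8: "
                + (decodeUtf8Strict(raw) != null));

        // For comparison, the real UTF-8 encoding of U+00EF decodes fine.
        byte[] good = {(byte) 0xC3, (byte) 0xAF};
        System.out.println("0xC3 0xAF decodes as UTF-8: "
                + (decodeUtf8Strict(good) != null));
    }
}
```

Note that in the JDK decoder the rejection in the second reading comes
from the dangling 0xEF. The 0x00 byte by itself decodes to U+0000.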

Anders Hermansen
