Re: Support UTF-8 files with BOM in COPY FROM

From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 05:49:58
Message-ID: 4E816406.1050001@gmx.de
Lists: pgsql-hackers

Tom Lane wrote:
> Note that the reference to byte order betrays the implicit context
> assumption: that we're talking about UTF16 or UTF32 representation.
Note that there is no implicit context assumption in the Unicode FAQ.
It covers UTF-8, UTF-16 and UTF-32 equally.
Another quote:
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?
A: Yes, UTF-8 can contain a BOM. However, it makes /no/ difference as to
the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is /only/ used as a signature --- an indication that an
otherwise unmarked text file is in UTF-8. Note that some recipients of
UTF-8 encoded data do not expect a BOM. Where UTF-8 is used
/transparently/ in 8-bit environments, the use of a BOM will
interfere with any protocol or file format that expects specific ASCII
characters at the beginning, such as the use of "#!" at the beginning
of Unix shell scripts.
>
> BOM is useless in UTF8, no matter what Microsoft thinks. Any tool that
> relies on it to detect UTF8 data has to have a workaround for overriding
> that detection, or it's broken to the point of uselessness.
This kind of brokenness currently exists the other way around (see
my reference to the Perl script I'm using to work around it).
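
For illustration only, a workaround of this kind could be sketched as
below. This is not the Perl script referred to above; it is a minimal
Python filter, and the function name strip_utf8_bom and the chunk size
are assumptions made for this example. It drops a single leading UTF-8
BOM (the bytes EF BB BF) and passes everything else through unchanged.

```python
import codecs
import sys

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def strip_utf8_bom(src, dst, chunk_size=65536):
    """Copy the binary stream src to dst, dropping one leading UTF-8 BOM.

    Hypothetical helper for illustration; src and dst are expected to be
    buffered binary file objects.
    """
    head = src.read(len(UTF8_BOM))
    if head != UTF8_BOM:
        # No signature at the start: keep the bytes we already consumed.
        dst.write(head)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # A BOM sequence later in the stream is left untouched.
        dst.write(chunk)


if __name__ == "__main__":
    strip_utf8_bom(sys.stdin.buffer, sys.stdout.buffer)
```

Used as a filter, the output can be piped into a client that feeds
COPY ... FROM STDIN, so the server never sees the signature bytes.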

Note also that I'm not citing a Microsoft FAQ but the Unicode FAQ.
I'm also not trying to convert Postgres into a Microsoft tool (I'm
pretty happy it isn't), but I'm pointing to existing compatibility issues
on a platform that others have decided to support.
Since I belong to the huge group of users who have little or no choice in
what OS they are using, and I'm from a country where plain ASCII isn't
enough to cover all existing characters, this is probably fair.

It's a pity that the Unicode standard actually allows something that can
cause problems but blaming the non-platform again doesn't solve the
existing issues.

Regards,

Brar
