Re: Support UTF-8 files with BOM in COPY FROM

From: Brar Piening <brar(at)gmx(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-27 05:49:58
Message-ID: 4E816406.1050001@gmx.de
Lists: pgsql-hackers
Tom Lane wrote:
> Note that the reference to byte order betrays the implicit context
> assumption: that we're talking about UTF16 or UTF32 representation.
Note that there is no implicit context assumption in the Unicode FAQ.
It covers UTF-8, UTF-16 and UTF-32 equally.
Another quote:
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If 
yes, then can I still assume the remaining UTF-8 bytes are in big-endian 
order?
A: Yes, UTF-8 can contain a BOM. However, it makes /no/ difference as to 
the endianness of the byte stream. UTF-8 always has the same byte order. 
An initial BOM is /only/ used as a signature --- an indication that an 
otherwise unmarked text file is in UTF-8. Note that some recipients of 
UTF-8 encoded data do not expect a BOM. Where UTF-8 is used
/transparently/ in 8-bit environments, the use of a BOM will
interfere with any protocol or file format that expects specific ASCII
characters at the beginning, such as the use of "#!" at the beginning
of Unix shell scripts.
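
To make the FAQ's point concrete, here is a small Python sketch (my own illustration, not taken from the FAQ) that encodes U+FEFF in each form. The UTF-16 bytes depend on the byte order, while the UTF-8 bytes are always EF BB BF, so in UTF-8 they can only act as a signature. The helper name has_utf8_signature is hypothetical.

```python
import codecs

BOM = "\ufeff"  # the code point used as byte order mark / signature

# The same code point in three encodings.
utf8 = BOM.encode("utf-8")
utf16_le = BOM.encode("utf-16-le")
utf16_be = BOM.encode("utf-16-be")

# UTF-16 output differs with byte order; UTF-8 has only one possible form.
print(utf8.hex(), utf16_le.hex(), utf16_be.hex())


def has_utf8_signature(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)
```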
>
> BOM is useless in UTF8, no matter what Microsoft thinks.  Any tool that
> relies on it to detect UTF8 data has to have a workaround for overriding
> that detection, or it's broken to the point of uselessness.
This kind of brokenness currently exists the other way around (see
my reference to the Perl script I'm using to work around it).
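
For readers who have not seen such a workaround, the following is a minimal Python sketch of the idea. It is not the Perl script mentioned above; the function name strip_utf8_bom and the chunk size are assumptions made for illustration. It copies a byte stream and drops a leading UTF-8 BOM if one is present, so the result can be fed to COPY FROM.

```python
import codecs
import io


def strip_utf8_bom(src, dst, chunk_size=65536):
    """Copy binary stream src to dst, skipping a leading UTF-8 BOM.

    Assumes src is a buffered binary stream (a regular file, io.BytesIO,
    or sys.stdin.buffer), so the first read returns fewer than 3 bytes
    only at end of input.
    """
    head = src.read(len(codecs.BOM_UTF8))
    if head != codecs.BOM_UTF8:
        dst.write(head)  # no signature: these bytes are ordinary data
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)


if __name__ == "__main__":
    # Small self-check with in-memory streams.
    out = io.BytesIO()
    strip_utf8_bom(io.BytesIO(codecs.BOM_UTF8 + b"1,foo\n2,bar\n"), out)
    print(out.getvalue())
```

Only the very first three bytes are inspected; a U+FEFF later in the data is left untouched.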

Note also that I'm not citing a Microsoft FAQ but the Unicode FAQ.
I'm also not trying to convert Postgres into a Microsoft tool (I'm
pretty happy it isn't), but I'm pointing to existing compatibility issues
on a platform that others have decided to support.
As one of the many users who have little or no choice in what OS they
use, and coming from a country where plain ASCII isn't enough to cover
all existing characters, I think this is probably fair.

It's a pity that the Unicode standard actually allows something that can
cause problems, but blaming the non-platform again doesn't solve the
existing issues.

Regards,

Brar
