Re: Support UTF-8 files with BOM in COPY FROM

From: Brar Piening <brar(at)gmx(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, david(at)kineticode(dot)com, itagaki(dot)takahiro(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Support UTF-8 files with BOM in COPY FROM
Date: 2011-09-26 19:21:14
Message-ID: 4E80D0AA.4080906@gmx.de
Lists: pgsql-hackers

Robert Haas wrote:
> The thing that makes me doubt that is this comment from Tatsuo Ishii:
>
> TI> COPY explicitly specifies the encoding (to be UTF-8 in this case). So
> TI> I think we should not regard U+FEFF as "BOM" in COPY, rather we should
> TI> regard U+FEFF as "ZERO WIDTH NO-BREAK SPACE".
>
> If a BOM is confusable with valid data, then I think recognizing it
> and discarding it unconditionally is no good - you could end up where
> COPY OUT, TRUNCATE, COPY IN changes the table contents.

Citing from the Unicode FAQ again:

Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but
for which it is not known whether they are in big or little endian
format—it can also serve as a hint indicating that the file is in
Unicode, as opposed to in a legacy encoding and furthermore, it acts as a
signature for the specific encoding form used.

I think that the major hint in the answer is "beginning of files".

To handle a BOM correctly, you need to be sure you are in the context of
files that have defined bounds (especially a *beginning*); you can't
properly handle a BOM in arbitrary streams.
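
As a rough illustration of that distinction, here is a minimal Python
sketch (not PostgreSQL code; the function name and the at_file_start flag
are hypothetical). It drops the UTF-8 byte sequence for U+FEFF only when
the caller knows the data begins at the start of a file, and leaves any
other U+FEFF alone as ordinary data:

```python
UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def strip_leading_bom(data: bytes, at_file_start: bool) -> bytes:
    """Drop a UTF-8 BOM only if `data` is known to begin a file.

    `at_file_start` is an assumption supplied by the caller: it must be
    True only when the input has a well-defined beginning.
    """
    if at_file_start and data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


# A file that starts with a BOM and also contains U+FEFF as data.
file_bytes = UTF8_BOM + "a\t\ufeffb\n".encode("utf-8")
cleaned = strip_leading_bom(file_bytes, at_file_start=True)

# The same bytes seen mid-stream, where no beginning is known.
mid_stream = strip_leading_bom(UTF8_BOM + b"x", at_file_start=False)
```

Only the leading signature is removed from the file; the U+FEFF inside
the row and the one seen mid-stream stay untouched.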

Regards,

Brar
