BUG #2010: COPY command does not recognise UTF-8 text files with leading BOM

From:	"Roddi Walker" <roddiwalker(at)yahoo(dot)com>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	BUG #2010: COPY command does not recognise UTF-8 text files with leading BOM
Date:	2005-10-31 02:34:00
Message-ID:	20051031023400.9563CF0BAB@svr2.postgresql.org
Lists:	pgsql-bugs

The following bug has been logged online:

Bug reference: 2010
Logged by: Roddi Walker
Email address: roddiwalker(at)yahoo(dot)com
PostgreSQL version: 8.1 beta 4
Operating system: Win 2000 Professional
Description: COPY command does not recognise UTF-8 text files with
leading BOM
Details:

1) Created a UTF-8 database "foo", with a table "bar":
CREATE TABLE bar ( mycol text );

2) Used Notepad to create a UTF-8 text file "bar.txt" containing just the
word "fred".
When writing a UTF-8 file, Notepad writes a 3-byte Byte Order Mark (BOM)
header of hex EF BB BF.
So the file's 7 hex bytes were:
EF BB BF 66 72 65 64.

This BOM header is legal - see http://www.unicode.org/faq/utf_bom.html#BOM -
but probably used only on Windows.

3) In psql, populated table "bar" from file "bar.txt" using:
copy bar from 'c:\\bar.txt';

4) THE BUG: PostgreSQL doesn't recognise the EF BB BF bytes as a BOM header
and skip them.
Instead it treats the 3 bytes as a Unicode character, which pgAdminIII
renders as a hollow square when the table data is viewed.
That is, the table data is rendered as "[]fred" (where "[]" is the hollow box).

5) SUGGESTED SOLUTION: I'm not a Unicode expert, so I don't know if the BOM
can be safely skipped in all cases (although it probably can for UTF-8 text
files).
But at the very least, a COPY option such as SKIPBOM (or some-such) would help.
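
Until COPY handles this itself, one possible workaround is to remove the BOM
from the file before loading it. The following is only a minimal sketch in
Python, not part of the original report; the helper name strip_utf8_bom is
hypothetical, and the sample bytes are the seven bytes listed in step 2.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# The 7 bytes Notepad wrote for "fred", as listed in the report.
raw = bytes.fromhex("EF BB BF 66 72 65 64")

# Decoded as plain UTF-8, the BOM survives as the character U+FEFF
# in front of the text, which matches the stray character seen in step 4.
with_bom = raw.decode("utf-8")

# With the BOM removed first, only the four letters remain.
cleaned = strip_utf8_bom(raw).decode("utf-8")
```

Files without a BOM pass through unchanged, so the helper should be safe to
apply to every input file. Python's built-in "utf-8-sig" codec performs the
same stripping when decoding, if a separate helper is not wanted.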

Responses

Re: BUG #2010: COPY command does not recognise UTF-8 text files with leading BOM at 2005-11-02 01:22:46 from Alvaro Herrera
