Re: COPY command character set

From: "Peter Headland" <pheadland(at)actuate(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY command character set
Date: 2009-09-10 18:28:31
Message-ID: 71F491F5DA99604A80DE49424BF3D02B0CD9A27B@exchange8.actuate.com
Lists: pgsql-general

> There are no lead bytes in UTF-8

Sorry, sloppy use of terminology. I should have said "UTF signatures"
aka the "byte order mark". IOW, the "magic number" bytes commonly found
at the front of UTF encoded files:

UTF-16 little-endian    FF FE
UTF-16 big-endian       FE FF
UTF-8                   EF BB BF

These tend to be inserted automatically by text editors, so it would be
advantageous to have them handled automatically by COPY (at least as an
option). Right now, if I edit a UTF-8 file then load it with COPY, I get
errors or bad data if the editor chose to add the 3 signature bytes.
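
As an illustration of the handling described above, here is a minimal
Python sketch that recognizes the three signatures and strips them from
the front of a buffer. It is not part of COPY or of any Postgres code;
the names BOMS and split_bom are made up for this example, and it
deliberately ignores UTF-32, whose little-endian signature also starts
with FF FE.

```python
import codecs

# The three signatures listed above, mapped to Python codec names.
# BOMS and split_bom are hypothetical names used only in this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def split_bom(data: bytes):
    """Return (encoding, payload) with any leading signature removed.

    encoding is None when the data does not start with a known signature,
    in which case the payload is returned unchanged.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data
```

Running a UTF-8 file through split_bom before handing it to COPY would
drop the 3 signature bytes and leave the rest of the data untouched.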

Whilst UTF-16 is not supported internally, COPY seems to be a legitimate
special case, because it is used for migration to/from other tools that
may emit or expect UTF-16. ISTR that Postgres uses ICU? If so it would
be near-trivial to allow COPY to read and write UTF-16. If done via a
syntax extension to COPY (which I think is the most desirable
implementation), this would have no adverse effect on any other
capability. It also seems sufficiently isolated from sensitive/complex
areas of the code that it might make a suitable first project for
someone who is interested in becoming a contributor...
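
Until something like that exists, the conversion can be done outside
the server. The following is only a sketch of a client-side workaround,
assuming the input carries one of the signatures above or is plain
UTF-8; to_utf8 is a hypothetical helper name, not an existing tool.

```python
import codecs


def to_utf8(data: bytes) -> bytes:
    """Re-encode signed UTF-8 or UTF-16 input as UTF-8 without a signature.

    Assumption: input with no signature is already UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The utf-8-sig codec drops the 3 signature bytes while decoding.
        text = data.decode("utf-8-sig")
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The utf-16 codec reads the signature to choose the byte order
        # and does not include it in the decoded text.
        text = data.decode("utf-16")
    else:
        text = data.decode("utf-8")
    return text.encode("utf-8")
```

The output can then be loaded with COPY under a UTF8 client_encoding.
The reverse direction (writing UTF-16 for a tool that expects it) would
be text.encode("utf-16") on the decoded COPY output.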

--
Peter Headland
Architect
Actuate Corporation

-----Original Message-----
From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Thursday, September 10, 2009 11:13
To: Peter Headland
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] COPY command character set

"Peter Headland" <pheadland(at)actuate(dot)com> writes:
> How about my suggestion to add a means (extend COPY syntax) to specify
> encoding explicitly and handle UTF lead bytes - would that be of
> interest?

There are no lead bytes in UTF-8, and we make no pretense of handling
UTF-16, so I don't think we'd be interested in some hack that cleans
up misencoding problems.

The idea of overriding client_encoding has been suggested before. I
don't remember if it was rejected or is just languishing on the TODO
list. I'd be a little worried about sending clients data in an encoding
they aren't expecting, but if it only works for I/O to a file it might
be okay.

regards, tom lane
