Re: COPY command use UTF-8 encoding and NOT

From: Tino Wildenhain <tino(at)wildenhain(dot)de>
To: David Gagnon <dgagnon(at)siunik(dot)com>
Cc: Postgresql-General <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY command use UTF-8 encoding and NOT
Date: 2005-04-06 22:40:42
Message-ID: 1112827242.1387.14.camel@Andrea.peacock.de
Lists: pgsql-general

On Wednesday, 2005-04-06 at 18:12 -0400, David Gagnon wrote:
> Hi all,
>
> I ran into this problem and want to share it and get a confirmation.
>
> I tried to use the COPY command to load bulk data. I crafted a
> UNICODE file from an MSSQL db myself, but I can't load it into
> PostgreSQL. I always get the error:
> CONTEXT: COPY vd, line 1, column vdnum: "ÿþ1"
>
> The problem is that both files look exactly the same... I found that
> pg_dump in fact creates a UTF-8 (confirm please) file, which is
> UNICODE but with a variable-length encoding (i.e. some characters use
> 8 bits and others 16 bits ...). See for details:
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. The file I crafted
> is a true UNICODE (16-bit, or UCS-2) file (confirm please).
>
> So here is the content of the file:
> UTF-8 (Postgresql dump):
> 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> UNICODE (crafted from mssql)
> 1 1 1 AC COLUMNÿACNUMÿACDESCÿACDELPAIÿ
>
> HEX representation UTF-8 (Postgresql dump):
>
> 00000000: 31 09 31 09 31 09 41 43 09 43 4f 4c 55 4d 4e c3  1.1.1.AC.COLUMNÃ
> 00000010: bf 41 43 4e 55 4d c3 bf 41 43 44 45 53 43 c3 bf  ¿ACNUMÿACDESCÿ
> 00000020: 41 43 44 45 4c 50 41 49 c3 bf  ACDELPAIÿ
>
> HEX representation UNICODE (crafted from mssql)
> 00000000: ff fe 31 00 09 00 31 00 09 00 31 00 09 00 41 00  ÿþ1...1...1...A.
> 00000010: 43 00 09 00 43 00 4f 00 4c 00 55 00 4d 00 4e 00  C...C.O.L.U.M.N.
> 00000020: ff 00 41 00 43 00 4e 00 55 00 4d 00 ff 00 41 00  ÿ.A.C.N.U.M.ÿ.A.
> 00000030: 43 00 44 00 45 00 53 00 43 00 ff 00 41 00 43 00  C.D.E.S.C.ÿ.A.C.
> 00000040: 44 00 45 00 4c 00 50 00 41 00 49 00 ff 00  D.E.L.P.A.I.ÿ.
>
> So PostgreSQL chokes on the FF FE that starts the UNICODE document.
> Is it normal for a UNICODE file to start with FF FE?! Note that I
> tried to delete those characters, but they aren't visible...
>
> So am I right? Does PostgreSQL use UTF-8 and not really understand
> a UNICODE (UCS-2) file? Is there a way to make the COPY command work
> with a UNICODE UCS-2 encoding?

Yes, in postgres "Unicode" means UTF-8.
Windows programs that store Unicode text usually prepend a BOM
(byte order mark) to the file (just google for unicode and BOM).
So you would need something to convert the file. For example, I know
Python can read the BOM, so it is just a matter of opening the file,
reading it, encoding it into UTF-8, and writing it out again.
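
The conversion described above could be sketched roughly as follows in
Python 3. This is only a minimal sketch, not a tested migration tool:
the function name utf16_to_utf8 and the file names in the usage note
are made up for illustration, and it assumes the whole file fits in
memory. FF FE is the UTF-16 little-endian byte order mark; the "utf-16"
codec reads the BOM, picks the byte order, and strips the BOM while
decoding.

```python
import codecs


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a BOM-prefixed UTF-16 file as UTF-8 without a BOM.

    Hypothetical helper for illustration; reads the whole file at once.
    """
    with open(src_path, "rb") as src:
        raw = src.read()

    # FF FE marks little-endian UTF-16, FE FF marks big-endian.
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("no UTF-16 byte order mark found in %r" % src_path)

    # The "utf-16" codec consumes the BOM and uses it to pick the byte order.
    text = raw.decode("utf-16")

    # Write bytes directly so tabs and line endings pass through unchanged.
    with open(dst_path, "wb") as dst:
        dst.write(text.encode("utf-8"))
```

Something like utf16_to_utf8("vd_utf16.txt", "vd_utf8.txt") (both file
names are placeholders) should then produce a file that no longer
starts with FF FE, which can be passed to COPY in a UTF-8 database.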

Regards
Tino Wildenhain
