Re: Best practices for moving UTF8 databases

From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Phoenix Kiula *EXTERN*" <phoenix(dot)kiula(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, <pgsql-general(at)postgresql(dot)org>, "Jasen Betts" <jasen(at)xnet(dot)co(dot)nz>
Subject: Re: Best practices for moving UTF8 databases
Date: 2009-07-20 07:34:02
Message-ID: D960CB61B694CF459DCFB4B0128514C203937E48@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Phoenix Kiula wrote:
> Really, PG absolutely needs a way to upgrade the database without so
> much data related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

I'd call it data corruption, not a silly woe.

I know that Oracle for example would not make that much fuss about
your data: they would be imported without even a warning, and
depending on your encoding settings the bad bytes would either be
imported as-is or tacitly changed to inverted (or normal) question
marks.

It's basically a design choice that PostgreSQL made: we think that
an error is preferrable to clandestinely modifying the user's data
or accepting input that cannot possibly make any sense when it is
retrieved at a future time.

> Anyway this is the annoying error I see as always:
>
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
>
> I think my old DB is all utf8. If there are a few characters that are
> not, how can I work with this? I've done everything I can to take care
> of the encoding and such. This code was used to initdb:
>
> initdb --locale=en_US.UTF-8 --encoding=UTF8
>
> Locale environment variables are all "en_US.UTF-8" too.

"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

This does not incur any downtime (you can fix the data in the old
database before migrating), and it will definitely enhance the fun
your users have with your database (if they actually see Euros where
they should be).

Yours,
Laurenz Albe

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Albe Laurenz 2009-07-20 08:15:55 Re: Working around spurious unique constraint errors due to SERIALIZABLE bug
Previous Message Karsten Hilbert 2009-07-20 07:00:02 Re: timestamp with time zone tutorial