Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

From: Meetesh Karia <meetesh(dot)karia(at)gmail(dot)com>
To: meetesh(dot)karia(at)alumni(dot)duke(dot)edu
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-admin(at)postgresql(dot)org
Subject: Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows
Date: 2008-03-13 12:01:59
Message-ID: 47D917B7.3050106@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-hackers

Thanks to help from Sergey Benner, I was able to narrow the problem
down. When I first initialized the PG cluster, I used the following
command:

initdb -D d:/postgres/8.3/data -E UTF8 -U postgres

And that set my LC_COLLATE and LC_CTYPE values to 'English_United
States.1252'. 'en_US-UTF-8' would be ideal for me but that doesn't seem
to exist on Windows.

Anyhow, when I recreated the PG cluster using this command:

initdb -D ./data -E UTF8 --locale=C -U postgres

It now correctly handles the values and allows me to create the unique
index. I can use the C locale, but I'm a bit surprised that the default
windows locale wouldn't work.

Is that to be expected?

Thanks,
Meetesh

Meetesh Karia wrote:
> Thanks for your response Robert.
>
> Unfortunately I don't think that will work (unless I'm
> misunderstanding something).
>
> I can't export using a client encoding of WIN1250 because I have true
> UNICODE chars which can't be represented with it (I've tried and I get
> the error below). Therefore, I have a sql dump with UNICODE chars and
> setting the client encoding to WIN1250 when I run the sql dump script
> would toast the characters.
>
> Additionally, here's what I get when I run your test below (my server
> encoding is UTF-8):
>
> ltefull=# create table x (r varchar(255) unique);
> NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key"
> for table "x"
> CREATE TABLE
> ltefull=#
> ltefull=# set client_encoding=WIN1250;
> SET
> ltefull=# insert into x (r) values ('Daniel Brühl');
> INSERT 0 1
> ltefull=#
> ltefull=# insert into x (r) values ('Daniel Bruehl');
> ERROR: duplicate key value violates unique constraint "x_r_key"
>
> Meetesh
>
> Robert Treat wrote:
>> On Wednesday 12 March 2008 09:37, Meetesh Karia wrote:
>>
>>> One quick addition to this:
>>>
>>> The column I'm creating this unique index on is a varchar(255) and the
>>> command I was running was:
>>>
>>> create unique index foo_name on foo (name);
>>>
>>> If I use the following, it now works:
>>>
>>> create unique index foo_name on foo (cast(name as bytea));
>>>
>>> Thoughts?
>>>
>>> Meetesh
>>>
>>> Meetesh Karia wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
>>>> 8.3.0 and I've been trying to solve what appears to be an encoding
>>>> problem. My old db was in the UNICODE encoding. I know that this
>>>> isn't supported on 8.0.x, but it was a restore of a db from a Linux
>>>> environment and postgres didn't appear to have any problems with it.
>>>>
>>>> My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
>>>> tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
>>>> to restore the db, I got an error during index creation which wouldn't
>>>> let me create a unique index on a column that had all unique values
>>>> (it had the index in 8.0 and a group by having query with no indexes
>>>> on the table confirms uniqueness). The thing that this column does
>>>> have however is values like:
>>>>
>>>> 'Bruehl'
>>>> 'Brühl'
>>>>
>>>> I created a blank table with the unique index on it and inserted rows
>>>> one at a time until I confirmed that it was the above values that were
>>>> causing a problem. Running the following query shows the difference
>>>> in the hex encoded values (I changed my client encoding to WIN1250 to
>>>> get the below to show up correctly):
>>>>
>>>> select name, encode(decode(name, 'escape'), 'hex') from ...
>>>>
>>>> name | encode
>>>> ---------------+----------------------------
>>>> Daniel Brühl | 44616e69656c204272c3bc686c
>>>> Daniel Bruehl | 44616e69656c2042727565686c
>>>> (2 rows)
>>>>
>>>> I've also tried exporting using an encoding of WIN1250 but I get
>>>> errors like this:
>>>>
>>>> pg_dump: Error message from server: ERROR: character 0xc383 of
>>>> encoding "UNICODE" has no equivalent in "WIN1250"
>>>>
>>>> Anyone have any thoughts or suggestions? Why would the index creation
>>>> fail? Is there a workaround?
>>>>
>>>>
>>
>> I'm not convinced your problem isn't solved by proper setting of
>> client_encoding for both input and output:
>>
>> pagila=# create table x (r varchar(255) unique);
>> NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
>> table "x "
>> CREATE TABLE
>> pagila=# set client_encoding=WIN1250;
>> SET
>> pagila=# insert into x (r) values ('Daniel Brühl');
>> INSERT 0 1
>> pagila=# insert into x (r) values ('Daniel Bruehl');
>> INSERT 0 1
>> pagila=# select * from x;
>> r
>> ---------------
>> Daniel Brühl
>> Daniel Bruehl
>> (2 rows)
>>
>>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Julio Leyva 2008-03-13 12:46:56 Re: LAN connection to server only after restart server
Previous Message Meetesh Karia 2008-03-13 10:00:41 Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2008-03-13 12:44:37 Re: Reducing Transaction Start/End Contention
Previous Message Pavan Deolasee 2008-03-13 11:40:08 Re: Nasty bug in heap_page_prune