From: | Kiran K V <kirankv(dot)1982(at)gmail(dot)com> |
---|---|
To: | Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: Issue with loading unicode characters with copy command |
Date: | 2024-01-12 17:47:27 |
Message-ID: | CAGv0ivpMT8cfK_2fbPzL3aPmd=UCps7uQjom84BHM6KL8fSD4g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Its UTF-8. Also verified the load file and its utf-8.
Regards,
Kiran
On Fri, Jan 12, 2024 at 10:48 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:
> On 1/12/24 07:23, Kiran K V wrote:
> > Hi,
> >
> >
> > I have a UTF8 database and simple table with two columns (integer and
> > varchar). Created a csv file with some multibyte characters and trying
> > to perform load operation using the copy command.
>
> The multibyte characters come from what character set?
>
> >
> > __ __
> >
> > Database info:____
> >
> > Postgresql database details:____
> >
> > Name | Owner | Encoding | Collate |
> > Ctype | Access privileges____
> >
> >
> -----------+----------+----------+--------------------+--------------------+-----------------------____
> >
> > postgres | postgres | UTF8 | English_India.1252 |
> > English_India.1252 |____
> >
> > __ __
> >
> > (Note: I also tried with collate utf8 and no luck)
> >
> >
> > postgres=# set client_encoding='UTF8';____
> >
> > SET____
> >
> > __ __
> >
> > Table:____
> >
> > create table public.test ( PKCOL integer not null, STR1 character
> > varying(64) null, primary key( PKCOL )) ____
> >
> > ____
> >
> > csv contents:____
> >
> > 1|"àáâãäåæçèéêëìíîï"____
> >
> > __ __
> >
> > After data loading, actual data is becoming____
> >
> > à áâãäåæçèéêëìÃîï____
> >
> > hex of this is -
> > c2a1c2a2c2a3c2a4c2a5c2a6c2a7c2a8c2a9c2aac2abc2acc2aec2af____
> >
> > __ __
> >
> > The hex values are indeed the UTF-8 encodings of the characters in your
> > expected string, and the presence of `C2` before each character is
> > indicative of how UTF-8 represents certain characters.____
> >
> > In UTF-8, characters from the extended Latin set (like `à`, `á`, `â`,
> > etc.) are represented as two bytes. The first byte `C2` or `C3`
> > indicates that this is a two-byte character, and the second byte
> > specifies the character. For example:____
> >
> > - `à` is represented as `C3 A0`____
> >
> > - `á` is `C3 A1`____
> >
> > - `â` is `C3 A2`, and so on.____
> >
> > In this case, the `C2` byte is getting interpreted as a separate
> > character and that is the likely reason that an `Â` (which corresponds
> > to `C2`) is seen before each intended character. Looks like UTF-8
> > encoded data is mistakenly interpreted as Latin-1 (ISO-8859-1) or
> > Windows-1252, where each byte is treated as a separate character.
> >
> >
> > Please advise. Thank you very much.
> >
> >
> > Regards,
> >
> > Kiran
> >
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Jim Nasby | 2024-01-12 19:11:05 | Re: [EXTERNAL]Re: Refresh Materialized View Issue |
Previous Message | Adrian Klaver | 2024-01-12 17:32:53 | Re: Time zone offset in to_char() |