Re: another seemingly simple encoding question

From: joseph <kmh496(at)kornet(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: another seemingly simple encoding question
Date: 2006-03-24 14:43:45
Message-ID: 1143211425.25613.0.camel@var.sirfsup.com
Lists: pgsql-general

the problem is this: my string is in utf-8, because all input is
first converted in php with
$str_out = mb_convert_encoding($str_in, "UTF-8");
and the query looks like
"select wordid from korean_english where word='utf8string'";
but it returns wordids for words which are not equal to utf8string.

(in debug mode) php prints the above query to the browser as UTF-8
and then calls "exit;", so i just cut and paste the whole query
string from the browser.
running the query from php and from psql returns the same bad wordids,
which suggests that the encoding stays consistent through the
cut-and-paste operation.
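
One way to check that the pasted string really is what was sent is to dump its code points and UTF-8 bytes on each side and compare them. A minimal Python sketch follows; the helper name `describe` is hypothetical and is not part of the setup described in this message.

```python
def describe(text: str) -> str:
    """Return the code points and UTF-8 bytes of text, for side-by-side comparison."""
    # One "U+XXXX" entry per code point in the string.
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    # One two-digit hex entry per byte of the UTF-8 encoding.
    utf8_hex = " ".join(f"{b:02x}" for b in text.encode("utf-8"))
    return f"{code_points} | {utf8_hex}"


# The Hangul syllable U+D55C as a single precomposed code point.
print(describe("\ud55c"))
```

If two strings that look the same on screen produce different dumps, they are not the same string as far as a byte-wise comparison is concerned.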

i don't understand what a "unicode normalization form" is. the postgres
docs at http://www.postgresql.org/docs/8.0/interactive/multibyte.html
say

Table 20-1. Server Character Sets

Name      Description
UNICODE   Unicode (UTF-8)

so i thought they were the same, and i don't know about "unicode
normalization form".

my question is: why isn't the utf8string in the query returning only
the matching wordids from the database?

thx.

On 2006-03-24 (Fri), 08:56 -0500, John D. Burger wrote:
> > i have a problem matching a utf8 string with a field in a database
> > encoded in utf8.
>
> You seem to give all the details of your configuration, but unless I
> misread your message, you don't say what the actual problem is. Can
> you provide more details? What exactly doesn't work?
>
> This may not be the issue, but many people don't realize that there
> are sometimes multiple ways to encode what is conceptually the same
> string in UTF8 (or any of the Unicode encodings). If you do not
> canonicalize your strings using one of the Unicode normalization
> forms, then seemingly identical strings may not match, because they
> are not byte-for-byte identical.
>
> - John D. Burger
> MITRE
>
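
To illustrate the normalization point in the quoted message: a Hangul syllable can be stored either as one precomposed code point or as a sequence of conjoining jamo, and the two only compare equal after normalization. A minimal Python sketch using the standard `unicodedata` module; the helper name `same_text` is hypothetical.

```python
import unicodedata

# The same syllable written two ways.
precomposed = "\ud55c"             # one code point, U+D55C
decomposed = "\u1112\u1161\u11ab"  # three conjoining jamo

# Without normalization the strings differ code point by code point.
print(precomposed == decomposed)


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after putting both into the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# After normalizing both sides they compare equal.
print(same_text(precomposed, decomposed))
```

Note that a normalization mismatch makes identical-looking strings fail to match. That is a different symptom from a query that returns extra, non-matching rows, so this sketch only explains the term and does not diagnose the problem reported above.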
