Re: `pg_trgm` not recognizing Chinese characters in macOS

From: 周正中(德歌) <dege(dot)zzz(at)alibaba-inc(dot)com>
To: "Haotian Yang" <yangnw(at)live(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: `pg_trgm` not recognizing Chinese characters in macOS
Date: 2018-09-12 05:02:48
Message-ID: 31ad828c-7926-41d7-b54e-6d3c79cc2a03.dege.zzz@alibaba-inc.com
Lists: pgsql-bugs

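You should not set lc_ctype to C. With a UTF-8 ctype such as en_US.UTF8, show_trgm returns trigrams that cover the Chinese characters (shown as hex values); with lc_ctype='C' they are dropped: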

```
postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 newdb     | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 |
 postgres  | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 |
 template0 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(4 rows)

postgres=# select show_trgm('hello你好');
                      show_trgm
------------------------------------------------------
 {0xcf7970,0xfe5170,0x114ebf,"  h"," he",ell,hel,llo}
(1 row)

postgres=# create database testdb with template template0 lc_ctype='C';
CREATE DATABASE
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
testdb=# create extension pg_trgm;
CREATE EXTENSION
testdb=# select show_trgm('hello你好');
            show_trgm
---------------------------------
 {"  h"," he",ell,hel,llo,"lo "}
(1 row)
```
------------------------------------------------------------------
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Sent: Tuesday, September 11, 2018 21:20
To: Haotian Yang <yangnw(at)live(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: `pg_trgm` not recognizing Chinese characters in macOS

Haotian Yang <yangnw(at)live(dot)com> writes:
> Versions: macOS 10.13.6, PostgreSQL 10.5, pg_trgm 1.3.
> LC_ALL=en_US.UTF-8

pg_trgm relies on libc's functions (specifically, iswalpha()) to determine
what is a word character or not. Unfortunately, the UTF8 locale support
in macOS is pretty incomplete, and I don't find it too surprising that
it's not recognizing Chinese characters as alphabetic. Now, you could
make a good argument that they *shouldn't* be considered alphabetic in
an en_US locale; but I'm unsure whether switching to a more appropriate
locale will help.

Anyway, I'd first try zh_CN.UTF-8, and if that doesn't fix it, the place
to complain is https://bugreport.apple.com/ ... I'm sure they know about
it already, but the number of reports has an impact on how fast they
fix things.

regards, tom lane
