Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	h8mastre(at)gmail(dot)com
Cc:	pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters
Date:	2018-11-03 04:44:30
Message-ID:	2101.1541220270@sss.pgh.pa.us
Lists:	pgsql-bugs

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> On Encoding=UTF-8 database, try:
> SELECT show_trgm('123');
> → OK
> SELECT show_trgm('日本語');
> → probably OK.
> SELECT show_trgm('🔍');
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> SQL state: 22021

I failed to reproduce this on a Linux machine. It looks to me like the
problem is that Windows' MultiByteToWideChar doesn't think that UTF8
character is valid.
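
One platform difference that may be relevant here, offered as background and not as a diagnosis confirmed in this message: wchar_t is 16 bits on Windows and typically 32 bits on Linux, so a code point above U+FFFF such as U+1F50D needs two UTF-16 code units (a surrogate pair) on Windows but only one wchar_t on Linux. A minimal sketch, in which utf8_to_codepoint and utf16_units_needed are hypothetical helpers and not PostgreSQL functions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not PostgreSQL source): decode the single UTF-8
 * sequence starting at s. Assumes the input is well-formed, 1 to 4 bytes. */
uint32_t
utf8_to_codepoint(const unsigned char *s)
{
    assert(s != 0);
    if (s[0] < 0x80)                      /* 1 byte: 0xxxxxxx */
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)            /* 2 bytes: 110xxxxx */
        return ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)            /* 3 bytes: 1110xxxx */
        return ((uint32_t) (s[0] & 0x0F) << 12) |
               ((uint32_t) (s[1] & 0x3F) << 6) |
               (uint32_t) (s[2] & 0x3F);
    /* 4 bytes: 11110xxx */
    return ((uint32_t) (s[0] & 0x07) << 18) |
           ((uint32_t) (s[1] & 0x3F) << 12) |
           ((uint32_t) (s[2] & 0x3F) << 6) |
           (uint32_t) (s[3] & 0x3F);
}

/* Hypothetical helper: number of 16-bit units needed to hold cp in UTF-16.
 * Code points at or above U+10000 need a surrogate pair. */
int
utf16_units_needed(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;
}
```

With these helpers, the bytes F0 9F 94 8D (the magnifying-glass character from the report) decode to U+1F50D and need two 16-bit units, while the bytes E6 97 A5 ('日') decode to U+65E5 and need one.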

> Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
> https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
> char2wchar 4th parameter should take number of input bytes. However they
> pass character count.
> int clen = pg_mblen(ptr);
> ...
> char2wchar(character, 2, ptr, clen, mylocale);

Huh? pg_mblen returns the number of bytes in a multibyte character,
so this looks fine to me.
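
To illustrate the distinction between bytes and characters, here is a rough sketch of how a byte length can be read off the lead byte of a UTF-8 character. The function utf8_char_bytes is a hypothetical stand-in written for this example, not the pg_mblen implementation; it only shows why a single character can report a length of 1, 3, or 4.

```c
#include <assert.h>

/* Hypothetical helper (not PostgreSQL source): number of bytes occupied by
 * the UTF-8 character that starts at s, judged from its lead byte alone. */
int
utf8_char_bytes(const unsigned char *s)
{
    assert(s != 0);
    if ((*s & 0x80) == 0x00)    /* 0xxxxxxx: ASCII */
        return 1;
    if ((*s & 0xE0) == 0xC0)    /* 110xxxxx */
        return 2;
    if ((*s & 0xF0) == 0xE0)    /* 1110xxxx */
        return 3;
    if ((*s & 0xF8) == 0xF0)    /* 11110xxx */
        return 4;
    return 1;                   /* invalid lead byte: treat as one byte */
}
```

Under this sketch '1' is one byte, '日' (E6 97 A5) is three bytes, and '🔍' (F0 9F 94 8D) is four bytes, even though each is a single character.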

regards, tom lane

In response to

BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters at 2018-11-01 02:39:20 from PG Bug reporting form

Responses

Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters at 2018-11-03 17:03:20 from Tom Lane
