Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From: Palle Girgensohn <girgen(at)pingpong(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date: 2016-08-10 20:42:01
Message-ID: A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
Lists: pgsql-hackers


> On 4 Aug 2016, at 02:40, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September :-)
>
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for encodings other than UTF-8, mostly because I don't know how to test them, but also because I see little point in supporting anything else when using ICU.

I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out 2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out 2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
false
-------
! f
(1 row)

SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
false
-------
! t
(1 row)

SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
true
------
! t
(1 row)

-- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
true
------
! f
(1 row)

-- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas whether one is more correct than the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?
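
To make the lower(both) versus upper(both) question concrete, here is a minimal Python sketch. It is only an illustration, not what PostgreSQL, glibc or ICU do internally. The helpers tr_lower, tr_upper and contains_ci are hypothetical names, the Turkish rules are hand-written approximations of the dotted/dotless i mappings, and a plain substring test stands in for ~*.

```python
# Sketch only: compares four ways of folding case before a substring match.
# tr_lower / tr_upper are hypothetical helpers approximating Turkish rules
# (I <-> ı and İ <-> i); Python's str.lower / str.upper use the default,
# locale-independent Unicode mappings.

def tr_lower(s):
    # Turkish-style: capital I lowers to dotless ı, capital İ lowers to i.
    return s.replace("I", "ı").replace("İ", "i").lower()

def tr_upper(s):
    # Turkish-style: small i uppers to dotted İ, dotless ı uppers to I.
    return s.replace("i", "İ").replace("ı", "I").upper()

def contains_ci(haystack, needle, fold):
    # Stand-in for ~* with a literal pattern: fold both sides, then search.
    return fold(needle) in fold(haystack)

# The two cases from the regression diff: (subject, pattern).
cases = [("Türkiye", "KI"), ("bıt", "BIT")]

folds = {
    "default lower": str.lower,
    "default upper": str.upper,
    "turkish lower": tr_lower,
    "turkish upper": tr_upper,
}

results = {
    name: [contains_ci(h, n, fold) for h, n in cases]
    for name, fold in folds.items()
}

for name, outcome in results.items():
    print(f"{name}: {outcome}")
# default lower: [True, False]
# default upper: [True, True]
# turkish lower: [False, True]
# turkish upper: [False, True]
```

In this sketch the two Turkish-aware folds agree with each other and with the expected output of the test, while default lowercasing gives the same t/f pattern as the ICU results in the diff. So with language-neutral mappings the choice between lower and upper does change the outcome for 'bıt', whereas with Turkish tailoring it does not. This may hint at where the difference comes from, but it proves nothing about the actual code paths.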

As Thomas points out, using ucol_strcoll is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html
