Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

From: Palle Girgensohn <girgen(at)pingpong(dot)net>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject: Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date: 2016-08-11 09:15:36
Message-ID: 7D9265F9-2E7B-4136-BCA1-AAE8481561B7@pingpong.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg(at)heroku(dot)com>:
>
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.
>
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code base.

>
> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
> that that matching the ordering used elsewhere not really hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a wrapper across the board seems like a good idea for refactoring.

>
> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
> --
> Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Artur Zakirov 2016-08-11 09:46:28 Re: Bug in to_timestamp().
Previous Message Marko Tiikkaja 2016-08-11 09:01:18 Re: Assertion failure in REL9_5_STABLE