Re: [pgsql-packagers] Palle Girgensohn's ICU patch

From: Palle Girgensohn <girgen(at)pingpong(dot)net>
To: Dave Page <dpage(at)pgadmin(dot)org>
Cc: Jakob Egger <jakob(at)eggerapps(dot)at>, Geoff Montee <geoff(dot)montee(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-packagers(at)postgresql(dot)org" <pgsql-packagers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Bussmann Tobias <tobias(dot)bussmann(at)scnat(dot)ch>
Subject: Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Date: 2015-04-19 11:46:28
Message-ID: 0ECFF0FA-2D9C-46D4-BEF8-34C7A5215FDC@pingpong.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> 27 nov 2014 kl. 10:15 skrev Dave Page <dpage(at)pgadmin(dot)org>:
>
>
>
> On Thu, Nov 27, 2014 at 9:09 AM, Jakob Egger <jakob(at)eggerapps(dot)at> wrote:
> Am 26.11.2014 um 17:46 schrieb Geoff Montee <geoff(dot)montee(at)gmail(dot)com>:
> > This topic reminds me of a thread from a couple months ago:
> >
> > http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
> >
> > It sounds like adding ICU support to core may also allow for adding
> > collation versioning to indexes.
>
> Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)
>
> I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.
>
> MINIMUM TODO
> ============
>
> * Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()
>
> * There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c
>
> These two changes would allow using ICU for collation. This has two major advantages:
> 1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
> 2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function
>
>
> NEXT STEPS: Support for more collations
> =======================================
>
> ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH(at)collation=phonebook". Adding support for these is a bit more involved.
>
> * initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.
>
> * A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU
>
> * pg_perm_setlocale() must get a special case to handle ICU collations
>
> * the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)
>
> * convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases
>
> These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.
>
> NEXT STEPS: Collation versioning in indices
> ===========================================
>
> Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.
>
> Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.
>
>
> I'll take a shot at the MINIMUM TODO as outlined above.
>
>
> We've already included ICU support in our Postgres Plus Advanced Server product. Before you spend too much time on this, give me a few days to see if we can get that change contributed back. The people I need to speak to are OOO for Thanksgiving at the moment though, so it may be a few days.
>
> --

Hi,

Just poking this old thread again. What happened here, is anyone putting work into this area at the moment?

Palle

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2015-04-19 12:00:49 RewindTest.pm missing --debug for remote mode in pg_rewind tests
Previous Message Amit Kapila 2015-04-19 03:59:32 Re: deparsing utility commands