Re: [pgsql-packagers] Palle Girgensohn's ICU patch

From: Jakob Egger <jakob(at)eggerapps(dot)at>
To: Geoff Montee <geoff(dot)montee(at)gmail(dot)com>
Cc: Palle Girgensohn <girgen(at)pingpong(dot)net>, Dave Page <dpage(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-packagers(at)postgresql(dot)org" <pgsql-packagers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Bussmann Tobias <tobias(dot)bussmann(at)scnat(dot)ch>
Subject: Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Date: 2014-11-27 09:09:38
Message-ID: 72AE2E04-CD4E-4E7A-9303-49DE5354B4B3@eggerapps.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Am 26.11.2014 um 17:46 schrieb Geoff Montee <geoff(dot)montee(at)gmail(dot)com>:
> This topic reminds me of a thread from a couple months ago:
>
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
>
> It sounds like adding ICU support to core may also allow for adding
> collation versioning to indexes.

Reading through this thread it becomes clear to me that adding support for ICU is more important than I thought, and the only problem is that no one has yet volunteered for it :)

I've started looking through the PostgreSQL source and Palle's patch to estimate what needs to be done.

MINIMUM TODO
============

* Add support for per-column collations in varstr_comp() in varlena.c. Currently the patch creates a single ICU collator for the default collation and stores it in a static variable. We would need to change this to create collators for each collation and store them in a hash table similar to pg_newlocale_from_collation() / lookup_collation_cache()

* There's a new feature in trunk for faster sorting using SortSupport, so we would also need to also patch bttextfastcmp_locale() in varlena.c

These two changes would allow using ICU for collation. This has two major advantages:
1) Systems with broken strcoll like OS X and FreeBSD can take advantage of ICU to offer proper text sorting
2) You can link with a specific version of ICU to avoid index corruption and duplicate keys caused by changing implementations of the glibc strcoll function

NEXT STEPS: Support for more collations
=======================================

ICU offers a lot more collations than the OS. For example, besides "de_CH" it also offers "de_CH(at)collation=phonebook". Adding support for these is a bit more involved.

* initdb would need to be extended to also look for collations offered by ICU and add them to the pg_collation catalog.

* A special case for LC_COLLATE must be added to check_locale() in the backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in initdb to support collations provided by ICU

* pg_perm_setlocale() must get a special case to handle ICU collations

* the local handling code in pgperl must be modified (when using a ICU collation as default collation, we must decide what collation to send to perl)

* convert_string_datum() in selfuncs.c could be patched to use ICU instead of strxfrm. However, as far as I understand, this is not absolutely required as this is only used by the query planner and would in the worst case prevent some optimisation in corner cases

These changes would probably have an even bigger impact, because then people would no longer be limited to the collations supported by the locales installed on their OS.

NEXT STEPS: Collation versioning in indices
===========================================

Since ICU provides reliable versioning of collations, this would allow us to finally prevent index corruption caused by changing implementations of strcoll. I haven't looked at this in detail, but I assume that this would be a small change with potentially big impact.

Ideally, PostgreSQL would detect when the collation is a different version than the one used to create the index, and stop using the index until it is rebuilt.

I'll take a shot at the MINIMUM TODO as outlined above.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ian Barwick 2014-11-27 09:12:31 pg_regress and --dbname option / multiple databases
Previous Message Stephen Frost 2014-11-27 07:03:05 Re: copy.c handling for RLS is insecure