Re: [pgsql-packagers] Palle Girgensohn's ICU patch

From: Dave Page <dpage(at)pgadmin(dot)org>
To: Jakob Egger <jakob(at)eggerapps(dot)at>
Cc: Geoff Montee <geoff(dot)montee(at)gmail(dot)com>, Palle Girgensohn <girgen(at)pingpong(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-packagers(at)postgresql(dot)org" <pgsql-packagers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Bussmann Tobias <tobias(dot)bussmann(at)scnat(dot)ch>
Subject: Re: [pgsql-packagers] Palle Girgensohn's ICU patch
Date: 2014-11-27 09:15:43
Message-ID: CA+OCxozt5BjdggE9-QnYxCQBROATXQV6ytWES8XPdG77f=5kbw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 27, 2014 at 9:09 AM, Jakob Egger <jakob(at)eggerapps(dot)at> wrote:

> Am 26.11.2014 um 17:46 schrieb Geoff Montee <geoff(dot)montee(at)gmail(dot)com>:
> > This topic reminds me of a thread from a couple months ago:
> >
> >
> http://www.postgresql.org/message-id/F8268DB6-B50F-429F-8289-DA8FFA5F22BA@tripadvisor.com
> >
> > It sounds like adding ICU support to core may also allow for adding
> > collation versioning to indexes.
>
> Reading through this thread it becomes clear to me that adding support for
> ICU is more important than I thought, and the only problem is that no one
> has yet volunteered for it :)
>
> I've started looking through the PostgreSQL source and Palle's patch to
> estimate what needs to be done.
>
> MINIMUM TODO
> ============
>
> * Add support for per-column collations in varstr_comp() in varlena.c.
> Currently the patch creates a single ICU collator for the default collation
> and stores it in a static variable. We would need to change this to create
> collators for each collation and store them in a hash table similar to
> pg_newlocale_from_collation() / lookup_collation_cache()
>
> * There's a new feature in trunk for faster sorting using SortSupport, so
> we would also need to also patch bttextfastcmp_locale() in varlena.c
>
> These two changes would allow using ICU for collation. This has two major
> advantages:
> 1) Systems with broken strcoll like OS X and FreeBSD can take advantage of
> ICU to offer proper text sorting
> 2) You can link with a specific version of ICU to avoid index corruption
> and duplicate keys caused by changing implementations of the glibc strcoll
> function
>
>
> NEXT STEPS: Support for more collations
> =======================================
>
> ICU offers a lot more collations than the OS. For example, besides "de_CH"
> it also offers "de_CH(at)collation=phonebook". Adding support for these is a
> bit more involved.
>
> * initdb would need to be extended to also look for collations offered by
> ICU and add them to the pg_collation catalog.
>
> * A special case for LC_COLLATE must be added to check_locale() in the
> backend, get_canonical_locale_name() in pg_upgrade, check_locale_name() in
> initdb to support collations provided by ICU
>
> * pg_perm_setlocale() must get a special case to handle ICU collations
>
> * the local handling code in pgperl must be modified (when using a ICU
> collation as default collation, we must decide what collation to send to
> perl)
>
> * convert_string_datum() in selfuncs.c could be patched to use ICU instead
> of strxfrm. However, as far as I understand, this is not absolutely
> required as this is only used by the query planner and would in the worst
> case prevent some optimisation in corner cases
>
> These changes would probably have an even bigger impact, because then
> people would no longer be limited to the collations supported by the
> locales installed on their OS.
>
> NEXT STEPS: Collation versioning in indices
> ===========================================
>
> Since ICU provides reliable versioning of collations, this would allow us
> to finally prevent index corruption caused by changing implementations of
> strcoll. I haven't looked at this in detail, but I assume that this would
> be a small change with potentially big impact.
>
> Ideally, PostgreSQL would detect when the collation is a different version
> than the one used to create the index, and stop using the index until it is
> rebuilt.
>
>
> I'll take a shot at the MINIMUM TODO as outlined above.
>
>
We've already included ICU support in our Postgres Plus Advanced Server
product. Before you spend too much time on this, give me a few days to see
if we can get that change contributed back. The people I need to speak to
are OOO for Thanksgiving at the moment though, so it may be a few days.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-11-27 10:03:31 Re: [v9.5] Custom Plan API
Previous Message Ian Barwick 2014-11-27 09:12:31 pg_regress and --dbname option / multiple databases