Locale, Collation, ICU patch

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Locale, Collation, ICU patch
Date: 2008-04-03 17:54:50
Message-ID: 87hceig6xh.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Regarding the ICU patch in the commitfest, here's my plan.

IMHO the idea of making ICU a hard dependency which Postgres will have to use
forevermore on all systems is a non-starter. I'm not entirely against having
ICU as a supported collation system which packagers on systems where the
system locale support is weak can choose to make a dependency of their binary
packages though, assuming the issues raised elsewhere about ICU are resolved.

As long as this bogeyman is scaring us, though, it's preventing us from having
the SQL standard collation syntax and the accompanying catalog and planner
changes.

And as long as we don't have that support -- which is a big job -- nobody
who's interested in implementing ICU or strcoll_l() or any other interfaces
for a new platform will get around to it. The actual porting glue to call
those functions on each platform is fairly lightweight and could easily be
done by experts on that platform who aren't catalog and planner mavens.

So we have a bit of a chicken and egg problem. We aren't getting the planner
and syntax changes because we aren't sure the support would be good on every
platform and we aren't getting the platform support because we don't have the
planner and catalog changes.

What I want to do is focus on adding the planner and catalog changes somehow.

We would implement a kind of baseline locale support, something only slightly
better than what we have now, by calling setlocale() before every comparison.
This is clearly not the recommended configuration, but as long as it handles
what we handle today without a performance hit, and a bit more besides, it
would be a big start.

I'm assuming we would check if the desired locale is the current locale and
skip the assignment. So if only one locale is *actually* in use then basically
no additional overhead is incurred. Moreover if the desired locale is C then
we can skip the assignment and use strcmp directly. So actually as long as
only one non-C locale is in use then no additional overhead would be incurred.
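
To make the skip logic concrete, here is a minimal sketch in C. It is not
taken from the patch: the names collate_cmp, current_collation and
setlocale_calls are hypothetical, and falling back to strcmp() when a locale
cannot be installed is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical state: the collation most recently installed via setlocale().
 * A C program starts in the "C" locale, so that is the initial value. */
static char current_collation[64] = "C";

/* Illustration only: counts how often setlocale() was actually called. */
static int setlocale_calls = 0;

static int is_c_locale(const char *name)
{
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0;
}

/* Compare a and b under the named collation; result has strcmp() semantics. */
static int collate_cmp(const char *a, const char *b, const char *collation)
{
    assert(a != NULL && b != NULL && collation != NULL);

    /* C locale: plain byte comparison, no locale switch needed. */
    if (is_c_locale(collation))
        return strcmp(a, b);

    /* Switch only when the desired locale differs from the installed one. */
    if (strcmp(current_collation, collation) != 0)
    {
        setlocale_calls++;
        if (setlocale(LC_COLLATE, collation) == NULL)
            return strcmp(a, b);    /* assumption: fall back to byte order */
        snprintf(current_collation, sizeof(current_collation), "%s", collation);
    }
    return strcoll(a, b);
}
```

With this shape, comparisons in the C locale never touch setlocale(), and a
run of comparisons in a single non-C locale pays for one switch at most.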

The big gotcha is what collation to use when comparing with data in the system
tables, especially the shared system tables. I think we do need to define a
database-wide encoding and collation to use for system tables. (Unless we can
get by with varchar_pattern_ops indexes on system tables?)

So the following use cases arise:

a) They're actually using only one collation for both the system tables and
their own data. This is well handled by our existing setup and would be
basically unchanged in the new setup.

b) They're using multiple collations for their data but only one "at a time",
either one per database or one per session. In this case they don't incur any
overhead.

c) They're using multiple collations for their data but only one collation in
a given application unit of work. This is probably the most common case for
OLTP applications, since each unit of work represents some particular user's
operation. In this case, as long as the system tables are set up to use the C
locale, this would require at most one setlocale() call per unit of work.

d) They're actively using multiple collations in a single query, possibly even
within a single sort (something like ORDER BY a COLLATE en_US, b COLLATE
es_US). This would perform passably on glibc but abysmally on most other
libcs.

From that point forward we would go about adding support for strcoll_l() and
other interfaces to handle case (d) on various platforms. For platforms with
no reasonable interface we could add an --enable-ICU option that users or
packagers could choose to use.
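
As a rough illustration of how thin the per-platform glue for case (d) could
be, here is a sketch using the POSIX.1-2008 calls newlocale(), strcoll_l() and
freelocale(). The collation_handle type and the collation_* function names are
hypothetical, and the sketch assumes a platform that provides these calls
through <locale.h> and <string.h>.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical handle: one locale object per collation in use. */
typedef struct
{
    const char *name;
    locale_t    loc;
} collation_handle;

/* Create a locale object for the named collation; returns 1 on success. */
static int collation_open(collation_handle *h, const char *name)
{
    assert(h != NULL && name != NULL);
    h->name = name;
    h->loc = newlocale(LC_COLLATE_MASK, name, (locale_t) 0);
    return h->loc != (locale_t) 0;
}

/* Compare under this handle's collation without touching the global locale. */
static int collation_cmp(const collation_handle *h, const char *a, const char *b)
{
    return strcoll_l(a, b, h->loc);
}

/* Release the locale object. */
static void collation_close(collation_handle *h)
{
    if (h->loc != (locale_t) 0)
    {
        freelocale(h->loc);
        h->loc = (locale_t) 0;
    }
}
```

Because each comparison names its own locale object, a sort that mixes two
collations needs two handles and no setlocale() calls between comparisons.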

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
