adding collation to a SQL database

From: "Dave Gudeman" <dave(dot)gudeman(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: adding collation to a SQL database
Date: 2008-10-18 09:55:23
Message-ID: 7b079fba0810180255w11f78652l540f04d88e85264d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Normally I would lurk on this list for a month or two before I considered
posting, but I saw a note on collation and thought some of you might be
interested in my experience implementing collation for the ANTs Data Server.

In ADS, we treated the collation internally as part of the type for string
types. In other words, char, varchar, and clob expressions and columns had
not only a length and a can-be-null flag, but also a collation. This let us
do a lot less work to add collations because we didn't have to munge the
expression code very much. We just adapted the code that does type
assignment and then the information was carried through the compiler in the
type info with no further work (except where the compiler actually had to
know about collation). We didn't need to mess with index selection, for
example, because it was handled correctly just based on types. And it let us
add collated indexes very easily just by adding the new type information. We
treated those weird collation expressions like type casts, which made them
easier to implement also.

Collated compares are very slow and it's worth quite a bit of effort to to
avoid them when possible (for example by putting them last in the list of
predicates to evaluate). ICU has a facility for prepocessing search strings
so they can be byte compared (which is much faster than normal collated
compares). You can construct a collated index by using proprocessed search
strings as the keys so that index searches are faster. We didn't do that in
ADS for several reasons: first, the preprocessed strings are bigger (about
30% as I recall) so you can get fewer keys in a node. Second, you don't do
really do a lot of comparisons in an index lookup. Third, ADS had an
optimization where it didn't read the data files if the index contained all
of the columns needed for the query, and that optimization would not work
with these sorts of collation indexes. Still, the parameters of Postgresql
are quite a bit different so it might be worth considering.

Partly for backward compatibility and partly because collated compares are
so slow, ADS had a default collation that was just a normal ascii compare.
Since strings were all encoded in UTF-8, the collation order was incorrect
if there were any multi-byte characters, but in English-only columns, or
columns where they just wanted a reproducible ordering and didn't care that
much about language rules, it was much faster than the ICU compare.

regards,
David Gudeman

Browse pgsql-hackers by date

  From Date Subject
Next Message Martijn van Oosterhout 2008-10-18 10:51:20 Re: PGDay.it collation discussion notes
Previous Message Simon Riggs 2008-10-18 08:55:20 Re: Reducing some DDL Locks to ShareLock