This page in other versions: 9.2 / 9.3 / 9.4 / 9.5 / current (9.6)  |  Development versions: devel / 10  |  Unsupported versions: 9.1

23.2. Collation Support

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.

23.2.1. Concepts

Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text, varchar, and char. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.

The collation of an expression can be the default collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.

When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY clauses and function or operator calls such as <. The collation to apply for an ORDER BY clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower, upper, and initcap; by pattern matching operators; and by to_char and related functions.

For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.

The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:

  1. If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.

  2. Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.

  3. If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.

For example, consider this table definition:

    a text COLLATE "de_DE",
    b text COLLATE "es_ES",

Then in

SELECT a < 'foo' FROM test1;

the < comparison is performed according to de_DE rules, because the expression combines an implicitly derived collation with the default collation. But in

SELECT a < ('foo' COLLATE "fr_FR") FROM test1;

the comparison is performed using fr_FR rules, because the explicit collation derivation overrides the implicit one. Furthermore, given

SELECT a < b FROM test1;

the parser cannot determine which collation to apply, since the a and b columns have conflicting implicit collations. Since the < operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:

SELECT a < b COLLATE "de_DE" FROM test1;

or equivalently

SELECT a COLLATE "de_DE" < b FROM test1;

On the other hand, the structurally similar case

SELECT a || b FROM test1;

does not result in an error, because the || operator does not care about collations: its result is the same regardless of the collation.

The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in

SELECT * FROM test1 ORDER BY a || 'foo';

the ordering will be done according to de_DE rules. But this query:

SELECT * FROM test1 ORDER BY a || b;

results in an error, because even though the || operator doesn't need to know a collation, the ORDER BY clause does. As before, the conflict can be resolved with an explicit collation specifier:

SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";

23.2.2. Managing Collations

A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales that most tools provided by the operating system use. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 23.3). The same collation name may exist for different encodings.

A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate collate and ctype settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database. Standard Collations

On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify traditional C behavior, in which only the ASCII letters A through Z are treated as letters, and sorting is done strictly by character code byte values.

Additionally, the SQL standard collation name ucs_basic is available for encoding UTF8. It is equivalent to C and sorts by Unicode code point. Predefined Collations

If the operating system provides support for using multiple locales within a single program (newlocale and related functions), or if support for ICU is configured, then when a database cluster is initialized, initdb populates the system catalog pg_collation with collations based on all the locales it finds in the operating system at the time.

To inspect the currently available locales, use the query SELECT * FROM pg_collation, or the command \dOS+ in psql. libc collations

For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8. It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.

The default set of collations provided by libc map directly to the locales installed in the operating system, which can be listed using the command locale -a. In case a libc collation is needed that has different values for LC_COLLATE and LC_CTYPE, or if new locales are installed in the operating system after the database system was initialized, then a new collation may be created using the CREATE COLLATION command. New operating system locales can also be imported en masse using the pg_import_system_collations() function.

Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one less thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.

PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,


will draw an error even though the C and POSIX collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended. ICU collations

Collations provided by ICU are created with names in BCP 47 language tag format, with a private use extension -x-icu appended, to distinguish them from libc locales. So de-x-icu would be an example name.

With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. (In fact, any string will be accepted as a locale name.) See for information on ICU locale naming. initdb uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Here are some example collations that might be created:


German collation, default variant


German collation for Austria, default variant

(There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this writing, they are equivalent to de-x-icu.)

und-x-icu (for undefined)

ICU root collation. Use this to get a reasonable language-agnostic sort order.

Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in pg_collation are ignored. Attempting to use one will draw an error along the lines of collation "de-x-icu" for encoding "WIN874" does not exist.

ICU allows collations to be customized beyond the basic language+country set that is preloaded by initdb. Users are encouraged to define their own collation objects that make use of these facilities to suit the sorting behavior to their requirements. Here are some examples:

CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk')

German collation with phone book collation type

CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji')

Root collation with Emoji collation type, per Unicode Technical Standard #51

CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit')

Sort digits after Latin letters. (The default is digits before letters.)

CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper')

Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)

CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit')

Combines both of the above options.

CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true')

Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).

See Unicode Technical Standard #35 and BCP 47 for details. The list of possible collation types (co subtag) can be found in the CLDR repository. The ICU Locale Explorer can be used to check the details of a particular locale definition.

Note that while this system allows creating collations that ignore case or ignore accents or similar (using the ks key), PostgreSQL does not at the moment allow such collations to act in a truly case- or accent-insensitive manner. Any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values. Copying Collations

The command CREATE COLLATION can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example:

CREATE COLLATION french FROM "fr-x-icu";

The standard and predefined collations are in the schema pg_catalog, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by pg_dump.

Submit correction

If you see anything in the documentation that is not correct, does not match your experience with the particular feature or requires further clarification, please use this form to report a documentation issue.

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group