The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the
LC_CTYPE settings of a database cannot be changed after its creation.
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are
char. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with
ORDER BY clauses and function or operator calls such as
<. The collation to apply for an
ORDER BY clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as
initcap; by pattern matching operators; and by
to_char and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a
COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
For example, consider this table definition:
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES", ... );
SELECT a < 'foo' FROM test1;
< comparison is performed according to
de_DE rules, because the expression combines an implicitly derived collation with the default collation. But in
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
the comparison is performed using
fr_FR rules, because the explicit collation derivation overrides the implicit one. Furthermore, given
SELECT a < b FROM test1;
the parser cannot determine which collation to apply, since the
b columns have conflicting implicit collations. Since the
< operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:
SELECT a < b COLLATE "de_DE" FROM test1;
SELECT a COLLATE "de_DE" < b FROM test1;
On the other hand, the structurally similar case
SELECT a || b FROM test1;
does not result in an error, because the
|| operator does not care about collations: its result is the same regardless of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in
SELECT * FROM test1 ORDER BY a || 'foo';
the ordering will be done according to
de_DE rules. But this query:
SELECT * FROM test1 ORDER BY a || b;
results in an error, because even though the
|| operator doesn't need to know a collation, the
ORDER BY clause does. As before, the conflict can be resolved with an explicit collation specifier:
SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is
libc, which uses the locales provided by the operating system C library. These are the locales used by most tools provided by the operating system. Another provider is
icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
A collation object provided by
libc maps to a combination of
LC_CTYPE settings, as accepted by the
setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set
LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an
LC_CTYPE setting that is different from
LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting
LC_CTYPE per expression.) Also, a
libc collation is tied to a character set encoding (see Section 24.3). The same collation name may exist for different encodings.
A collation object provided by
icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.
On all platforms, the collations named
POSIX are available. Additional collations may be available depending on operating system support. The
default collation selects the
LC_CTYPE values specified at database creation time. The
POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “
A” through “
Z” are treated as letters, and sorting is done strictly by character code byte values.
POSIX locales may behave differently depending on the database encoding.
Additionally, two SQL standard collation names are available:
This collation sorts using the Unicode Collation Algorithm with the Default Unicode Collation Element Table. It is available in all encodings. ICU support is required to use this collation. (This collation has the same behavior as the ICU root locale; see
und-x-icu (for “undefined”).)
This collation sorts by Unicode code point. It is only available for encoding
UTF8. (This collation has the same behavior as the libc locale specification
If the operating system provides support for using multiple locales within a single program (
newlocale and related functions), or if support for ICU is configured, then when a database cluster is initialized,
initdb populates the system catalog
pg_collation with collations based on all the locales it finds in the operating system at the time.
To inspect the currently available locales, use the query
SELECT * FROM pg_collation, or the command
\dOS+ in psql.
For example, the operating system might provide a locale named
initdb would then create a collation named
de_DE.utf8 for encoding
UTF8 that has both
LC_CTYPE set to
de_DE.utf8. It will also create a collation with the
.utf8 tag stripped off the name. So you could also use the collation under the name
de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.
The default set of collations provided by
libc map directly to the locales installed in the operating system, which can be listed using the command
locale -a. In case a
libc collation is needed that has different values for
LC_CTYPE, or if new locales are installed in the operating system after the database system was initialized, then a new collation may be created using the CREATE COLLATION command. New operating system locales can also be imported en masse using the
Within any particular database, only collations that use that database's encoding are of interest. Other entries in
pg_collation are ignored. Thus, a stripped collation name such as
de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one fewer thing you need to change if you decide to change to another database encoding. Note however that the
POSIX collations can be used regardless of the database encoding.
PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,
SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
will draw an error even though the
POSIX collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.
With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales.
initdb uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension
-x-icu appended, to distinguish them from libc locales.
Here are some example collations that might be created:
German collation, default variant
German collation for Austria, default variant
(There are also, say,
de-CH-x-icu, but as of this writing, they are equivalent to
und-x-icu(for “undefined”) #
ICU “root” collation. Use this to get a reasonable language-agnostic sort order.
Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in
pg_collation are ignored. Attempting to use one will draw an error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.
If the standard and predefined collations are not sufficient, users can create their own collation objects using the SQL command CREATE COLLATION.
The standard and predefined collations are in the schema
pg_catalog, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by
New libc collations can be created like this:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
The exact values that are acceptable for the
locale clause in this command depend on the operating system. On Unix-like systems, the command
locale -a will show a list.
Since the predefined libc collations already include all collations defined in the operating system when the database instance is initialized, it is not often necessary to manually create new ones. Reasons might be if a different naming system is desired (in which case see also Section 188.8.131.52.3) or if the operating system has been upgraded to provide new locale definitions (in which case see also
ICU collations can be created like:
CREATE COLLATION german (provider = icu, locale = 'de-DE');
ICU locales are specified as a BCP 47 Language Tag, but can also accept most libc-style locale names. If possible, libc-style locale names are transformed into language tags.
New ICU collations can customize collation behavior extensively by including collation attributes in the langugage tag. See Section 24.2.3 for details and examples.
The command CREATE COLLATION can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example:
CREATE COLLATION german FROM "de_DE"; CREATE COLLATION french FROM "fr-x-icu";
A collation is either deterministic or nondeterministic. A deterministic collation uses deterministic comparisons, which means that it considers strings to be equal only if they consist of the same byte sequence. Nondeterministic comparison may determine strings to be equal even if they consist of different bytes. Typical situations include case-insensitive comparison, accent-insensitive comparison, as well as comparison of strings in different Unicode normal forms. It is up to the collation provider to actually implement such insensitive comparisons; the deterministic flag only determines whether ties are to be broken using bytewise comparison. See also Unicode Technical Standard 10 for more information on the terminology.
To create a nondeterministic collation, specify the property
deterministic = false to
CREATE COLLATION, for example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
This example would use the standard Unicode collation in a nondeterministic way. In particular, this would allow strings in different normal forms to be compared correctly. More interesting examples make use of the ICU customization facilities explained above. For example:
CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false); CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
All standard and predefined collations are deterministic, all user-defined collations are deterministic by default. While nondeterministic collations give a more “correct” behavior, especially when considering the full power of Unicode and its many special cases, they also have some drawbacks. Foremost, their use leads to a performance penalty. Note, in particular, that B-tree cannot use deduplication with indexes that use a nondeterministic collation. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.
To deal with text in different Unicode normalization forms, it is also an option to use the functions/expressions
is normalized to preprocess or check the strings, instead of using nondeterministic collations. There are different trade-offs for each approach.
ICU allows extensive control over collation behavior by defining new collations with collation settings as a part of the language tag. These settings can modify the collation order to suit a variety of needs. For instance:
-- ignore differences in accents and case CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1'); SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true -- upper case letters sort before lower case. CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper'); SELECT 'B' < 'b' COLLATE upper_first; -- true -- treat digits numerically and ignore punctuation CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn'); SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
Many of the available options are described in Section 184.108.40.206, or see Section 220.127.116.11 for more details.
Comparison of two strings (collation) in ICU is determined by a multi-level process, where textual features are grouped into "levels". Treatment of each level is controlled by the collation settings. Higher levels correspond to finer textual features.
Table 24.1. ICU Collation Levels
The above table shows which textual feature differences are considered significant when determining equality at the given level. The unicode character
U+2063 is an invisible separator, and as seen in the table, is ignored for at all levels of comparison less than
At every level, even with full normalization off, basic normalization is performed. For example,
'á' may be composed of the code points
U&'\0061\0301' or the single code point
U&'\00E1', and those sequences will be considered equal even at the
identic level. To treat any difference in code point representation as distinct, use a collation created with
DETERMINISTIC set to
CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3'); CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4'); CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic'); -- invisible separator ignored at all levels except identic SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false -- punctuation ignored at level3 but not at level 4 SELECT 'x-y' = 'x_y' COLLATE level3; -- true SELECT 'x-y' = 'x_y' COLLATE level4; -- false
Table 24.2. ICU Collation Settings
||Sensitivity (or "strength") when determining equality, with
||If set to
||Backwards comparison for the level 2 differences. For example, locale
Enable full normalization; may affect performance. Basic normalization is performed even when set to
Full normalization is important in some cases, such as when multiple accents are applied to a single character. For example, the code point sequences
Separates case into a "level 2.5" that falls between accents and other level 3 features.
If set to
||If set to
||If set to
Set to one or more of the valid values, or any BCP 47
Redefines the ordering of classes of characters; those characters belonging to a class earlier in the list sort before characters belonging to a class later in the list. For instance, the value
||Classes of characters ignored during comparison at level 3. Setting to a later value includes earlier values; e.g.
||Collation type. See Section 18.104.22.168 for additional options and details.|
Defaults may depend on locale. The above table is not meant to be complete. See Section 22.214.171.124 for additional options and details.
For many collation settings, you must create the collation with
DETERMINISTIC set to
false for the setting to have the desired effect (see Section 126.96.36.199). Additionally, some settings only take effect when the key
ka is set to
shifted (see Table 24.2).
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');#
German collation with phone book collation type
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');#
Root collation with Emoji collation type, per Unicode Technical Standard #51
CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');#
Sort Greek letters before Latin ones. (The default is Latin before Greek.)
CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');#
Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)
CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');#
Combines both of the above options.
This section (Section 24.2.3) is only a brief overview of ICU behavior and language tags. Refer to the following documents for technical details, additional options, and new behavior: