This page in other versions: 9.0 / 9.1 / 9.2 / 9.3  |  Development versions: devel

F.40. unaccent

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes. It's a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search.

The current implementation of unaccent cannot be used as a normalizing dictionary for the thesaurus dictionary.

F.40.1. Configuration

An unaccent dictionary accepts the following options:

  • RULES is the base name of the file containing the list of translation rules. This file must be stored in $SHAREDIR/tsearch_data/ (where $SHAREDIR means the PostgreSQL installation's shared-data directory). Its name must end in .rules (which is not to be included in the RULES parameter).

The rules file has the following format:

  • Each line represents a pair, consisting of a character with accent followed by a character without accent. The first is translated into the second. For example,

    À        A
    Á        A
    Â        A
    Ã        A
    Ä        A
    Å        A
    Æ        A
    

A more complete example, which is directly useful for most European languages, can be found in unaccent.rules, which is installed in $SHAREDIR/tsearch_data/ when the unaccent module is installed.

F.40.2. Usage

Installing the unaccent extension creates a text search template unaccent and a dictionary unaccent based on it. The unaccent dictionary has the default parameter setting RULES='unaccent', which makes it immediately usable with the standard unaccent.rules file. If you wish, you can alter the parameter, for example

mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

or create new dictionaries based on the template.

To test the dictionary, you can try:

mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)

Here is an example showing how to insert the unaccent dictionary into a text search configuration:

mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 <b>Hôtel</b> de la Mer
(1 row)

F.40.3. Functions

The unaccent() function removes accents (diacritic signs) from a given string. Basically, it's a wrapper around the unaccent dictionary, but it can be used outside normal text search contexts.

unaccent([dictionary, ] string) returns text

For example:

SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');

Add Comment

Please use this form to add your own comments regarding your experience with particular features of PostgreSQL, clarifications of the documentation, or hints for other users. Please note, this is not a support forum, and your IP address will be logged. If you have a question or need help, please see the faq, try a mailing list, or join us on IRC. Note that submissions containing URLs or other keywords commonly found in 'spam' comments may be silently discarded. Please contact the webmaster if you think this is happening to you in error.

Proceed to the comment form.

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group