Re: DB encoding, locale and indexes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sterfield <sterfield(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: DB encoding, locale and indexes
Date: 2015-02-05 14:56:11
Message-ID: 5316.1423148171@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Sterfield <sterfield(at)gmail(dot)com> writes:
> I'm a sysadmin working for an application that stores all its data in a PG
> database.
> Currently, the cluster has its encoding set to UTF-8, and the locale (both
> LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.

> I discovered recently that the indexes created on varchar fields are not
> working for LIKE operator, as they are created without the correct class
> operator (as stated in
> http://www.postgresql.org/docs/9.2/static/indexes-types.html).

Right, because en_US.UTF-8 uses dictionary sort order rather than plain
byte-by-byte sort.

> The most straight-forward solution would be to create a second index on the
> same field but with the class operator, in order to have indexes both for
> =, >, < operators AND LIKE / regexp operators. Few additional indexes, some
> diskspace eaten, problem solved.

Yup.

> However, some people are saying that nothing has to change on the index,
> and that the only thing we have to do is to change the LC_COLLATE of each
> databases to 'C', in order for the indexes to work without the class
> operator.

Yes, that is another possible solution, and it's documented. Keep in mind
though that you can *not* just reach into pg_database and tweak those
fields; if you did, all your indexes would be corrupt, because they'd no
longer match the sort order the system is expecting. The only safe way to
get there would be to dump and reload into a new database set up this way.
(If you wanted to live dangerously, I guess you could manually tweak the
pg_database fields and then REINDEX every affected index ... but this
still involves substantial downtime, and I would not recommend doing it
without practicing on a test installation.)

You also have to ask whether any of your applications are expecting ORDER
BY some-text-field to produce dictionary order rather than ASCII order.

> - If I have unicode character stored in my database (for example ''),
> and the LC_COLLATE set to 'C', how the index will behave if I do a query
> with LIKE '<something>%' ?

It's still the same character, but it will sort in a possibly unexpected
way.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Sterfield 2015-02-05 15:18:39 Re: DB encoding, locale and indexes
Previous Message Igor Neyman 2015-02-05 14:04:30 Re: Error: could not read symbolic link "pg_tblspc/940585". No such file or directory