Re: Encoding, Unicode, locales, etc.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Carlos Moreno <moreno_pg(at)mochima(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Encoding, Unicode, locales, etc.
Date: 2006-11-01 04:47:56
Message-ID: 12356.1162356476@sss.pgh.pa.us
Lists: pgsql-general

Carlos Moreno <moreno_pg(at)mochima(dot)com> writes:
> Why is it that the database
> cluster is restricted to a single locale (or single set of locales) instead
> of being configurable on a per-database basis?

Because we depend on libc's locale support, which (on many platforms)
isn't designed to switch between locales cheaply. The fact that we
allow a per-database encoding spec at all was probably a bad idea in
hindsight --- it's out front of what the code can really deal with.
My recollection is that the Japanese contingent argued for it on the
grounds that they needed to deal with multiple encodings and didn't
care about encoding/locale mismatch because they were going to use
C locale anyway. For everybody else though, it's a gotcha waiting
to happen.

This stuff is certainly far from ideal, but the amount of work involved
to fix it is daunting; see many past pgsql-hackers discussions.

> 2) By the same token (more or less), I have a test database, for which
> I ran initdb without specifying encoding or locale; then, I create a
> database with UTF8 encoding.

There's no such thing as "you didn't specify a locale". If you didn't
specify one on the initdb command line, then it was taken from the
environment. Try "show lc_collate" and "show lc_ctype" to see what
got used.
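
To illustrate the point outside the database, here is a minimal Python sketch of how a program that is given no explicit locale still ends up with one from the environment. It is not what initdb itself runs; the helper name effective_locale is made up for this example, and it only shows the general libc behavior of setlocale() with an empty locale string.

```python
import locale


def effective_locale(category):
    """Return the locale name libc picks for `category` from the environment.

    Hypothetical helper for illustration; not part of PostgreSQL.
    """
    try:
        # An empty string tells libc to consult LC_ALL, then the
        # category's own variable (e.g. LC_COLLATE), then LANG.
        return locale.setlocale(category, "")
    except locale.Error:
        # The environment names a locale that is not installed here;
        # report whatever is currently in effect instead.
        return locale.setlocale(category)


if __name__ == "__main__":
    print("LC_COLLATE:", effective_locale(locale.LC_COLLATE))
    print("LC_CTYPE:  ", effective_locale(locale.LC_CTYPE))
```

Running this in the same shell environment that was used for initdb should give a hint of what "show lc_collate" and "show lc_ctype" will report, though the database settings themselves remain the authoritative answer.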

> I try lower of a string that
> contains characters with accents (e.g., Spanish or French characters),
> and it works as it should according to Spanish or French rules --- it
> returns a string with the same characters in lowercase, with the same
> accent. Why did that work? My Linux machine has all en_US.UTF-8
> locales, and en_US is not even aware of characters with accents,

You sure? I'd sort of expect a UTF8 locale to know this stuff anyway.
In any case, Postgres doesn't know anything about case conversion
beyond what toupper/tolower tell it, so your experimental result is
sufficient proof that that locale includes these conversions.
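
The same experiment can be repeated against libc directly. The following is a rough Python sketch, assuming a POSIX system where ctypes.CDLL(None) exposes the C library and where wchar_t values are Unicode code points (true for glibc); the helper name libc_lower is invented for this example. It calls towlower(), the wide-character counterpart of tolower(), so the result reflects only what the active LC_CTYPE locale knows about case conversion.

```python
import ctypes
import locale

# Assumption: POSIX platform; CDLL(None) gives access to libc's symbols.
_libc = ctypes.CDLL(None)
_libc.towlower.argtypes = [ctypes.c_uint]  # wint_t
_libc.towlower.restype = ctypes.c_uint


def libc_lower(text):
    """Lowercase `text` one code point at a time using libc's towlower().

    Hypothetical helper for illustration; Python's own str.lower() is
    deliberately avoided because it does not consult the libc locale.
    """
    return "".join(chr(_libc.towlower(ord(ch))) for ch in text)


if __name__ == "__main__":
    try:
        # Adopt the LC_CTYPE locale named by the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # keep the current locale if the requested one is missing
    print(locale.setlocale(locale.LC_CTYPE), libc_lower("ÉCOLE ÑANDÚ"))
```

If the accented letters come back in lower case, the locale's character tables include those conversions, which is the same conclusion the lower() experiment in the database supports.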

regards, tom lane
