Re: Catastrophic changes to PostgreSQL 8.4

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Kern Sibbald <kern(at)sibbald(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, bacula-devel <bacula-devel(at)lists(dot)sourceforge(dot)net>, bacula-users <bacula-users(at)lists(dot)sourceforge(dot)net>
Subject: Re: Catastrophic changes to PostgreSQL 8.4
Date: 2009-12-03 08:33:57
Message-ID: 4B1777F5.9080804@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Kern Sibbald wrote:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.
>
> The first thing to understand is that it is *impossible* to know what the
> encoding is on the client machine (FD -- or File daemon). On say a
> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
> copy a Mac directory. Finally, using system calls to create a file, you can
> put *any* character into a filename.

While true in theory, in practice it's pretty unusual to have filenames
encoded with an encoding other than the system LC_CTYPE on a modern
UNIX/Linux/BSD machine.

I'd _very_ much prefer to have Bacula back my machines up by respecting
LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on
the fd's host is not utf-8 and the database is.

If the database was SQL_ASCII, it could retain its existing behaviour.

That way, people get to pick between the two rational behaviours:

(1) Store file names as raw byte strings (SQL_ASCII). Guaranteed to work
even on garbage file names that aren't valid in the current system
encoding, but has issues with searches, filename matches, restoring
to another system, etc.

(2) Store file names as UTF-8, performing any required translation from
the system charset at the file daemon. File names that are nonsense
in the system encoding are either (a) rejected with an error in the
fd logs, or (b) backed up with some form of name mangling.

I *strongly* suspect most people will pick the second option.

There's also a third possibility:

(3) As (2), but add a `bytea' column to `path' and `filename' tables
that's null if the fd was able to convert the filename from the
system LC_CTYPE to utf-8. In the rare cases it couldn't (due to
reasons like users running with different LC_CTYPE, nfs volumes
exported to systems with different LC_CTYPE, tarballs from
systems with different charsets, etc) the raw unconverted bytes
of the filename get stored in the bytea field, and a mangled
form of the name gets stored in the text field for user display
purposes only.

I don't know if that'd be worth the hassle, though. I'd just want to use
(2) and I suspect so would a majority of users.

> On MySQL we use BLOBS. On PostgreSQL, we TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation. This
> works well, and I hope that PostgreSQL will continue to support letting
> Bacula insert text characters in the database with no character encoding
> checks in the future.

Even if that was removed (which I can't see happening) you could use the
bytea type that's designed for exactly that purpose. Pity it's a bit of
a pain to work with :-(

>> Because Bacula does not store the encoding or seem to transcode the file
>> name to a single known encoding, it does not seem to be possible to
>> retrieve files by name if the bacula console is run on a machine with a
>> different text encoding to the machine the files came from. After all,
>> café in utf-8 is a different byte sequence to café in iso-9660-1, and
>> won't match in equality tests under SQL_ASCII.
>
> If all the filenames go in in "binary" or "litteral" form, then any tests will
> work fine. The only test Bacula does is "equality".

Byte strings for the same sequence of characters in different encodings
are not equal. If you're looking for a file called "café.txt" that was
created on a machine with a latin-1 encoding, you cannot find it by
searching for 'café' in bconsole because bconsole will search for the
utf-8 byte sequence for 'café' not the latin-1 byte sequence for 'café'.

$ python
>>> x = u"café"
>>> x.encode("utf-8")
'caf\xc3\xa9'
>>> x.encode("latin-1")
'caf\xe9'
>>> x.encode("utf-8") == x.encode("latin-1")
False

or in Pg:

craig=> SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)

craig=> CREATE TABLE test (x text);
CREATE TABLE

craig=> INSERT INTO test(x) VALUES ('café');
INSERT 0 1

craig=> SELECT x, x::bytea FROM test;
x | x
------+-------------
café | caf\303\251
(1 row)

craig=> SELECT convert_to(x, 'latin-1') from test;
convert_to
------------
caf\351
(1 row)

craig=> SELECT convert_to(x, 'utf-8') = x::bytea,
convert_to(x, 'utf-8') = convert_to(x, 'latin-1')
FROM test;
?column? | ?column?
----------+----------
t | f
(1 row)

> The one place where we do convert filenames is on Windows. We convert UCS to
> UTF-8.

Cool. I thought that must be the case, but it's good to know.

What about Mac OS X? It stores file names in UTF-16 normalized form, but
has a variety of ways to access those files, including POSIX interfaces.

Hmm. A quick test suggests that, irrespective of LC_CTYPE, LANG and
LC_ALL, Mac OS X converts file names to UTF-8 for use with POSIX APIs. A
bit of digging helps confirm that:

http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html

... so Mac OS X is no problem. It's just traditional UNIXes which refuse
to standardize on a unicode encoding at the file system level that're a
problem.

(It drives me NUTS that Linux file systems don't do everything in
unicode, with the VFS layer or even glibc converting between unicode and
LC_CTYPE for programs that want another encoding. Even Microsoft figured
out the need for this one in 1998!)

> Yes, if you restore on a different system with a different encoding, you will
> end up with the same binary string at the OS level, but when you see the
> filenames they may look different.

... which to most users equates to "argh, Bacula corrupted my backup!"
or "argh, all my filenames are garbage!". Much as they'll say "I changed
my system language and now my filenames are all garbage" if they change
LC_CTYPE...

>> Personally, I'd like to see Bacula capable of using a utf-8 database,
>> with proper encoding conversion at the fd for non-utf-8 encoded client
>> systems. It'd really simplify managing backups for systems with a
>> variety of different encodings.
>
> Maybe this will happen someday, but first we will need Linux/Unix systems that
> *force* the encoding to be in some particular standard.

That would make me a happy, happy fellow.

I don't really see why it can't be done in glibc now, in truth. Convert
everything that goes through a system call to/from utf-8.

> I didn't actually run it myself, so it is possible that it produced an error
> message, but it did apparently create the database but with UTF-8 encoding.
> Most of these things are done in script files, so certain non-fatal errors
> may be overlooked.
>
> As far as I can tell, it took the above encoding command, and perhaps printed
> an error message but went ahead and created the database with an encoding
> that was not correct.

If that's the case, it'd be REALLY good to find out more about the
systems it happened on. Affected Bacula users, please speak up now. Tom
Lane pointed out and demonstrated that it should report an error and
fail to create the DB in these cases, so if it's not that's something
that needs investigation.

Are you sure it wasn't users (or "helpful" distro packagers) altering
create_postgresql_database to issue a CREATE DATABASE statement with
utf-8 ... perhaps after getting errors they didn't understand when using
SQL_ASCII ?

--
Craig Ringer

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Sim Zacks 2009-12-03 09:13:32 numeric cast oddity
Previous Message Pavel Stehule 2009-12-03 08:29:55 Re: Catastrophic changes to PostgreSQL 8.4

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurent Laborde 2009-12-03 08:51:25 Re: Cost of sort/order by not estimated by the query planner
Previous Message Pavel Stehule 2009-12-03 08:29:55 Re: Catastrophic changes to PostgreSQL 8.4