Re: Catastrophic changes to PostgreSQL 8.4

From: Kern Sibbald <kern(at)sibbald(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: pgsql-general(at)postgresql(dot)org, "bacula-devel" <bacula-devel(at)lists(dot)sourceforge(dot)net>, "bacula-users" <bacula-users(at)lists(dot)sourceforge(dot)net>
Subject: Re: Catastrophic changes to PostgreSQL 8.4
Date: 2009-12-03 07:33:38
Message-ID: 200912030833.38989.kern@sibbald.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

Hello,

Thanks for all the answers; I am a bit overwhelmed by the number, so I am
going to try to answer everyone in one email.

The first thing to understand is that it is *impossible* to know what the
encoding is on the client machine (FD -- or File daemon). On say a
Unix/Linux system, the user could create filenames with non-UTF-8 then switch
to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
copy a Mac directory. Finally, using system calls to create a file, you can
put *any* character into a filename.

So, rather than trying to figure everything out (impossible, I think) and
rather than failing to backup files, Bacula gets the "raw" filename from the
OS and stores it on the Volume then puts it in the database. We treat the
filename as if it is UTF-8 for display purposes, but in all other cases, what
we want is for the filename to go into the database and come back out
unchanged.

On MySQL we use BLOBS. On PostgreSQL, we TEXT and set the encoding to
SQL_ASCII so that PostgreSQL will not attempt to do any translation. This
works well, and I hope that PostgreSQL will continue to support letting
Bacula insert text characters in the database with no character encoding
checks in the future.

See more notes below ...

On Thursday 03 December 2009 03:54:07 Craig Ringer wrote:
> On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> > Hello,
> >
> > I am the project manager of Bacula. One of the database backends that
> > Bacula uses is PostgreSQL.
>
> As a Bacula user (though I'm not on the Bacula lists), first - thanks
> for all your work. It's practically eliminated all human intervention
> from something that used to be a major pain. Configuring it to handle
> the different backup frequencies, retention periods and diff/inc/full
> needs of the different data sets was a nightmare, but once set up it's
> been bliss. The 3.x `Accurate' mode is particularly nice.
>
> > Bacula sets the database encoding to SQL_ASCII, because although
> > Bacula "supports" UTF-8 character encoding, it cannot enforce it.
> > Certain operating systems such as Unix, Linux and MacOS can have
> > filenames that are not in UTF-8 format. Since Bacula stores filenames in
> > PostgreSQL tables, we use SQL_ASCII.
>
> I noticed that while doing some work on the Bacula database a while ago.
>
> I was puzzled at the time about why Bacula does not translate file names
> from the source system's encoding to utf-8 for storage in the database,
> so all file names are known to be sane and are in a known encoding.

We don't and cannot know the encoding scheme on Unix/Linux systems (see
above), so attempting to convert them to UTF-8 would just consume more CPU
time and result in errors at some point.

>
> Because Bacula does not store the encoding or seem to transcode the file
> name to a single known encoding, it does not seem to be possible to
> retrieve files by name if the bacula console is run on a machine with a
> different text encoding to the machine the files came from. After all,
> café in utf-8 is a different byte sequence to café in iso-9660-1, and
> won't match in equality tests under SQL_ASCII.

If all the filenames go in in "binary" or "litteral" form, then any tests will
work fine. The only test Bacula does is "equality". Bacula doesn't worry
about sorting. Users may care, but for backup and restore the only test
Bacula needs is equality, and as long as you are working with unchanged byte
streams everything works on every system.

The one place where we do convert filenames is on Windows. We convert UCS to
UTF-8.

>
> Additionally, I'm worried that restoring to a different machine with a
> different encoding may fail, and if it doesn't will result in hopelessly
> mangled file names. This wouldn't be fun to deal with during disaster
> recovery. (I don't yet know if there are provisions within Bacula its
> self to deal with this and need to do some testing).

Yes, if you restore on a different system with a different encoding, you will
end up with the same binary string at the OS level, but when you see the
filenames they may look different.

>
> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
> file daemon, using the encoding of the client, for storage in a utf-8
> database.

As I mention, this is not possible since Unix/Linux stores "binary" strings.
They can be in any format.

>
> Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as
> Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use
> other encodings. If a unix system does use another encoding, this may be
> determined from the locale in the environment and used to convert file
> names to utf-8.

As I mentioned above, on Linux/Unix systems, the user is free to change the
encoding at will, and when he does so, existing filenames remain unchanged,
so it is not possible to choose a particular encoding and backup and restore
files without changing the filenames. Bacula backs them up and restores them
using "binary" strings so we don't have problems of changing encoding.

>
> Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will
> have file names in the locale's encoding, like UNIX systems, and are
> fairly easily handled.
>
> About the only issue I see is that systems may have file names that are
> not valid text strings in the current locale, usually due to buggy
> software butchering text encodings. I guess a *nix system _might_ have
> different users running with different locales and encodings, too. The
> latter case doesn't seem easy to handle cleanly as file names on unix
> systems don't have any indication of what encoding they're in stored
> with them. I'm not really sure these cases actually show up in practice,
> though.
>
> Personally, I'd like to see Bacula capable of using a utf-8 database,
> with proper encoding conversion at the fd for non-utf-8 encoded client
> systems. It'd really simplify managing backups for systems with a
> variety of different encodings.

Maybe this will happen someday, but first we will need Linux/Unix systems that
*force* the encoding to be in some particular standard.

>
> ( BTW, one way to handle incorrectly encoded filenames and paths might
> be to have a `bytea' field that's generally null to store such mangled
> file names. Personally though I'd favour just rejecting them. )
>
> > We set SQL_ASCII by default when creating the database via the command
> > recommended in recent versions of PostgreSQL (e.g. 8.1), with:
> >
> > CREATE DATABASE bacula ENCODING 'SQL_ASCII';
> >
> > However, with PostgreSQL 8.4, the above command is ignored because the
> > default table copied is not template0.
>
> It's a pity that attempting to specify an encoding other than the safe
> one when using a non-template0 database doesn't cause the CREATE
> DATABASE command to fail with an error.

I didn't actually run it myself, so it is possible that it produced an error
message, but it did apparently create the database but with UTF-8 encoding.
Most of these things are done in script files, so certain non-fatal errors
may be overlooked.

As far as I can tell, it took the above encoding command, and perhaps printed
an error message but went ahead and created the database with an encoding
that was not correct. If that is indeed the case, then it is in my opinion,
a bad design policy. I would much prefer that either Postgres accept the
command or that it not create the database. This way, either the database
would work as the user expects or there would be no database, and the problem
would be resolved before it creates databases that cannot be read.

In any case we have corrected the command to include the TEMPLATE, but this
won't help people with older Bacula's.

The other point I wanted to emphasize is that the documentation implied that
future versions of Postgres may eliminate the feature of having SQL_ASCII
(i.e. the ability to input arbritrary binary strings). As I said, that would
be a pity -- I suppose we could switch to using LOs or whatever they are
called in Postgres, but that would be rather inconvenient.

Thanks for all the responses,

Best regards,

Kern

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Morus Walter 2009-12-03 07:55:55 Re: deferrable foreign keys
Previous Message A. Kretschmer 2009-12-03 06:27:49 Re: How to auto-increment?

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2009-12-03 08:29:55 Re: Catastrophic changes to PostgreSQL 8.4
Previous Message Itagaki Takahiro 2009-12-03 07:05:47 Re: ProcessUtility_hook