Re: BUG #5661: The character encoding in logfile is confusing.

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: BUG #5661: The character encoding in logfile is confusing.
Date: 2010-09-22 11:25:47
Message-ID: 4C99E7BB.40402@postnewspapers.com.au
Lists: pgsql-bugs pgsql-hackers

On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
>> A single log file should obviously be in a single encoding, it's the
>> only sane way to do things. But which encoding is it in? And which
>> *should* it be in?
>
> We need to produce the log output in the server encoding, because that's
> how we need to send it to the client.

That doesn't mean it can't be recoded for writing to the log file,
though. Perhaps it needs to be. It should be reasonably practical to
detect when the database and log encoding are the same and avoid the
transcoding performance penalty, not that it's big anyway.
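
As a rough sketch of that fast path (Python, purely illustrative; `recode_for_log` is a hypothetical helper and not anything that exists in PostgreSQL):

```python
import codecs


def recode_for_log(message: bytes, server_encoding: str, log_encoding: str) -> bytes:
    """Recode a message from the server encoding to the log encoding.

    Hypothetical helper: the names and signature are assumptions made
    for illustration only.
    """
    # Normalise the names so aliases such as "utf8" and "UTF-8" compare equal.
    same = codecs.lookup(server_encoding).name == codecs.lookup(log_encoding).name
    if same:
        # Fast path: encodings match, so hand back the bytes untouched.
        return message
    # Slow path: decode from the server encoding, then re-encode for the log.
    # Characters the log encoding cannot represent are written as escapes
    # rather than aborting the log write.
    text = message.decode(server_encoding, errors="replace")
    return text.encode(log_encoding, errors="backslashreplace")
```

The check is a cheap comparison, so the common case of matching encodings costs next to nothing.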

> If you have different databases
> with different server encodings, you will get inconsistently encoded
> output in the log file.

I don't think that's an OK answer, myself. Mixed encodings with no
delineation in one file = bug as far as I'm concerned. You can't even
rely on being able to search the log anymore. You'll only get away with
it when using languages that mostly stick to the 7-bit ASCII subset, so
most text is still readable; with most other languages you'll get logs
full of what looks to the user like garbage.

> Conceivably, we could create a configuration option that specifies the
> encoding for the log file, and strings are recoded from whatever gettext()
> produces to the specified encoding. initdb could initialize that option
> suitably, so in most cases users won't have to do anything.

Yep, I tend to think that'd be the right way to go. It'd still be a bit
of a pain, though, as messages written to stdout/stderr by the
postmaster should be in the system encoding, but messages written to the
log files should be in the encoding specified for logs, unless logging
is being done to syslog, in which case it has to be in the system
encoding after all...

And, of course, the postmaster still doesn't know how to log anything it
might emit before reading postgresql.conf, because it doesn't know what
encoding to use.

I still wonder if, rather than making this configurable, the right
choice is to force logging to UTF-8 (with BOM) across the board, right
from postmaster startup. It's consistent, all messages in all other
encodings can be converted to UTF-8 for logging, it's platform
independent, and text editors etc. tend to understand and recognise UTF-8,
especially with the BOM.
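
To make "UTF-8 with BOM" concrete, here is a minimal Python sketch, assuming the BOM should appear exactly once at the very start of the file. `append_log_line` is a made-up name for illustration, not a PostgreSQL function:

```python
import codecs
import io


def append_log_line(stream: io.BufferedIOBase, text: str) -> None:
    """Append one line to a log stream as UTF-8.

    Hypothetical helper: writes the UTF-8 BOM only when the stream is
    still at position 0, i.e. nothing has been written to it yet.
    """
    if stream.tell() == 0:
        stream.write(codecs.BOM_UTF8)  # the three bytes EF BB BF
    stream.write(text.encode("utf-8") + b"\n")


# Usage with an in-memory stream standing in for the log file.
log = io.BytesIO()
append_log_line(log, "LOG:  database system is ready")
append_log_line(log, "LOG:  広告掲載")
```

Any message in any server encoding would first have to be converted to text before reaching this point; that conversion is not shown here.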

Unfortunately, because many unix utilities (grep etc) aren't encoding
aware, that'll cause problems when people go to search log files. For
example, for "広告掲載" the log files will contain the UTF-8 bytes:

\xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89

but grep on a Shift-JIS system will be looking for:

\x8d\x4c\x8d\x90\x8c\x66\x8d\xda

so it won't match.
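
The mismatch is easy to reproduce. A small Python sketch, in which the "LOG:" line is an invented sample and the byte-level `in` test stands in for an encoding-unaware grep:

```python
needle = "広告掲載"

# The same four characters as bytes in each encoding.
utf8_bytes = needle.encode("utf-8")
sjis_bytes = needle.encode("shift_jis")

# A made-up log line, written in UTF-8 as proposed above.
log_line = b"LOG:  " + utf8_bytes + b"\n"

# A byte-oriented search using the Shift-JIS form of the pattern fails...
found_raw = sjis_bytes in log_line

# ...while a search that decodes the log as UTF-8 first succeeds.
found_decoded = needle in log_line.decode("utf-8")

print(utf8_bytes.hex(" "))
print(sjis_bytes.hex(" "))
print(found_raw, found_decoded)
```

In other words the search only works if either the pattern or the log is recoded so that both sides agree.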

Ugh. If only we could say "PostgreSQL requires a system locale with a
UTF-8 encoding". Alas, I don't think that'd go down very well with
packagers or installers. [Insert rant about how stupid it is that *nix
systems still aren't all UTF-8 here].

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/
