Re: BUG #5661: The character encoding in logfile is confusing.

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: BUG #5661: The character encoding in logfile is confusing.
Date: 2010-09-22 08:25:33
Message-ID: 4C99BD7D.1080409@postnewspapers.com.au
Lists: pgsql-bugs pgsql-hackers

[moving to pgsql-hackers; this isn't the simple bug I initially
suspected it might be]

On 20/09/10 03:10, Tom Lane wrote:
> Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
>> One of the correctly encoded messages is "Unexpected EOF received on
>> client connection"
>
>> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
>> request received". Another is "Aborting any active transactions".
>
>> ... question now is where the messages are converted from UTF-8 to shift-JIS
>> and why that conversion is being applied inconsistently.
>
> Given those three examples, I wonder whether all the mis-encoded
> messages are emitted by the postmaster, rather than backends.
> Anyway it seems that you ought to look for some pattern in which
> messages are correctly vs incorrectly encoded.

I think you're right. Looking into it more, though, I'm not sure what
the correct behaviour even is. I don't think this is a simple bug
where Pg fails to convert between encodings in a few places; rather,
it's a design oversight where the effect of having a system encoding
different from the encoding of the database(s) isn't considered.

A single log file should obviously be in a single encoding; it's the
only sane way to do things. But which encoding is it in? And which
*should* it be in?

- The system text encoding? This is what the postmaster will have from
  its environment, and is what the user will expect the logs to be in.
  Postmaster will emit messages in this encoding at least during
  startup, as it doesn't know what encoding the cluster uses yet.
  (In fact it seems to stick to the system encoding throughout its
  life).

- The default database encoding supplied to initdb during cluster
  creation?

- The encoding of the database emitting a message? This makes sense
  when considering RAISE messages, for example. Backends will currently
  use this encoding when emitting log messages, whether user-supplied
  or translated from po files.

This confusion leads to the mixed encoding issues reported by the OP.
It's not a simple bug, it's a design issue.
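
To make the symptom concrete, here is a rough Python sketch (not
PostgreSQL code). The two byte strings are made-up stand-ins for a line
written by a process using UTF-8 and a line written by a process using
Shift-JIS; the helper name decodes_cleanly is hypothetical.

```python
# Illustrative log lines only: one encoded as UTF-8, one as Shift-JIS,
# appended to the same file as two differently configured processes would.
utf8_line = "日本語 message from a backend\n".encode("utf-8")
sjis_line = "日本語 message from the postmaster\n".encode("shift_jis")
log_bytes = utf8_line + sjis_line


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_cleanly(utf8_line, "utf-8"))      # the UTF-8 line alone
print(decodes_cleanly(sjis_line, "shift_jis"))  # the Shift-JIS line alone
print(decodes_cleanly(log_bytes, "utf-8"))      # the combined file
```

Each line is valid on its own, but the combined file is not valid
UTF-8, so a reader that assumes one encoding for the whole file will
see garbage or errors on some lines.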

Unfortunately, it's not as simple as picking one of the above encodings
for all logging.

The system encoding isn't a good choice, because it might not be capable
of representing all characters emitted by user RAISE statements in
databases with a different encoding, nor all "double quoted"
identifiers, parameter values, etc etc etc. For example, if the system
encoding is SHIFT-JIS, but user databases emit messages with Chinese,
Cyrillic, extended latin, or pretty much any non-Japanese characters,
there's no sane way to convert messages containing any user text to
shift-JIS for logging. The same applies with a latin-1 (iso-8859-1)
system encoding and a utf-8 or shift-jis database emitting Japanese
messages. Scratch using the system encoding for logging.
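
A small Python check of the representability problem, as a sketch only.
It uses Python's own "shift_jis" and "latin-1" codecs, which are
assumed here to be reasonable stand-ins for the server-side
conversions; the helper name representable and the sample strings are
made up for illustration.

```python
def representable(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Japanese text fits in Shift-JIS, but Korean Hangul does not.
print(representable("日本語", "shift_jis"))
print(representable("한국어", "shift_jis"))
# Japanese text does not fit in Latin-1 either.
print(representable("日本語", "latin-1"))
```

Any single-language system encoding hits this as soon as a database in
a wider encoding puts user text into a message.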

What about the encoding used by initdb to create the cluster? It seems
sensible, but:
- The postmaster doesn't know what it is when it's doing its initial
  startup. How can the postmaster complain that it can't find / open
  the cluster datadir when it doesn't know what encoding to use for the
  complaint?
- If the cluster isn't created as utf-8, the same issue as with the
  system encoding applies.

Using the encoding of the emitting database will permit all messages to
be represented, but will give rise to mixed encodings in the log file,
and still won't help the postmaster know what to do before it's found
and read the cluster.

I'm now inclined to propose that all logging be done unconditionally in
utf-8, with a BOM written to the start of every log file. Backends with
non-utf-8 databases should convert messages to utf-8 for logging.
Because PostgreSQL supports the use of different encodings in different
databases, this is the only way to ensure sane logging with consistent
encoding in a single log file.
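
Roughly what I have in mind, sketched in Python rather than as a patch.
The function name append_log and its parameters are hypothetical, and
this ignores concurrent writers and log rotation entirely; it only shows
the "convert to UTF-8, BOM once at the start of the file" idea.

```python
import codecs
import os


def append_log(path: str, raw_message: bytes, source_encoding: str) -> None:
    """Append one message to a UTF-8 log, writing a BOM if the file is new.

    raw_message is the message in the emitting database's encoding;
    source_encoding names that encoding.
    """
    text = raw_message.decode(source_encoding)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as log:
        if is_new:
            # The BOM goes only at the very start of the file.
            log.write(codecs.BOM_UTF8)
        log.write(text.encode("utf-8") + b"\n")
```

Messages from a Shift-JIS database and a Latin-1 database would then
land in the same file as valid UTF-8, and the whole file can be read
back with a single decoder.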

The only alternative I see is to break logging out into separate files:
- postmaster.log for postmaster etc
- [databasename].log for each database, in that database's encoding
... but I'm not confident that'd be worth the confusion.

Neither scheme solves the question of what to do when logging to syslog,
though. Syslog expects messages in the system encoding, and Pg would be
wrong to log in any other encoding. Yet as databases may have characters
that cannot be represented in the system encoding, the system encoding
isn't good enough. Should syslog messages be converted to the system
encoding with non-representable characters replaced by "?" or some other
placeholder? Blech.
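
For reference, the lossy "?" substitution looks like this in Python.
This is only an illustration of the behaviour, with latin-1 standing in
for the system encoding; the function name to_system_encoding and the
sample message are made up.

```python
def to_system_encoding(text: str, system_encoding: str) -> bytes:
    """Encode text for syslog, losing anything that cannot be represented."""
    # errors="replace" substitutes "?" for each unencodable character.
    return text.encode(system_encoding, errors="replace")


print(to_system_encoding('relation "日本語" does not exist', "latin-1"))
```

The identifier comes out as "???", which keeps syslog in one encoding
but throws away exactly the part of the message the admin needs.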

Ideas? Suggestions?

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/
