Re: BUG #5661: The character encoding in logfile is confusing.

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: BUG #5661: The character encoding in logfile is confusing.
Date: 2010-09-25 03:33:03
Message-ID: 4C9D6D6F.4050806@postnewspapers.com.au
Lists: pgsql-bugs pgsql-hackers

On 09/22/2010 09:55 PM, Tom Lane wrote:
> Peter Eisentraut<peter_e(at)gmx(dot)net> writes:
>> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>>> I still wonder if, rather than making this configurable, the right
>>> choice is to force logging to UTF-8 (with BOM) across the board,
>
>> I don't think this would make things better or easier. At some point
>> you're going to have to insert a recode call, and it doesn't matter much
>> whether the destination argument is a constant or a variable.
>
> It'd avoid the problem of having possibly-unconvertable messages ...
> at the cost of pissing off users who have a uniform server encoding
> selection already and don't see why they should be forced to deal with
> UTF8 in the log.
>
> It's pretty much just one step from here to deciding that the server
> should work exclusively in UTF8 and never mind all those other legacy
> encodings. We've resisted that attitude for quite some years now,
> and are probably not really ready to adopt it for the log either.

Fair enough. The current approach is broken, though. Mis-encoded
messages the user can't read are little better for them than messages
that are never logged.

I see four options here, plus the status quo as (0). Two of them are
practical IMO:

(1) Log in UTF-8, convert everything to UTF-8. Better for admin tools &
apps, sucks for OS utilities/grep/etc on non-utf-8 locales. Preserves
all messages no matter what the database and system encodings are.

(2) Log in the default encoding for the locale, convert all messages to
that encoding. Where characters cannot be represented in the target
encoding, replace them with a placeholder (? or something). Better - but
far from good - for OS utilities/grep/etc, sucks for admin tools and apps.
Doesn't preserve all messages properly if the user has databases in
encodings other than the system encoding.

(3) Have a log for the postmaster in the default locale for the system.
Have a log file for each database that's in the encoding for that
database. IMO this is the worst of both worlds, but it does preserve
original encodings without transcoding or forcing a particular encoding
and does preserve messages. Horribly complicated for admin tools,
inconsistent and horrid for grep etc.

(4) Keep things much as they are, but log an encoding identifier prefix
for each line. Lets GUI/admin tools post-process the logs into something
sane, permits automated log processing because line encodings are known.
Sucks for shell tools, which can't tell which lines are which; we'd need
to provide a "pggrep" and "pgless" for reliable log search! Preserves
all messages, but not in a reliably searchable manner.

(0) Change nothing. Log all messages in the original encoding they were
generated in. Perform no conversion. Logs contain mixed encodings.
Horrible for admin/gui tools (broken text). Horrible for shell
utilities/OS tools (can't trust grep results etc). Automatic log
processing impossible as the encoding for each line isn't known and
can't be reliably discovered.
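
To make (4) a bit more concrete, here's a rough Python sketch of the
post-processing an admin tool could do if every line carried an encoding
tag. The "[ENCODING] " prefix format and the function names are invented
for illustration; they are not an existing PostgreSQL log format.

```python
# Hypothetical line format for option (4): b"[<encoding>] <message bytes>".
# The bracketed tag is plain ASCII; the rest is in the named encoding.

def decode_tagged_line(raw: bytes) -> str:
    """Decode one tagged log line to text using its own encoding tag."""
    tag, sep, body = raw.partition(b"] ")
    if not sep or not tag.startswith(b"["):
        raise ValueError("line has no encoding tag")
    encoding = tag[1:].decode("ascii")
    # errors="replace" keeps a damaged line readable instead of aborting.
    return body.decode(encoding, errors="replace")


def normalize_log(lines):
    """Convert a mixed-encoding tagged log into uniform UTF-8 byte strings."""
    return [decode_tagged_line(line).encode("utf-8") for line in lines]
```

This is also why plain grep doesn't help with (4): the search pattern
would have to be re-encoded per line before matching, which is exactly
what a "pggrep" wrapper would need to do.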

As far as I'm concerned, (3) is out. It's horrible. I don't think the
status quo (0) is OK either; it's producing broken log files. (4) is
pretty awful too, but it's the smallest change that kind-of fixes the
issue to the point where it's at least possible for PgAdmin etc to
convert the logs into a consistent encoding.

IMO it's down to (1) and (2). There's no clear consensus between those
two, so I'd be inclined to offer the admin the choice between them as a
config option, depending on the trade-off they prefer to make.
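
A rough Python sketch of what such a switch would mean for each message.
The mode values "utf8" and "locale" and the function names are
hypothetical, not existing settings, and the real work would have to
happen in the server's C code; this only illustrates the trade-off.

```python
import codecs


def encode_log_message(message: str, mode: str, locale_encoding: str) -> bytes:
    """Encode one log message under option (1) or option (2).

    mode is a hypothetical setting: "utf8" selects (1), "locale" selects (2).
    """
    if mode == "utf8":
        # Option (1): every character survives, whatever the locale is.
        return message.encode("utf-8")
    if mode == "locale":
        # Option (2): characters the locale encoding can't represent
        # are replaced with "?".
        return message.encode(locale_encoding, errors="replace")
    raise ValueError(f"unknown mode: {mode!r}")


def utf8_log_header() -> bytes:
    """Return the BOM that would be written once at the start of a UTF-8 log."""
    return codecs.BOM_UTF8
```

With a Japanese message and a Latin-1 system locale, the "locale" mode
yields only question marks while the "utf8" mode keeps the text. When
the locale encoding is itself UTF-8, both modes produce identical bytes.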

For sensible systems in a utf-8 locale (1) and (2) are equivalent, and
(2) is fine for systems where the database encoding is always the same
as the system encoding. It's only for systems with a non-utf-8 locale
that use databases in encodings other than the system locale's encoding
that problems arise. In this case they're going to get suboptimal
results one way or the other; it's just a matter of letting them pick how.

Thoughts?

I should ask on the various language-specific mailing lists and see what
people there have to say about it. Maybe it doesn't affect people enough
in practice for them to care.

--
Craig Ringer
