Re: BUG #5800: "corrupted" error messages (encoding problem ?)

From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Carlo Curatolo <genamiga(at)brutele(dot)be>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5800: "corrupted" error messages (encoding problem ?)
Date: 2011-09-29 07:44:37
Message-ID: 4E8421E5.7090407@ringerc.id.au
Lists: pgsql-bugs

First, sorry for the slow reply.

Response inline.

On 09/17/2011 08:34 AM, Tom Lane wrote:
> Craig Ringer<ringerc(at)ringerc(dot)id(dot)au> writes:
>> On 09/17/2011 05:10 AM, Carlo Curatolo wrote:
>>> Just tried with PG 9.1...same problem...
>
>> Yep. There appears to be no interest in fixing this bug. All the
>> alternatives I proposed were rejected, and there doesn't seem to be any
>> concern about the issue.
>
> The problem is to find a cure that's not worse than the disease.
> I'm not exactly convinced that forcing all log messages into a common
> encoding is a better behavior than allowing backends to log in their
> native database encoding.
>
> If you do want a common encoding, there's a very easy way to get it, ie,
> standardize on one encoding for all your databases.

Even then, the postmaster may still emit messages in a different encoding
if the system encoding is not the same as the database encoding you've
standardized on.

> People who aren't
> doing that already probably have good reasons why they want to stay with
> the encoding choices they've made; forcing their logs into some other
> encoding isn't necessarily going to improve their lives.

I'm not convinced.

Mixing their logs with messages in other encodings makes it *impossible*
for most people to read them at all. A file with (say) UTF-8, Latin-1 and
Shift-JIS lines interleaved is, as far as most people and most tools are
concerned, hopelessly corrupted; try it and see what I mean. So I
disagree: forcing all logs into one encoding WILL improve their lives
over the current situation, and it won't affect people whose databases
are already all in the system encoding.

In any case, if the system locale uses UTF-8 and the databases are
Latin-1 (for example), the admin might actually prefer UTF-8 logs for
easy reading and processing by system tools, no matter what encoding the
databases are in.

The database encoding is an internal thing; the log encoding is an
external thing. Writing messages to stdout/stderr in an encoding other
than the one specified by LC_CTYPE and LC_MESSAGES is wrong, as it'll
show garbage on a terminal; so, IMO, is logging in a different encoding.

Because there's no standard way to flag a file as having a certain
encoding, I contend that the correct default is to write log files in the
default encoding used by the system: that is what programs consuming the
logs will expect. The only other correct alternative would be to write
UTF-8 logs with a BOM that lets programs unambiguously identify the
encoding. That said, users should probably be able to override the log
file location and encoding, so that a particular database's logs can go
to a separate file in a user-defined encoding, and/or override the
default encoding Pg writes its logs in.
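
For what it's worth, checking for that BOM is trivial for a consumer.
Something like this (a hypothetical helper for a log-reading tool, not
part of Pg) is all that's needed:

    /* Return 1 if the file starts with the UTF-8 BOM (bytes EF BB BF),
     * otherwise rewind and return 0. Hypothetical example only. */
    #include <stdio.h>
    #include <string.h>

    static int
    file_has_utf8_bom(FILE *f)
    {
        unsigned char buf[3];

        if (fread(buf, 1, 3, f) == 3 &&
            memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
            return 1;
        rewind(f);
        return 0;
    }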

>> ... The only valid fixes are to log them to different files (with some
>> way to identify which encoding is used)
>
> I don't recall having heard any serious discussion of such a design, but
> perhaps doing that would satisfy some use-cases. One idea that comes to
> mind is to provide a %-escape for log_filename that expands to the name
> of the database encoding (or more likely, some suitable abbrevation).
> The logging collector protocol would have to be expanded to include that
> information, but that seems do-able.

That'd work, though it doesn't solve the problem for people logging to
syslog or to a single file.
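
Still, for people who can use it, I imagine it'd look something like this
(the %e escape letter is invented here purely for illustration; nothing
like it exists today):

    logging_collector = on
    # hypothetical: %e expands to the database encoding's name, so each
    # encoding ends up in its own file
    log_filename = 'postgresql-%Y-%m-%d_%e.log'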

I think Pg should also be able to convert all messages into a common
encoding for logging to a single file and should default to using the
system encoding as that encoding.

The user could configure a different encoding - for example, they might
want to force utf-8 logging because their databases may have all sorts
of different encodings, but they're logging to syslog so they can't
split logs out to different files.

A special log destination encoding name, say "log_encoding = database"
could be used to bypass all encoding conversion, retaining the current
behaviour of logging in whatever encoding the database happens to use.
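
Concretely, I'm picturing something like this in postgresql.conf (the GUC
name and the spelling of the values are obviously open to debate, and
none of it exists yet):

    # Proposed, not implemented. The default, if the GUC is unset,
    # would be the encoding of the system locale (LC_CTYPE).
    log_encoding = 'UTF8'        # convert all messages to UTF-8 before writing
    #log_encoding = 'database'   # bypass conversion; current behaviour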

I'm willing to implement this setup (or try, at least) if you think it's
a reasonable thing to do. I don't know how I'll go with multi-file
logging in log_filename, but I'm pretty sure I can handle the log
message encoding conversion and associated configuration directives.

There's some overhead to encoding conversion, but it's pretty minimal.
It can be avoided entirely by ensuring that the log destination encoding
is the same as the Pg database encoding, which under this scheme you can
do by setting "log_encoding = database" and either sticking to one
database encoding or using multi-file logging.
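
For what it's worth, the conversion step I have in mind in elog.c is
roughly the shape below. convert_message_for_log() and the
log_target_encoding argument are placeholders for whatever the GUC
resolves to, and handling of bytes that can't be converted is glossed
over; this is a sketch, not the patch:

    /* Assumes the usual postgres.h / mb/pg_wchar.h includes. */
    static char *
    convert_message_for_log(char *msg, int log_target_encoding)
    {
        int db_encoding = GetDatabaseEncoding();

        /* No conversion - and no overhead - when the encodings match. */
        if (db_encoding == log_target_encoding)
            return msg;

        /* Returns a palloc'd copy in the target encoding, or msg itself
         * if no conversion was needed. */
        return (char *) pg_do_encoding_conversion((unsigned char *) msg,
                                                  strlen(msg),
                                                  db_encoding,
                                                  log_target_encoding);
    }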

Reasonable plan?

--
Craig Ringer
