Re: psql weird behaviour with charset encodings

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: hernan gonzalez <hgonzalez(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: psql weird behaviour with charset encodings
Date: 2010-05-07 23:46:42
Message-ID: 3797.1273276002@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

hernan gonzalez <hgonzalez(at)gmail(dot)com> writes:
> The issue is that psql (apparently) tries to convert to UTF8
> (even when it plans to output the raw text, LATIN9 in this case)
> just to compute the length of the field when building the table.
> And for this computation it (apparently) relies on the string
> routines with its own locale, instead of the DB or client encoding.

I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations. But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

    fprintf(fout, "%.*s", bytes_to_output,
            this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print. However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding. I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.
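
To make the bytes-versus-characters distinction concrete, here is a
minimal sketch that counts both for a UTF-8 string. The helper name
utf8_char_count is hypothetical (it is not psql or glibc code), and it
assumes its input is valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count characters in a UTF-8 string by skipping continuation bytes
 * (those of the form 10xxxxxx).  Hypothetical helper for illustration;
 * assumes the input is valid UTF-8.
 */
static size_t
utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the six-byte UTF-8 string "na" "\xc3\xaf" "ve", strlen() reports 6
while utf8_char_count() reports 5, so a precision read as characters
rather than bytes selects a different amount of data.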

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero. I think we're going to have to
change the code to not rely on %.*s here. Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte. (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)
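
One possible way to avoid %.*s, sketched here only as an illustration
and not as the fix that will actually be applied, is to hand the byte
count to fwrite(), which takes no notice of the locale. The helper name
write_bytes is hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write exactly nbytes bytes from buf to fout.  fwrite() treats the
 * data as raw bytes, so LC_CTYPE plays no part.  Returns the number of
 * bytes actually written.  Hypothetical helper for illustration.
 */
static size_t
write_bytes(FILE *fout, const char *buf, size_t nbytes)
{
    assert(fout != NULL && buf != NULL);
    return fwrite(buf, 1, nbytes, fout);
}
```

Under that assumption, the call quoted above would become something
like write_bytes(fout, this_line->ptr + bytes_output[j], bytes_to_output).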

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

regards, tom lane
