Re: [GENERAL] psql weird behaviour with charset encodings

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, hgonzalez(at)gmail(dot)com
Subject: Re: [GENERAL] psql weird behaviour with charset encodings
Date: 2015-06-02 07:19:40
Message-ID: CAB7nPqS6GdCefz_Yy3EuWiu2-v0vUtVxE6Yix-2OV1fvBwzssA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote:
>> hgonzalez(at)gmail(dot)com writes:
>> > http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
>>
>> > The last explains why they do not consider it a bug:
>>
>> > ISO C99 requires for %.*s to only write complete characters that fit
below
>> > the
>> > precision number of bytes. If you are using say UTF-8 locale, but
ISO-8859-1
>> > characters as shown in the input file you provided, some of the
strings are
>> > not valid UTF-8 strings, therefore sprintf fails with -1 because of the
>> > encoding error. That's not a bug in glibc.
>>
>> Yeah, that was about the position I thought they'd take.
>
> GNU libc eventually revisited that conclusion and fixed the bug through
commit
> 715a900c9085907fa749589bf738b192b1a2bda5. RHEL 7.1 is fixed, but RHEL
6.6 and
> RHEL 5.11 are still affected; the bug will be relevant for another 8+
years.
>
>> So the bottom line here is that we're best off to avoid %.*s because
>> it may fail if the string contains data that isn't validly encoded
>> according to libc's idea of the prevailing encoding.
>
> Yep. Immediate precisions like %.10s trigger the bug as effectively as
%.*s,
> so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also
affected.
> Switching to strlcpy(), as attached, fixes the bug while simplifying the
code.
> The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"'
when
> the name of a file in the data directory is not a valid multibyte string.
>
>
> Commit 6dd9584 introduced a new use of .*s, to pg_upgrade. It works
reliably
> for now, because it always runs in the C locale. pg_upgrade never calls
> set_pglocale_pgservice() or otherwise sets its permanent locale. It
would be
> natural for us to fix that someday, at which point non-ASCII database
names
> would perturb this status output.

I caught up the following places that need attention on top of the 4 ones
in tar.c:
src/backend/nodes/read.c:
elog(ERROR, "unrecognized integer: \"%.*s\"",
src/backend/nodes/read.c:
elog(ERROR, "unrecognized OID: \"%.*s\"",
src/backend/nodes/read.c: elog(ERROR,
"unrecognized token: \"%.*s\"", tok_len, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token:
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token:
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized
integer: \"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized boolop
\"%.*s\"", length, token);
src/backend/nodes/readfuncs.c: elog(ERROR, "badly formatted node
string \"%.32s\"...", token);
src/backend/tsearch/wparser_def.c: * Use of %.*s here is a bit risky
since it can misbehave if the data is
src/backend/tsearch/wparser_def.c: fprintf(stderr, "parsing
\"%.*s\"\n", len, str);
src/backend/tsearch/wparser_def.c: /* See note above about %.*s */
src/backend/tsearch/wparser_def.c: fprintf(stderr, "parsing copy of
\"%.*s\"\n", prs->lenstr, prs->str);
src/backend/utils/adt/datetime.c: * Note: the uses
of %.*s in this function would be risky if the
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/backend/utils/adt/datetime.c: /* %.*s is safe
since all our tokens are ASCII */
src/backend/utils/adt/datetime.c: elog(LOG, "token
too long in %s table: \"%.*s\"",
src/interfaces/ecpg/ecpglib/error.c: /* %.*s is safe here as long as
sqlstate is all-ASCII */
src/interfaces/ecpg/ecpglib/error.c: ecpg_log("raising sqlstate %.*s
(sqlcode %ld): %s\n",
src/interfaces/ecpg/pgtypeslib/dt_common.c: * Note:
the uses of %.*s in this function would be risky if the
src/interfaces/ecpg/pgtypeslib/dt_common.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/interfaces/ecpg/pgtypeslib/dt_common.c:
sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
src/bin/pg_basebackup/pg_basebackup.c:
ngettext("%*s/%s kB (%d%%), %d/%d tablespace (%s%-*.*s)",
src/bin/pg_basebackup/pg_basebackup.c:
"%*s/%s kB (%d%%), %d/%d tablespaces (%s%-*.*s)",
src/bin/pg_upgrade/util.c: printf("
%s%-*.*s\r",

> It would be good to purge the code of precisions on "s" conversion
specifiers,
> then Assert(!pointflag) in fmtstr() to catch new introductions. I won't
plan
> to do it myself, but it would be a nice little defensive change.

This sounds like a good protection idea, but as it impacts existing backend
code relying in sprintf's port version we should only do the assertion in
HEAD in my opinion, and mention it in the release notes of the next major
version at the time a patch in this area is applied. I guess that we had
better backpatch the places using .*s though on back-branches.
--
Michael

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rémi Cura 2015-06-02 08:24:09 Re: Python 3.2 XP64 and Numpy...
Previous Message Noah Misch 2015-06-02 05:21:21 Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2015-06-02 09:43:06 Re: checkpointer continuous flushing
Previous Message Pavel Stehule 2015-06-02 07:11:12 Re: auto_explain sample rate