Re: BUG #16236: Invalid escape encoding

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stéphane Campinas <stephane(dot)campinas(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16236: Invalid escape encoding
Date: 2020-01-30 15:29:39
Message-ID: 10420.1580398179@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

[ please keep the list cc'd ]

=?utf-8?B?U3TDqXBoYW5l?= Campinas <stephane(dot)campinas(at)gmail(dot)com> writes:
> myDatabaseName=# select encode('\x00017F80', 'escape');
> encode
> ------------------
> \000\x01\x7F\200

> If I understand correctly, with the input "\x00017F80", I get the
> outputted value above because:
> - "00" is converted to "\000"
> - "01" and "7F" get converted to "\x01" and "\x7F" respectively as they
> are not 0 or a high-bit-set value
> - "80" is converted to "\200" since it is a high-bit-set value

The point here is that the encode function is only doing the first
and last of those things. It lets the 01 and 7F bytes through
as-is, because the text data type can store and transport those
just fine. It's psql's table-printing code that is deciding that
those bytes are nonprintable and then choosing to render them in
the \x01 style. (The large distance between those bits of code
helps to explain the inconsistency of style.)

> Do you know why there is this distinction between high-bit-set values
> and other non-printable characters ?

Probably, whoever wrote the encode-as-escape code didn't see a need
to escape anything that type text could store without it. That code's
old enough that it might predate psql's decision to render control
characters this way, too. (Type text won't store zero bytes, and
it will only accept high-bit-set bytes if they form part of a
validly encoded character, which limits the allowed sequences if the
database encoding is, say, UTF8. So those cases *have* to be escaped
in order to turn any valid bytea into a valid text object.)

There's certainly an argument to be made that it'd be more friendly
for encode() to escape these other byte values as well. But the
code is operating as designed.

> First, the following is strange: I cannot decode what the encode method
> returned
> myDatabaseName=# select encode('\x00017F80', 'escape');
> encode
> ------------------
> \000\x01\x7F\200
> (1 row)
> myDatabaseName=# select decode('\000\x01\x7F\200', 'escape');
> ERROR: invalid input syntax for type bytea

That's because that's *not* what encode() returned, it's just how
psql chose to print it. One way to write what encode() really
returned is

regression=# select octet_length(E'\\000\x01\x7F\\200'::text);
octet_length
--------------
10
(1 row)

regression=# select decode(E'\\000\x01\x7F\\200'::text, 'escape');
decode
------------
\x00017f80
(1 row)

> Second, as I was poking around the code, I found out about the
> "bytea_output". If I set it to "escape", I still get hexadecimals. Is
> that expected ?

Yes, because encode()'s output is type text and hence not subject
to that setting. If you were looking for a way to control what
psql does with these bytes, you'd have to look into its commands,
probably \pset. (I don't think there is a way to control it,
but if there was, that's where we'd put it.)

regards, tom lane

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Walker, Jared (Contractor) 2020-01-30 16:21:29 RE: [EXTERNAL] Re: BUG #16131: pg_upgrade 9 -> 11.6 and a database is missing
Previous Message Daniel Gustafsson 2020-01-30 14:12:30 Re: Moderate Error in German help of pg_ctl