Re: Re: Toast,bytea, Text -blob all confusing

From: "Joe Conway" <joseph(dot)conway(at)home(dot)com>
To: "Alex Pilosov" <alex(at)acecape(dot)com>, "Lincoln Yeoh" <lyeoh(at)pop(dot)jaring(dot)my>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Toast,bytea, Text -blob all confusing
Date: 2001-08-29 17:36:25
Message-ID: 013201c130b1$20d7e910$0705a8c0@jecw2k1
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > >For bytea, follow this rule: to escape a null character, use this:
> > >'\\0'. To escape a backslash, use this: '\\\\'.
> > >
> > >Same idea to unescape data.
> >
> > Are there other characters that need to be escaped? I suspect there are
> > more characters that need to be escaped - ctrl chars? single quotes?.
Why
> > four backslashes for one? Is there a definitive documentation anywhere
for
> > what bytea is _supposed_ (not what it might actually be) to be and how
it
> > is to be handled?
>
> Yes, sorry for being unclear on this one. Here's a more detailed
> explanation: Bytea is just a stream of data. On input, it must follow C
> escaping conventions, on output, it will be escaped using C escaping
> conventions.
>
> However, there's a trap: before things get to bytea input handler, they
> are first processed by postgresql general parser. Hence, the string \\0
> given from psql will become \0 when it gets to bytea input handler. String
> \\\\ will become \\. All non-printable characters must be escaped like
> this: \\(octal of character), for ex, chr(255) must be presented as \\377.
> (If you want, you can also use this as an alternative and more generic way
> to escape a backslash, \\134). Single quote must be escaped either as \\47
> or as \'. Note the single backslash. Why only one? Because bytea parser
> doesn't care about single quotes and you only need to escape it for the
> postgresql parser...
>
> So, just keep in mind the double-parsing of input and you'll be safe.
>
> > Also why wouldn't escaping stuff like this work with TEXT then? If a
null
> > is going to be backslash backslash zero, and come out the same way, it
sure
> > looks like TEXT to me :). OK so there's this thing about storage. So
maybe
> Because text is null-terminated, can't have a null inside.
>
> > I could save a byte by just converting nulls to backslash zero and real
> > backslashes to backslash backslash. Tada.
> If you do that, you'll break ordering/comparison. Bytea in memory is
> stored EXACTLY the way input string was, without any escaping, hence, all
> comparisons will be correct ( '\\0'::bytea is less than '\\1'::bytea).
>
> With your representation, comparisons will fail, because in memory, data
> is escaped using some escaping convention that you made up.
>
> > OK it's probably not the same, but having to put four backslashes when
two
> > should be enough to quote one makes me rather puzzled and uneasy.
> Double parsing, hence double escaping.

Great explanation Alex --thanks! I'll add a bit:

I've done about 400,000 inserts and subsequent queries to verify that, from
PHP at least, only four charaters need to be escaped. The inserts were 20
byte strings gotten by concatenating some known text with a counter in a
loop, and then producing a SHA-1 hash. This produces very uniformly
distributed binary data. Then I ran the same loop, except I queried for the
inserted strings. I'm quite confident from this experiment that binary can
reliably be inserted via standard SQL when these four characters are
escaped. Here's the run down:

\\000 First slash is consumed by the general parser, leaving \000 for the
byteain function. If you only use one '\', the general parser converts the
character into a true '\0' byte, and the byteain function sees this byte as
the string terminator and stops. This causes the input string to be
truncated (which seems to confuse many people).

\\012 In my early tests 0x0a (LF) was getting converted to 0x20 (space).
I think this was happening during PHP's parsing, but I'm still not sure.
I'll dig into this some more later.

\\047 As Alex mentioned, the byteain function doesn't treat this as a
special character, but of course the general parser does as this is a single
quote. It also works fine to escape it as \', I just prefer to use all
octals.

\\134 Both the general parser and the byteain function see this (a single
\) as the special escape character. Therefore the general parser turns \\\\
into \\, and the byteain function turns \\ into \ for actual storage. Again,
I prefer to use the octal representation instead.

I hope this helps reduce the concerns and confusion over bytea. If anyone
can help explain why my linefeeds were getting converted to spaces, all the
mysteries would be explained ;-)

-- Joe

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2001-08-29 21:09:46 Re: NetBSD 1.5.1(HP300)
Previous Message Jon Lapham 2001-08-29 17:27:49 Odd rule behavior?