Re: PATCH: Add uri percent-encoding for binary data

From: Anders Åstrand <anders(at)449(dot)se>
To: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: Add uri percent-encoding for binary data
Date: 2019-10-08 18:07:02
Message-ID: CAPwPebuhhnhr6KC45uEVBKwQsa44SdoLozGQDXdD=gEKOto1OA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 7, 2019 at 11:38 PM Isaac Morland <isaac(dot)morland(at)gmail(dot)com> wrote:
>
> On Mon, 7 Oct 2019 at 03:15, Anders Åstrand <anders(at)449(dot)se> wrote:
>>
>> Hello
>>
>> Attached is a patch for adding uri as an encoding option for
>> encode/decode. It uses what's called "percent-encoding" in rfc3986
>> (https://tools.ietf.org/html/rfc3986#section-2.1).
>>
>> The background for this patch is that I could easily build urls in
>> plpgsql, but doing the actual encoding of the url parts is painfully
>> slow. The list of available encodings for encode/decode looks quite
>> arbitrary to me, so I can't see any reason this one couldn't be in
>> there.
>>
>> In modern web scenarios one would probably most likely want to encode
>> the utf8 representation of a text string for inclusion in a url, in
>> which case correct invocation would be ENCODE(CONVERT_TO('some text in
>> database encoding goes here', 'UTF8'), 'uri'), but uri
>> percent-encoding can of course also be used for other text encodings
>> and arbitrary binary data.
>
>
> This seems like a useful idea to me. I've used the equivalent in Python and it provides more options:
>
> https://docs.python.org/3/library/urllib.parse.html#url-quoting
>
> I suggest reviewing that documentation there, because there are a few details that need to be checked carefully. Whether or not space should be encoded as plus and whether certain byte values should be exempt from %-encoding is something that depends on the application. Unfortunately, as far as I can tell there isn't a single version of URL encoding that satisfies all situations (thus explaining the complexity of the Python implementation). It might be feasible to suppress some of the Python options (I'm wondering about the safe= parameter) but I'm pretty sure you at least need the equivalent of quote and quote_plus.

Thanks a lot for your reply!

I agree that some (but not all) of the options available in that
Python library could be helpful for developers who want to build URLs
without having to encode the separate parts and stitch them
together, but they are not necessary for this patch to be useful. For
generic URI encoding the slash (/) must be percent-encoded, because it
has special meaning in the standard. Some other characters may
appear unencoded depending on context, but it's generally safer
to encode them all rather than hope that the encoder knows about
the context and skips the right characters.
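
To illustrate the "encode them all" approach, here is a minimal Python
sketch. It is not the patch's C code, and the names UNRESERVED and
uri_encode are made up for this example. It assumes that only the
"unreserved" characters from RFC 3986 section 2.3 are left as-is and
that every other byte, including the slash, is percent-encoded.

```python
# Unreserved characters per RFC 3986 section 2.3 (assumption: nothing
# else is ever left unencoded, regardless of context).
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def uri_encode(data: bytes) -> str:
    """Percent-encode every byte that is not an unreserved character."""
    return "".join(
        chr(b) if b in UNRESERVED else f"%{b:02X}"
        for b in data
    )


# The slash and the space are both encoded, as are the UTF-8 bytes of "å".
print(uri_encode("some text/å".encode("utf-8")))  # some%20text%2F%C3%A5
```

Because the input is bytes, the caller picks the text encoding first,
which mirrors the CONVERT_TO(...) step in the SQL invocation.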

This does bring up an interesting point however. Maybe decode should
validate that only characters that are allowed unencoded appear in the
input?
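
A rough Python sketch of what such a strict decode could look like
(purely illustrative: uri_decode_strict is a hypothetical name, and it
assumes that only the RFC 3986 unreserved characters may appear
unencoded; whether the patch should behave this way is exactly the
open question):

```python
# Characters allowed to appear unencoded (RFC 3986 section 2.3);
# treating everything else as an error is an assumption of this sketch.
UNRESERVED_CHARS = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
HEX_DIGITS = frozenset("0123456789abcdefABCDEF")


def uri_decode_strict(text: str) -> bytes:
    """Decode percent-encoded text, rejecting anything that should have been encoded."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            # A percent sign must be followed by exactly two hex digits.
            pair = text[i + 1:i + 3]
            if len(pair) != 2 or not all(c in HEX_DIGITS for c in pair):
                raise ValueError(f"invalid percent escape at position {i}")
            out.append(int(pair, 16))
            i += 3
        elif ch in UNRESERVED_CHARS:
            out.append(ord(ch))
            i += 1
        else:
            raise ValueError(f"character {ch!r} at position {i} must be percent-encoded")
    return bytes(out)
```

With this, uri_decode_strict("a%2Fb%20c") returns b"a/b c", while
"a/b" and "%2" both raise ValueError.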

Luckily, the plus-encoding of spaces is not part of the URI standard
at all; it belongs to the format known as
application/x-www-form-urlencoded. That format is
close to dying now that forms more often post JSON.
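
For reference, the difference is easy to see with the Python functions
mentioned earlier in the thread. This is only a small illustration
using urllib.parse from the standard library; the sample string is
made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b/c+d"

# Plain percent-encoding with no "safe" characters: space becomes %20.
percent_form = quote(text, safe="")

# Form-style encoding (application/x-www-form-urlencoded): space becomes "+".
form_style = quote_plus(text)

print(percent_form)  # a%20b%2Fc%2Bd
print(form_style)    # a+b%2Fc%2Bd

# When decoding, a literal "+" only turns into a space in the form-style variant.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b
```

Note that quote() leaves "/" unencoded by default (safe="/"), which is
why safe="" is passed explicitly here.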

Regards,
Anders
