Re: proposal: unescape_text function

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: unescape_text function
Date: 2020-11-30 21:15:32
Message-ID: CAFj8pRCPEnBZushTEB4VQjpZJyK6czS0AFkP8e6wZQ22jpV2_Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, 30 Nov 2020 at 14:14, Peter Eisentraut <
peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:

> On 2020-11-29 18:36, Pavel Stehule wrote:
> >
> > I don't really get the point of this function. There is AFAICT no
> > function to produce this escaped format, and it's not a recognized
> > interchange format. So under what circumstances would one need to
> > use this?
> >
> >
> > Some corporate data can be in CSV format with escaped unicode
> > characters. Without this function it is not possible to decode these
> > files without external application.
>
> I would like some supporting documentation on this. So far we only have
> one stackoverflow question, and then this implementation, and they are
> not even the same format. My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
I checked this, and the format is "prefix backslash-u hex", which is used by
Java, JavaScript or RTF; see
https://billposer.org/Software/ListOfRepresentations.html

Some languages ship a decoder for it (Python has the "unicode-escape"
codec). In the Java world there is an escapeJava method (StringEscapeUtils
in Apache Commons) for conversion from Unicode to ASCII. I can imagine that
this data comes from Java systems exporting to 8-bit strings, so that
implementation can be taken as the reference. The native2ascii tool uses
this format too:
https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
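
To make the format concrete, here is a minimal Python sketch of decoding
"prefix backslash-u hex" text. It is only an illustration, not the patch:
the function name unescape_text is borrowed from the thread subject, and the
handling of surrogate pairs (as Java-style escapers emit for characters
outside the BMP) is an assumption about the input. It does not treat a
doubled backslash specially.

```python
import re

# Matches a backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_text(s: str) -> str:
    """Decode \\uXXXX sequences; other text is passed through unchanged."""
    # Step 1: turn every \uXXXX into the corresponding UTF-16 code unit.
    units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)
    # Step 2: a round trip through UTF-16 joins adjacent surrogate pairs
    # into a single code point. A lone surrogate raises UnicodeDecodeError.
    return units.encode("utf-16", "surrogatepass").decode("utf-16")


if __name__ == "__main__":
    print(unescape_text(r"\u0160koda"))     # BMP character
    print(unescape_text(r"\ud83d\ude00"))   # surrogate pair
```

Python's built-in "unicode-escape" codec accepts the same \uXXXX sequences,
but it also interprets other escapes such as \n and \x41, so it is broader
than the format discussed here.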

Postgres can decode this format too, and the patch is based on the Postgres
implementation. I just implemented a different interface.

Currently the decode function only does a text->bytea transformation. Maybe
more generic functions "decode_text" and "encode_text" would be better for
cases like this (here we need a text->text transformation), but that looks
like overengineering for now.

Maybe we could introduce a new encoding "ascii" and implement new
conversions "ascii_to_utf8" and "utf8_to_ascii". That looks like the
cleanest solution. What do you think about it?
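
As a rough illustration of the reverse direction, the following Python
sketch escapes every non-ASCII character as backslash-u hex. The name
utf8_to_ascii is taken from the conversion name floated above, and the
behaviour is hypothetical: lowercase hex digits, surrogate pairs for
characters outside the BMP, and no escaping of the backslash itself are all
assumptions, not anything specified in this thread.

```python
def utf8_to_ascii(s: str) -> str:
    """Escape non-ASCII characters as \\uXXXX (hypothetical behaviour)."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is copied unchanged (including backslashes).
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            # Characters above the BMP become a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


if __name__ == "__main__":
    print(utf8_to_ascii("\u0160koda"))
```

Because a literal backslash is not escaped here, the output is not always
reversible; a precise specification would have to settle that point.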

Regards

Pavel
