Re: proposal: unescape_text function

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: unescape_text function
Date: 2020-12-01 05:43:27
Message-ID: CAFj8pRAOoDk+fR57tj7PitYd=v2eobTz4+vb4-YZ5hPMmxNr1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

po 30. 11. 2020 v 22:56 odesílatel Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
napsal:

>
>
> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> napsal:
>
>>
>>
>> po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut <
>> peter(dot)eisentraut(at)enterprisedb(dot)com> napsal:
>>
>>> On 2020-11-29 18:36, Pavel Stehule wrote:
>>> >
>>> > I don't really get the point of this function. There is AFAICT no
>>> > function to produce this escaped format, and it's not a recognized
>>> > interchange format. So under what circumstances would one need to
>>> > use this?
>>> >
>>> >
>>> > Some corporate data can be in CSV format with escaped unicode
>>> > characters. Without this function it is not possible to decode these
>>> > files without external application.
>>>
>>> I would like some supporting documentation on this. So far we only have
>>> one stackoverflow question, and then this implementation, and they are
>>> not even the same format. My worry is that if there is not precise
>>> specification, then people are going to want to add things in the
>>> future, and there will be no way to analyze such requests in a
>>> principled way.
>>>
>>>
>> I checked this and it is "prefix backslash-u hex" used by Java,
>> JavaScript or RTF -
>> https://billposer.org/Software/ListOfRepresentations.html
>>
>> In some languages (Python), there is decoder "unicode-escape". Java has
>> a method escapeJava, for conversion from unicode to ascii. I can imagine so
>> these data are from Java systems exported to 8bit strings - so this
>> implementation can be accepted as referential. This format is used by
>> https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
>> tool too.
>>
>> Postgres can decode this format too, and the patch is based on Postgres
>> implementation. I just implemented a different interface.
>>
>> Currently decode function does only text->bytea transformation. Maybe a
>> more generic function "decode_text" and "encode_text" for similar cases can
>> be better (here we need text->text transformation). But it looks like
>> overengineering now.
>>
>> Maybe we introduce new encoding "ascii" and we can implement new
>> conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
>> clean solution. What do you think about it?
>>
>
> a better name of new encoding can be "unicode-escape" than "ascii". We use
> "to_ascii" function for different use case.
>
> set client_encoding to unicode-escape;
> copy tab from xxx;
> ...
>
> but it doesn't help when only a few columns from the table are in
> unicode-escape format.
>
>
probably the most complete solution can be from two steps:

1. introducing new encoding - "ascii_unicode_escape" with related
conversions
2. introducing two new functions - text_escape and text_unescape with two
parameters - source text and conversion name

select text_convert_to('Тимати', 'ascii_unicode_escape')
\u0422\u0438\u043c\u0430\u0442\u0438 .. result is text

select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438',
'ascii_unicode_escape')
┌──────────┐
│ ?column? │
╞══════════╡
│ Тимати │
└──────────┘
(1 row)

>
>
>> Regards
>>
>> Pavel
>>
>>
>>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hou, Zhijie 2020-12-01 05:45:35 RE: [PATCH] Keeps tracking the uniqueness with UniqueKey
Previous Message Justin Pryzby 2020-12-01 05:43:08 Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly