Re: proposal: unescape_text function

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: unescape_text function
Date: 2020-12-02 18:30:39
Message-ID: CAFj8pRC1UufDW45WOFz5rH6uiOTBaWU-sQ5BLkEyeAiV9M6VLA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 2, 2020 at 11:37, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
wrote:

>
>
> On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut <
> peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:
>
>> On 2020-11-30 22:15, Pavel Stehule wrote:
>> > I would like some supporting documentation on this. So far we only have
>> > one stackoverflow question, and then this implementation, and they are
>> > not even the same format. My worry is that if there is not precise
>> > specification, then people are going to want to add things in the
>> > future, and there will be no way to analyze such requests in a
>> > principled way.
>> >
>> >
>> > I checked this and it is "prefix backslash-u hex" used by Java,
>> > JavaScript or RTF -
>> > https://billposer.org/Software/ListOfRepresentations.html
>>
>> Heh. The fact that there is a table of two dozen possible
>> representations kind of proves my point that we should be deliberate in
>> picking one.
>>
>> I do see Oracle unistr() on that list, which appears to be very similar
>> to what you are trying to do here. Maybe look into aligning with that.
>>
>
> unistr is a primitive form of the proposed function, but it can be used as
> a base. The format is compatible with our "4.1.2.3. String Constants with
> Unicode Escapes".
>
> What do you think about the following proposal?
>
> 1. unistr(text) .. compatible with Postgres Unicode escapes. It is an
> enhancement over Oracle, because Oracle's unistr doesn't support 6-digit
> code points.
>
> 2. there can be an optional parameter "prefix" with default "\". With
> "\u" it can be compatible with Java or Python.
>
> What do you think about it?
>

I thought about it a little bit more, and the prefix specification does not
make much sense (even less so if we implement this functionality as a
function named "unistr"). I removed the optional argument and renamed the
function to "unistr". The functionality is the same. It now supports the
Oracle convention (\XXXX), the Java and Python convention (\uXXXX, plus
\UXXXXXXXX for Python), and \+XXXXXX. These formats were already supported.
The compatibility with Oracle is nice.
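
For readers who want to see the four escape forms side by side, here is a
rough reference decoder in Python. It is only a sketch under my own
assumptions, not the C code from the attached patch: the name unistr_sketch
is made up, I assume a doubled backslash stands for a literal backslash (as
in our Unicode escape string constants), any other backslash is treated as
an error, and UTF-16 surrogate pairs are not combined.

```python
import re

# Hypothetical reference decoder; NOT the implementation in unistr.patch.
# Alternatives inside the first branch, in order:
#   \\           doubled backslash (assumed to mean a literal backslash)
#   \+XXXXXX     six hex digits
#   \uXXXX       four hex digits
#   \UXXXXXXXX   eight hex digits
#   \XXXX        four hex digits (Oracle style)
# The final branch catches any backslash that starts none of the above.
_ESCAPE = re.compile(
    r"\\(?:(\\)"
    r"|\+([0-9A-Fa-f]{6})"
    r"|u([0-9A-Fa-f]{4})"
    r"|U([0-9A-Fa-f]{8})"
    r"|([0-9A-Fa-f]{4}))"
    r"|(\\)"
)


def unistr_sketch(s: str) -> str:
    """Replace the escape forms listed above with the characters they name."""

    def repl(m: "re.Match[str]") -> str:
        if m.group(6) is not None:
            # A backslash that does not begin a recognized escape.
            raise ValueError(f"invalid Unicode escape at offset {m.start()}")
        if m.group(1) is not None:
            return "\\"
        hexdigits = m.group(2) or m.group(3) or m.group(4) or m.group(5)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(hexdigits, 16))

    return _ESCAPE.sub(repl, s)


if __name__ == "__main__":
    print(unistr_sketch(r"Fran\00E7ais"))
    print(unistr_sketch(r"Odpov\u011Bdn\u00E1 osoba"))
```

The single regular expression keeps the accepted forms visible in one place,
which is roughly the "precise specification" Peter asked for; how the patch
itself treats surrogate pairs and malformed input may differ from this
sketch.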

postgres=# select
'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) ||
'
Chinese : ' || unistr( '\4E2D\6587' ) ||
'
English : ' || unistr( 'English' ) ||
'
French : ' || unistr( 'Fran\00E7ais' ) ||
'
German : ' || unistr( 'Deutsch' ) ||
'
Greek : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) ||
'
Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) ||
'
Japanese : ' || unistr( '\65E5\672C\8A9E' ) ||
'
Korean : ' || unistr( '\D55C\AD6D\C5B4' ) ||
'
Portuguese : ' || unistr( 'Portugu\00EAs' ) ||
'
Russian : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) ||
'
Spanish : ' || unistr( 'Espa\00F1ol' ) ||
'
Thai : ' || unistr( '\0E44\0E17\0E22' )
as unicode_test_string;
┌──────────────────────────┐
│ unicode_test_string │
╞══════════════════════════╡
│ Arabic : العربية ↵│
│ Chinese : 中文 ↵│
│ English : English ↵│
│ French : Français ↵│
│ German : Deutsch ↵│
│ Greek : Ελληνικά ↵│
│ Hebrew : עברית ↵│
│ Japanese : 日本語 ↵│
│ Korean : 한국어 ↵│
│ Portuguese : Português↵│
│ Russian : Русский ↵│
│ Spanish : Español ↵│
│ Thai : ไทย │
└──────────────────────────┘
(1 row)

postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│ unistr │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)

New patch attached

Regards

Pavel

> Pavel
>

Attachment Content-Type Size
unistr.patch text/x-patch 10.5 KB
