Re: Pre-proposal: unicode normalized text

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pre-proposal: unicode normalized text
Date:	2023-10-07 01:18:01
Message-ID:	96c0173c5156d365e132ec29e4873237be565743.camel@j-davis.com
Lists:	pgsql-hackers

On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> > At minimum I think we need to have some internal functions to check
> > for unassigned code points. That belongs in core, because we generate
> > the unicode tables from a specific version.
>
> That's a good idea.

Patch attached.

I added a new Perl script to parse UnicodeData.txt and generate a
lookup table of code point ranges, which can be binary-searched.
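To illustrate the idea of a binary-searched range table, here is a
minimal sketch. It is not taken from the patch: every name (demo_range,
demo_table, demo_general_category) is hypothetical, and the table is a
tiny hand-written excerpt rather than generated output. The enum values
are chosen to line up with ICU's UCharCategory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; numeric values follow ICU's UCharCategory. */
typedef enum
{
	DEMO_UNASSIGNED = 0,		/* Cn */
	DEMO_UPPERCASE_LETTER = 1,	/* Lu */
	DEMO_LOWERCASE_LETTER = 2,	/* Ll */
	DEMO_DECIMAL_NUMBER = 9,	/* Nd */
	DEMO_SPACE_SEPARATOR = 12,	/* Zs */
	DEMO_CONTROL = 15			/* Cc */
} demo_category;

/* One entry covers the inclusive code point range [first, last]. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
	demo_category category;
} demo_range;

/*
 * Hand-written excerpt for illustration only. A generated table would
 * cover every assigned code point; here, ASCII punctuation and everything
 * above U+007A is simply missing, so lookups there are not meaningful.
 * Entries must be sorted by code point and must not overlap.
 */
static const demo_range demo_table[] = {
	{0x0000, 0x001F, DEMO_CONTROL},
	{0x0020, 0x0020, DEMO_SPACE_SEPARATOR},
	{0x0030, 0x0039, DEMO_DECIMAL_NUMBER},
	{0x0041, 0x005A, DEMO_UPPERCASE_LETTER},
	{0x0061, 0x007A, DEMO_LOWERCASE_LETTER},
};

#define DEMO_TABLE_LEN (sizeof(demo_table) / sizeof(demo_table[0]))

/* Binary search over the half-open index interval [lo, hi). */
demo_category
demo_general_category(uint32_t cp)
{
	size_t		lo = 0;
	size_t		hi = DEMO_TABLE_LEN;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (cp < demo_table[mid].first)
			hi = mid;
		else if (cp > demo_table[mid].last)
			lo = mid + 1;
		else
			return demo_table[mid].category;
	}

	/* Not covered by any range: report it as unassigned. */
	return DEMO_UNASSIGNED;
}

/* Sanity check: the search above relies on sorted, disjoint ranges. */
void
demo_check_table(void)
{
	for (size_t i = 0; i < DEMO_TABLE_LEN; i++)
	{
		assert(demo_table[i].first <= demo_table[i].last);
		if (i > 0)
			assert(demo_table[i - 1].last < demo_table[i].first);
	}
}
```

Storing ranges instead of one entry per code point keeps the table
small, since long runs of code points share a category, and the lookup
stays O(log n) in the number of ranges.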

The C entry point does the same thing as ICU's u_charType(), and I also
matched the enum numeric values for convenience. I didn't use
u_charType() because I don't think this kind of Unicode functionality
should depend on ICU, and I think it should match other Postgres
Unicode functionality.

Strictly speaking, I only needed to know whether a code point is
unassigned or not, not its general category. But it seemed easy enough
to return the general category, and that will make it easier to build
other potentially useful functions on top of this.
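As a rough illustration of building on top of a general-category
lookup, the helpers below derive "is assigned" and "is a letter" from a
category value. The names are hypothetical and not from the patch; the
numeric values assume the same numbering as ICU's UCharCategory, where 0
is unassigned and 1 through 5 are the letter categories.

```c
#include <stdbool.h>

/* Hypothetical constants, assuming ICU-compatible numbering. */
enum
{
	DEMO_CAT_UNASSIGNED = 0,	/* Cn */
	DEMO_CAT_LU = 1,			/* uppercase letter */
	DEMO_CAT_LL = 2,			/* lowercase letter */
	DEMO_CAT_LT = 3,			/* titlecase letter */
	DEMO_CAT_LM = 4,			/* modifier letter */
	DEMO_CAT_LO = 5				/* other letter */
};

/* The check actually needed: anything but Cn counts as assigned. */
bool
demo_is_assigned(int category)
{
	return category != DEMO_CAT_UNASSIGNED;
}

/* An example of another predicate that comes nearly for free. */
bool
demo_is_letter(int category)
{
	return category >= DEMO_CAT_LU && category <= DEMO_CAT_LO;
}
```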

The tests do require ICU though, because I compare with the results of
u_charType().

Regards,
Jeff Davis

Attachment	Content-Type	Size
v1-0001-Internal-functions-for-determining-Unicode-genera.patch	text/x-patch	201.9 KB

In response to

Re: Pre-proposal: unicode normalized text at 2023-10-04 17:16:22 from Robert Haas

Responses

Re: Pre-proposal: unicode normalized text at 2023-10-10 06:44:50 from Peter Eisentraut
