Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-21 22:24:01
Message-ID: 7774b3a64f51b3375060c29871cf2b02b3e85dab.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote:
> found some more. here's my running list of everything user-facing I
> see in core PG code so far that might involve case:
>
> * upper/lower/initcap
> * regexp_*() and *_REGEXP()
> * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
> * citext + replace(), split_part(), strpos() and translate()
> * full text search - everything is case folded
> * unaccent? not clear to me whether CTYPE includes accent folding

No, ctype has nothing to do with accents as far as I can tell. I don't
know if I'm using the right terminology, but I think "case" is a
variant of a character whereas "accent" is a modifier/mark, and the
mark is a separate concept from the character itself.
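
To illustrate the distinction, here is a minimal sketch (not PostgreSQL code; upcase_base_letters() and COMBINING_ACUTE are names made up for this example). It assumes the text is held as an array of Unicode code points in decomposed form, where "é" is the base letter U+0065 followed by the mark U+0301. Changing case touches only the base letter, and the mark passes through unchanged:

```c
#include <stddef.h>
#include <stdint.h>

/* U+0301 COMBINING ACUTE ACCENT: the accent is its own code point. */
#define COMBINING_ACUTE 0x0301

/*
 * Illustrative only: uppercase ASCII base letters in a code point array.
 * Everything else, including combining marks, is left unchanged.
 */
static void
upcase_base_letters(uint32_t *cps, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (cps[i] >= 'a' && cps[i] <= 'z')
            cps[i] -= 'a' - 'A';
    }
}
```

Applied to the pair {0x0065, 0x0301}, this gives {0x0045, 0x0301}: the case of the letter changed, and the accent did not take part.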

> * ltree
> * pg_trgm
> * core PG parser, case folding of relation names

Let's separate it into groups.

(1) Callers that use a collation OID or pg_locale_t:

* collation & hashing
* upper/lower/initcap
* regex, LIKE, formatting
* pg_trgm (which uses regexes)
* maybe postgres_fdw, but might just be a passthrough
* catalog cache (always uses DEFAULT_COLLATION_OID)
* citext (always uses DEFAULT_COLLATION_OID, but probably shouldn't)

(2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are
set to, or use ad-hoc ASCII-only semantics:

* core SQL parser downcase_identifier()
* callers of pg_strcasecmp() (DDL, etc.)
* GUC name case folding
* full text search ("mylocale = 0 /* TODO */")
* a ton of stuff uses isspace(), isdigit(), etc.
* various callers of tolower()/toupper()
* some selfuncs.c stuff
* ...

Might have missed some places.

A new builtin provider would change the behavior of the callers in (1),
but only for users who actually choose that provider. So there's no
compatibility risk there, but it's good to understand what it will affect.

We can, on a case-by-case basis, also consider using the new APIs I'm
proposing for instances of (2). There would be some compatibility risk
there for existing callers, and we'd have to consider whether it's
worth it or not. Ideally, new callers would either use the new APIs or
use the pg_ascii_* APIs.
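
As a rough sketch of what "ASCII-only semantics" means for a caller in (2), the helpers below lowercase only the bytes 'A' through 'Z' and never consult LC_CTYPE. The names ascii_tolower_sketch() and ascii_downcase_sketch() are hypothetical and written for this example; they are not the actual pg_ascii_* functions, though the intent is similar:

```c
#include <stddef.h>

/*
 * Hypothetical helper: lowercase a single byte using ASCII rules only.
 * Unlike tolower(), the result does not depend on the current locale.
 */
static unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/*
 * Lowercase a NUL-terminated string in place. Bytes >= 0x80 (for example
 * the bytes of a multibyte UTF-8 character) are passed through untouched.
 */
static void
ascii_downcase_sketch(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char) ascii_tolower_sketch((unsigned char) *s);
}
```

A caller using tolower() instead could get different results for the same bytes depending on what LC_CTYPE is set to, which is the compatibility question for converting existing callers.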

Regards,
Jeff Davis
