From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Can we make pg_strcasecmp(), pg_tolower(), pg_toupper() plain ASCII semantics? |
Date: | 2025-10-20 21:02:47 |
Message-ID: | b2a9bec4d9fb7407967e3c4b762b990155a17340.camel@j-davis.com |
Lists: | pgsql-hackers |
pg_strcasecmp(), etc., have a dependency on LC_CTYPE, which means a
dependency on setlocale(). I'd like to eliminate those dependencies in
the backend because they cause significant annoyance, especially when
using non-libc providers.
Right now, these functions are effectively very close to plain ASCII
semantics. If the character is in the ASCII range, then they fold only
the characters A..Z. If using a multibyte encoding, any other byte is part
of a multibyte sequence, so the behavior of tolower() is undefined, and
I believe it usually returns 0.
So the only time tolower() matters is when using a single-byte encoding
and folding a character outside the ASCII range.
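To make "plain ASCII semantics" concrete, here is a minimal sketch of what such functions could look like. It is not the attached patch: the names sketch_ascii_tolower() and sketch_ascii_strcasecmp() are hypothetical and exist only for illustration. The point is that only A..Z is folded and every byte with the high bit set passes through unchanged, so nothing depends on LC_CTYPE or setlocale().

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): fold only A..Z.
 * Any other byte, including bytes with the high bit set, is returned
 * unchanged, so the result never depends on the current locale.
 */
static unsigned char
sketch_ascii_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Hypothetical helper (not from the patch): case-insensitive comparison
 * using the ASCII-only folding above. Returns 0 when the strings are
 * equal ignoring ASCII case, otherwise the difference of the first pair
 * of folded bytes that do not match.
 */
static int
sketch_ascii_strcasecmp(const char *s1, const char *s2)
{
	for (;;)
	{
		unsigned char ch1 = (unsigned char) *s1++;
		unsigned char ch2 = (unsigned char) *s2++;

		if (ch1 != ch2)
		{
			ch1 = sketch_ascii_tolower(ch1);
			ch2 = sketch_ascii_tolower(ch2);
			if (ch1 != ch2)
				return (int) ch1 - (int) ch2;
		}
		if (ch1 == 0)
			break;
	}
	return 0;
}
```

With this sketch, "SELECT" and "select" compare equal, while the LATIN1 bytes 0xC9 and 0xE9 (upper and lower case E with acute accent) do not, which is exactly the single-byte-encoding case where behavior would change.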
Most of the callers seem to use these functions in a context that only
cares about ASCII, anyway.
There are a few callers where it matters, such as the implementations
of UPPER()/LOWER()/INITCAP() and LIKE. Those already need special
cases, so it's easy to inline them and make use of the pg_locale_t
object, thus avoiding the dependency on the global LC_CTYPE.
There's a comment at the top of the file saying:
  NB: this code should match downcase_truncate_identifier() in
  scansup.c.
but I don't see call sites where that's likely to matter. I'd like to
do something about downcase_identifier() as well, but that has more
serious compatibility issues if someone is affected, so it needs a bit
more care. Also, given that downcase_identifier() checks for a
single-byte encoding and these other functions do not, I don't think
there's any guarantee that they are identical in behavior.
While I can imagine that the tolower() call may have been useful at one
time, the fact that it doesn't work for UTF-8 makes me think it's not
widely relied upon.
Am I missing something? Perhaps it matters for code outside the
backend?
Attached is a patch to remove the tolower() calls from pgstrcasecmp.c,
and fix up the few call sites where it's needed.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Remove-tolower-call-from-pgstrcasecmp.c-functions.patch | text/x-patch | 6.5 KB |