Can we make pg_strcasecmp(), pg_tolower(), pg_toupper() plain ASCII semantics?

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Can we make pg_strcasecmp(), pg_tolower(), pg_toupper() plain ASCII semantics?
Date: 2025-10-20 21:02:47
Message-ID: b2a9bec4d9fb7407967e3c4b762b990155a17340.camel@j-davis.com
Lists: pgsql-hackers

pg_strcasecmp(), etc., have a dependency on LC_CTYPE, which means a
dependency on setlocale(). I'd like to eliminate those dependencies in
the backend because they cause significant annoyance, especially when
using non-libc providers.

Right now, these functions are effectively very close to plain-ASCII
semantics. If the byte is in the ASCII range, only the characters A..Z
are folded. In a multibyte encoding, any other byte is part of a
multibyte sequence, so the result of isupper()/tolower() on it is not
meaningful, and I believe isupper() usually returns 0, which leaves the
byte unchanged.

So the only time tolower() matters is when using a single-byte encoding
and folding a character outside the ASCII range.
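
To make the comparison concrete, here is a minimal sketch of the two
behaviors described above. It is modeled on the description in this
message, not copied from pgstrcasecmp.c, and both function names are
hypothetical.

```c
#include <ctype.h>

/*
 * Hypothetical name. Models the locale-dependent behavior described
 * above: A..Z are folded directly, and a high-bit byte is passed to
 * tolower() only if isupper() says it is uppercase in the current
 * LC_CTYPE.
 */
unsigned char
sketch_tolower_locale(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if ((ch & 0x80) != 0 && isupper(ch))
        ch = (unsigned char) tolower(ch);
    return ch;
}

/*
 * Hypothetical name. Plain ASCII semantics: only A..Z are folded and
 * the result never depends on setlocale().
 */
unsigned char
sketch_tolower_ascii(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}
```

In the "C" locale the two functions agree on every byte value; they can
differ only when LC_CTYPE names a single-byte encoding that defines
uppercase characters above 0x7F.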

Most of the callers seem to use these functions in a context that only
cares about ASCII, anyway.
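
For callers of that kind (keyword or option-name matching, for
example), an ASCII-only comparison is enough. A minimal sketch,
assuming nothing beyond standard C; the function name is hypothetical
and this is not the attached patch:

```c
/*
 * Hypothetical name. Case-insensitive comparison that folds only A..Z
 * and never consults the locale. Returns <0, 0, or >0 like strcmp().
 */
int
sketch_ascii_strcasecmp(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char ch1 = (unsigned char) *s1++;
        unsigned char ch2 = (unsigned char) *s2++;

        if (ch1 != ch2)
        {
            /* Fold both bytes and compare again. */
            if (ch1 >= 'A' && ch1 <= 'Z')
                ch1 += 'a' - 'A';
            if (ch2 >= 'A' && ch2 <= 'Z')
                ch2 += 'a' - 'A';
            if (ch1 != ch2)
                return (int) ch1 - (int) ch2;
        }
        if (ch1 == 0)
            break;              /* both strings ended together */
    }
    return 0;
}
```

With this sketch, "SELECT" and "select" compare equal, while two
different high-bit bytes always compare unequal regardless of LC_CTYPE.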

There are a few callers where it matters, such as the implementations
of UPPER()/LOWER()/INITCAP() and LIKE. Those already need special
cases, so it's easy to inline them and make use of the pg_locale_t
object, thus avoiding the dependency on the global LC_CTYPE.

There's a comment at the top of the file saying:

NB: this code should match downcase_truncate_identifier() in
scansup.c.

but I don't see call sites where that's likely to matter. I'd like to
do something about downcase_identifier() as well, but that has more
serious compatibility issues if someone is affected, so needs a bit
more care. Also, given that downcase_identifier() checks for a
single-byte encoding and these other functions do not, I don't think
there's any guarantee that they are identical in behavior.
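
The difference can be sketched as follows. This is an illustration of
the single-byte check described above, not the scansup.c source; the
function name is hypothetical, and the enc_is_single_byte parameter
stands in for whatever check the real code performs on the database
encoding.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical name. Downcases len bytes of ident into result, which
 * must have room for len + 1 bytes. High-bit bytes are passed through
 * tolower() only when the caller says the encoding is single-byte; in
 * a multibyte encoding they are copied unchanged.
 */
void
sketch_downcase_ident(const char *ident, size_t len,
                      bool enc_is_single_byte, char *result)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned char ch = (unsigned char) ident[i];

        if (ch >= 'A' && ch <= 'Z')
            ch += 'a' - 'A';
        else if (enc_is_single_byte && (ch & 0x80) != 0 && isupper(ch))
            ch = (unsigned char) tolower(ch);
        result[i] = (char) ch;
    }
    result[len] = '\0';
}
```

A function without the enc_is_single_byte guard would still call
isupper()/tolower() on the bytes of a multibyte sequence, which is where
the two code paths could diverge.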

While I can imagine that the tolower() call may have been useful at one
time, the fact that it doesn't work for UTF-8 makes me think it's not
widely relied upon.

Am I missing something? Perhaps it matters for code outside the
backend? 

Attached is a patch to remove the tolower() calls from pgstrcasecmp.c,
and fix up the few call sites where it's needed.

Regards,
Jeff Davis

Attachment: v1-0001-Remove-tolower-call-from-pgstrcasecmp.c-functions.patch (text/x-patch, 6.5 KB)
