| From: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com>, Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Remaining dependency on setlocale() |
| Date: | 2025-12-17 10:39:05 |
| Message-ID: | dd0cdd1f-e786-426e-b336-1ffa9b2f1fc6@eisentraut.org |
| Lists: | pgsql-hackers |
On 12.12.25 21:11, Jeff Davis wrote:
>> case '\xc7': /* C with cedilla */
>>
>> so the premise that "fuzzystrmatch is designed for ASCII" does not
>> appear to be correct. Needs more analysis.
>>
>> (But apparently it's not multibyte aware at all, so I don't know what
>> to do about that.)
> I didn't notice that, thank you. Agreed, we need a bit more discussion
> around this case as well as soundex().
Soundex is an ASCII-only algorithm; there is no expectation that it
does anything useful with non-ASCII characters, and it doesn't do so
now. So I think using pg_ascii_toupper() is ok. (Users could, for
example, use unaccent to preprocess text.)
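
To illustrate what ASCII-only upper-casing means for Soundex input, here is a minimal sketch. The names ascii_toupper() and ascii_upcase() are stand-ins invented for this example, not the PostgreSQL functions; the assumption is only that pg_ascii_toupper() behaves like the helper below, mapping 'a'..'z' and leaving every other byte alone.

```c
/*
 * Hypothetical stand-in for an ASCII-only upper-casing helper:
 * only the bytes 'a'..'z' are changed, everything else passes through.
 */
static unsigned char
ascii_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/* Upper-case a NUL-terminated buffer in place, one byte at a time. */
static void
ascii_upcase(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char) ascii_toupper((unsigned char) *s);
}
```

With this approach, bytes at 0x80 and above (accented letters in LATIN1, or the pieces of a UTF-8 sequence) are never modified, regardless of locale.
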
One might wonder if the presence of non-ASCII characters should be an
error, but that doesn't have to be the subject of this thread. I
noticed that the Wikipedia page for Soundex even calls out PostgreSQL
for doing things slightly differently from everyone else, but I haven't
studied the details.
For Metaphone, I found the reference implementation linked from its
Wikipedia page, and it looks like our implementation is pretty closely
aligned to that. That reference implementation also contains the
C-with-cedilla case explicitly. The correct fix here would probably be
to change the implementation to work on wide characters. But I think
for the moment you could try a shortcut: use pg_ascii_toupper(), but
if the encoding is LATIN1 (or LATIN9 or whichever other encodings
also contain C-with-cedilla at that code point), then explicitly
uppercase that one character as well. This would preserve the existing
behavior.
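
The shortcut could be sketched roughly as follows. This is not the actual patch: sketch_encoding, has_c_cedilla_at_e7() and metaphone_toupper() are hypothetical names for this example, and the set of encodings is an assumption limited to the two named above. In ISO 8859-1 and ISO 8859-15, lowercase c-with-cedilla is byte 0xE7 and the uppercase form is 0xC7, which is the value the existing case label tests for.

```c
#include <stdbool.h>

/* Hypothetical encoding IDs for this sketch only. */
typedef enum
{
    SKETCH_UTF8,
    SKETCH_LATIN1,
    SKETCH_LATIN9
} sketch_encoding;

/*
 * Assumption: only LATIN1 and LATIN9 are treated as having
 * c-with-cedilla at 0xE7; other single-byte encodings would
 * need to be checked individually.
 */
static bool
has_c_cedilla_at_e7(sketch_encoding enc)
{
    return enc == SKETCH_LATIN1 || enc == SKETCH_LATIN9;
}

/*
 * ASCII-only upper-casing, plus the single special case needed to
 * keep the C-with-cedilla branch reachable in those encodings.
 */
static unsigned char
metaphone_toupper(unsigned char ch, sketch_encoding enc)
{
    if (ch >= 'a' && ch <= 'z')
        return (unsigned char) (ch - 'a' + 'A');
    if (ch == 0xE7 && has_c_cedilla_at_e7(enc))
        return 0xC7;
    return ch;
}
```

In a multibyte encoding such as UTF-8 the byte 0xE7 is left alone, since there it is part of a longer sequence rather than a character of its own.
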
Note that the documentation calls out: "At present, the soundex,
metaphone, dmetaphone, and dmetaphone_alt functions do not work well
with multibyte encodings (such as UTF-8)."