| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Remaining dependency on setlocale() |
| Date: | 2025-11-26 01:50:54 |
| Message-ID: | 83F6E6A9-FF29-41C9-9E15-EA5360E912D0@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Nov 26, 2025, at 01:20, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
>
> New series attached with only these changes and a rebase.
>
> Regards,
> Jeff Davis
>
> <v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>
I continued reviewing 0005-0008.
0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR should be the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there should be not a behavior change.
5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
int
ltree_strncasecmp(const char *a, const char *b, size_t s)
{
- char *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+ static pg_locale_t locale = NULL;
+ size_t al_sz = s + 1;
+ char *al = palloc(al_sz);
+ size_t bl_sz = s + 1;
+ char *bl = palloc(bl_sz);
+ size_t needed;
int res;
+ if (!locale)
+ locale = pg_database_locale();
+
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ if (needed + 1 > al_sz)
+ {
+ /* grow buffer if needed and retry */
+ al_sz = needed + 1;
+ al = repalloc(al, al_sz);
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ Assert(needed + 1 <= al_sz);
+ }
+
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ if (needed + 1 > bl_sz)
+ {
+ /* grow buffer if needed and retry */
+ bl_sz = needed + 1;
+ bl = repalloc(bl, bl_sz);
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ Assert(needed + 1 <= bl_sz);
+ }
+
res = strncmp(al, bl, s);
pfree(al);
```
I do think the new implementation has some problem.
* The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not needed because you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables: neededA and neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be perform strncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA).
* Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer, so when retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens, we have no information to decide from where to resume. From this perspective, in 0004, do we really need to take the try-the-best strategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer even if dest buffer doesn’t have enough space, then my patch [1] of changing strlower_libc_sb() should be considered.
6 - 0007
```
/*
- * For efficiency reasons, in the single byte case we don't call lower()
- * on the pattern and text, but instead call SB_lower_char on each
- * character. In the multi-byte case we don't have much choice :-(. Also,
- * ICU does not support single-character case folding, so we go the long
- * way.
+ * For efficiency reasons, in the C locale we don't call lower() on the
+ * pattern and text, but instead call SB_lower_char on each character.
*/
```
SB_lower_char should be changed to C_IMatchText.
7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
```
I think the comment should be updated accordingly, like “for ILIKE in the C locale”.
8 - 0008
```
+ /* for text types, use pg_wchar; for BYTEA, use char */
if (typeid != BYTEAOID)
{
- patt = TextDatumGetCString(patt_const->constvalue);
- pattlen = strlen(patt);
+ text *val = DatumGetTextPP(patt_const->constvalue);
+ pg_wchar *wpatt;
+ pg_wchar *wmatch;
+ char *match;
+
+ patt = VARDATA_ANY(val);
+ pattlen = VARSIZE_ANY_EXHDR(val);
+ wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+ wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+ pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+ match = palloc(pattlen + 1);
```
* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?
* match is not freed, but looks like it should be:
*prefix_const = string_to_const(match, typeid);
-> string_to_datum(str, datatype);
-> CStringGetTextDatum(str);
-> cstring_to_text(s)
-> cstring_to_text_with_len(s, strlen(s));
-> *result = (text *) palloc(len + VARHDRSZ);
So, match has been copied, it should be freed.
9 - 0008
```
- }
+ /* Backslash escapes the next character */
+ if (patt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
- match[match_pos] = '\0';
+ match[match_pos++] = pos;
+ }
```
Should “pos” be “part[pos]” assigning to match[match_pos++]?
I will review the rest 3 commits tomorrow.
[1] https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Michael Paquier | 2025-11-26 02:06:10 | Re: Extended Statistics set/restore/clear functions. |
| Previous Message | Masahiko Sawada | 2025-11-26 00:29:05 | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |