Re: Small patch to improve safety of utf8_to_unicode().

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2025-12-13 23:22:28
Message-ID: 541F240E-94AD-4D65-9794-7D6C316BC3FF@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Dec 13, 2025, at 07:24, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> Attached.
>
>
> <v1-0001-Make-utf8_to_unicode-safer.patch>

This patch adds a length check to utf8_to_unicode(), I think which is where “safety” comes from. Can you please add an a little bit more to the commit message instead of only saying “improve safety”. It also deleted two redundant function declarations from pg_wchar.h, which may also worth a quick note in the commit message.

The code changes all look good to me. Only nitpicks are:

1
```
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
index 07f895ae2bf..47bd2814460 100644
--- a/contrib/fuzzystrmatch/daitch_mokotoff.c
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -401,7 +401,8 @@ read_char(const unsigned char *str, int *ix)

/* Decode UTF-8 character to ISO 10646 code point. */
str += *ix;
- c = utf8_to_unicode(str);
+ /* Assume byte sequence has not been broken. */
+ c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
```

Here we need an empty line above the new comment.

2
```
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a4bc29921de..c113cadf815 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -661,7 +661,8 @@ ucs_wcwidth(pg_wchar ucs)
static int
pg_utf_dsplen(const unsigned char *s)
{
- return ucs_wcwidth(utf8_to_unicode(s));
+ /* trust that input is not a truncated byte sequence */
+ return ucs_wcwidth(utf8_to_unicode(s, MAX_MULTIBYTE_CHAR_LEN));
}
```

For the new comment, as a code reader, I wonder why we “trust” that? To me, it more feels like because of lacking length information, we have to trust. I would like this comment to be enhanced a little bit with more information.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2025-12-14 00:28:36 Re: Remove header lock BufferGetLSNAtomic() on architectures with 64 bit atomic operations
Previous Message Tom Lane 2025-12-13 21:25:50 Re: Making jsonb_agg() faster