Re: Small patch to improve safety of utf8_to_unicode().

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2025-12-15 23:34:41
Message-ID: C09D685C-FA5C-47E1-8EDB-B1FDD635309F@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Dec 16, 2025, at 04:23, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
>> This patch adds a length check to utf8_to_unicode(), I think which is
>> where “safety” comes from. Can you please add an a little bit more to
>> the commit message instead of only saying “improve safety”.
>
> Right: it does not read past pg_mblen(), nor past the supplied length.
>
> We could go further and check for valid continuation bytes, but the
> other routines don't do that.
>
>> It also deleted two redundant function declarations from pg_wchar.h,
>> which may also worth a quick note in the commit message.
>
> I committed that as an independent change and removed it from this
> patch.
>
>> + /* Assume byte sequence has not been broken. */
>> + c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
>> ```
>>
>> Here we need an empty line above the new comment.
>
> Done, and I expanded the comment to explain why it's safe.
>
>> pg_utf_dsplen(const unsigned char *s)
>> {
>> - return ucs_wcwidth(utf8_to_unicode(s));
>> + /* trust that input is not a truncated byte sequence */
>> + return ucs_wcwidth(utf8_to_unicode(s,
>> MAX_MULTIBYTE_CHAR_LEN));
>> }
>> ```
>>
>> For the new comment, as a code reader, I wonder why we “trust” that?
>
> We could use strlen(), but I was concerned that it might be used for
> string fragments that aren't NUL-terminated, because it's intended for
> a single character. A caller might reasonably assume that it wouldn't
> read past pg_mblen().
>
> So I changed the comment slightly to just say that it requires the
> input is a valid UTF-8 sequence. Let me know if you have another
> suggestion.
>
> Regards,
> Jeff Davis
>
>
> <v2-0001-Make-utf8_to_unicode-safer.patch>

V2 LGTM.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2025-12-15 23:39:28 Re: oid2name : add objects file path
Previous Message Tom Lane 2025-12-15 23:31:38 Re: PRI?64 vs Visual Studio (2022)