Re: Small patch to improve safety of utf8_to_unicode().

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2026-06-25 04:10:26
Message-ID: 3058CAA1-40E0-4098-893B-615F77CD35E1@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jun 25, 2026, at 05:57, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2026-06-24 at 16:44 +0800, Chao Li wrote:
>> There is a compile warning against pg_wchar.h in 0004:
>
> Fixed. I also used a loop in utf8decode() which is slightly smaller,
> which is good if we intend it to be inlined by a lot of callers.
>
> Regards,
> Jeff Davis
>
> <v5-0001-unicode_case.c-defend-against-invalid-UTF8.patch><v5-0002-pg_unicode_fast-fix-final-sigma-logic.patch><v5-0003-unicode_case.c-change-API-to-signal-UTF8-decoding.patch><v5-0004-Validating-iterator-friendly-UTF8-encoder-decoder.patch><v5-0005-unicode_case.c-use-new-utf8encode-utf8decode-APIs.patch>

I just reviewed v5-0001 and got one concern.

In initcap_wbnext(), the new check only verifies that the input has enough bytes:
```
if (wbstate->offset + ulen > wbstate->len)
```

What about an invalid continuation byte, for example "\xCE "? In this case, pg_utf_mblen() sees \xCE, so ulen will be 2. Since there is still one more byte, the length check won't catch the invalid continuation byte \x20, and the code will proceed to utf8_to_unicode().

Looking at utf8_to_unicode():
```
static inline char32_t
utf8_to_unicode(const unsigned char *c)
{
if ((*c & 0x80) == 0)
return (char32_t) c[0];
else if ((*c & 0xe0) == 0xc0)
return (char32_t) (((c[0] & 0x1f) << 6) |
(c[1] & 0x3f));
else if ((*c & 0xf0) == 0xe0)
return (char32_t) (((c[0] & 0x0f) << 12) |
((c[1] & 0x3f) << 6) |
(c[2] & 0x3f));
else if ((*c & 0xf8) == 0xf0)
return (char32_t) (((c[0] & 0x07) << 18) |
((c[1] & 0x3f) << 12) |
((c[2] & 0x3f) << 6) |
(c[3] & 0x3f));
else
return PG_INVALID_CODEPOINT;
}
```

For "\xCE ", it will take this branch:
```
else if ((*c & 0xe0) == 0xc0)
return (char32_t) (((c[0] & 0x1f) << 6) |
(c[1] & 0x3f));
```

This uses the second byte, \x20, without validating. So it looks like the patch prevents reading past the end of the string, but it may not fully defend against invalid UTF-8 sequences.

Am I missing anything?

(I will continue to review 0002 tomorrow.)

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Rahila Syed 2026-06-25 04:49:36 Re: Fix unsafe coding in ResourceOwnerReleaseAll()
Previous Message Bertrand Drouvot 2026-06-25 04:06:21 In core use of RegisterXactCallback() and RegisterSubXactCallback()