| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Small patch to improve safety of utf8_to_unicode(). |
| Date: | 2026-06-25 04:10:26 |
| Message-ID: | 3058CAA1-40E0-4098-893B-615F77CD35E1@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Jun 25, 2026, at 05:57, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2026-06-24 at 16:44 +0800, Chao Li wrote:
>> There is a compile warning against pg_wchar.h in 0004:
>
> Fixed. I also used a loop in utf8decode() which is slightly smaller,
> which is good if we intend it to be inlined by a lot of callers.
>
> Regards,
> Jeff Davis
>
> <v5-0001-unicode_case.c-defend-against-invalid-UTF8.patch><v5-0002-pg_unicode_fast-fix-final-sigma-logic.patch><v5-0003-unicode_case.c-change-API-to-signal-UTF8-decoding.patch><v5-0004-Validating-iterator-friendly-UTF8-encoder-decoder.patch><v5-0005-unicode_case.c-use-new-utf8encode-utf8decode-APIs.patch>
I just reviewed v5-0001 and got one concern.
In initcap_wbnext(), the new check only verifies that the input has enough bytes:
```
if (wbstate->offset + ulen > wbstate->len)
```
What about an invalid continuation byte, for example "\xCE "? In this case, pg_utf_mblen() sees \xCE, so ulen will be 2. Since there is still one more byte, the length check won't catch the invalid continuation byte \x20, and the code will proceed to utf8_to_unicode().
Looking at utf8_to_unicode():
```
static inline char32_t
utf8_to_unicode(const unsigned char *c)
{
if ((*c & 0x80) == 0)
return (char32_t) c[0];
else if ((*c & 0xe0) == 0xc0)
return (char32_t) (((c[0] & 0x1f) << 6) |
(c[1] & 0x3f));
else if ((*c & 0xf0) == 0xe0)
return (char32_t) (((c[0] & 0x0f) << 12) |
((c[1] & 0x3f) << 6) |
(c[2] & 0x3f));
else if ((*c & 0xf8) == 0xf0)
return (char32_t) (((c[0] & 0x07) << 18) |
((c[1] & 0x3f) << 12) |
((c[2] & 0x3f) << 6) |
(c[3] & 0x3f));
else
return PG_INVALID_CODEPOINT;
}
```
For "\xCE ", it will take this branch:
```
else if ((*c & 0xe0) == 0xc0)
return (char32_t) (((c[0] & 0x1f) << 6) |
(c[1] & 0x3f));
```
This uses the second byte, \x20, without validating. So it looks like the patch prevents reading past the end of the string, but it may not fully defend against invalid UTF-8 sequences.
Am I missing anything?
(I will continue to review 0002 tomorrow.)
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Rahila Syed | 2026-06-25 04:49:36 | Re: Fix unsafe coding in ResourceOwnerReleaseAll() |
| Previous Message | Bertrand Drouvot | 2026-06-25 04:06:21 | In core use of RegisterXactCallback() and RegisterSubXactCallback() |