Re: Small patch to improve safety of utf8_to_unicode().

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2026-06-26 04:38:49
Message-ID: 16CAD4DB-9D4A-481A-BB4E-78775C4F4B2C@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jun 26, 2026, at 01:38, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Thu, 2026-06-25 at 12:10 +0800, Chao Li wrote:
>> This uses the second byte, \x20, without validating. So it looks like
>> the patch prevents reading past the end of the string, but it may not
>> fully defend against invalid UTF-8 sequences.
>
> Correct. We don't do full UTF8 validation until the last patch in the
> series, which is not being backported.
>

Sounds like 0001 will be back patched. In that case, the commit message "defend against invalid UTF8” seems too broad. Does it make sense to add some brief description about the defend behavior to the function header comment and the commit message?

Then I continue to review 0002-0005:

0002 - overall looks good. A small comment is:
```
+ for (int i = offset; i > 0;)

+ for (int i = offset + ulen; i < len;)
```

As offset is of type size_t, the loop variable i is better to be size_t.

0003 - looks good.

0004 - looks good. This commit introduces a new helper utf8decode() that will resolve my previous concern on 0001.

0005 - Mostly looks good. This commit applies the new help and my previous concern is resolved. But from what you talked, I guess 0004 and 0005 will only be pushed to HEAD.

Just one tiny comment on 0005:
```
+ /* invalid UTF8: surrogates */
+ needed = unicode_strfold(NULL, 0, "abc\xED\xA0\x81xyz", 7, &consumed, false);
+ Assert(needed == 3 && consumed == 3);
```

This test passes a 10-char string but uses 7 as srclen. I know that doesn’t affect the test result, but it just adds unnecessary confusion to readers. So maybe change 7 to 10 to reflect to the real string length.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message JoongHyuk Shin 2026-06-26 04:52:08 Re: [PATCH] Prevent repeated deadlock-check signals in standby buffer pin waits
Previous Message shveta malik 2026-06-26 04:14:42 Re: Proposal: Conflict log history table for Logical Replication