Re: Small patch to improve safety of utf8_to_unicode().

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Small patch to improve safety of utf8_to_unicode().
Date:	2026-06-25 17:38:40
Message-ID:	0cedb517aee8b79b8e31a4e42885ed88c2a67c5f.camel@j-davis.com
Lists:	pgsql-hackers

On Thu, 2026-06-25 at 12:10 +0800, Chao Li wrote:
> This uses the second byte, \x20, without validating. So it looks like
> the patch prevents reading past the end of the string, but it may not
> fully defend against invalid UTF-8 sequences.

Correct. We don't do full UTF8 validation until the last patch in the
series, which is not being backported.

Trying to do full validation in the backbranches seems more likely to
cause problems than prevent them. We aren't expecting invalid UTF8, but
in the event it got there somehow (perhaps from an old upgraded
instance), throwing errors after a minor release is probably not
helpful.

Even in master, I am not 100% sure we want to detect other kinds of
validation errors while processing the UTF8. By the time we are using
the value, maybe truncated multibyte sequences are the only thing we
care about, and we just need to be sure the code can handle anything
that fits in a char32_t.
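
To make the distinction concrete, here is a minimal sketch of a decoder that guards only against truncation. It is an illustration, not the patch and not PostgreSQL's utf8_to_unicode(); the name decode_utf8_bounded and its signature are hypothetical. It derives the sequence length from the lead byte, refuses to read past the supplied length, and masks continuation bytes without validating them. Bytes that cannot start a sequence are rejected only because no length can be derived from them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 *
 * Decode one code point from at most `len` bytes at `s`. Returns false if
 * the lead byte does not encode a length or if the sequence is truncated.
 * Continuation bytes are masked, NOT validated, so invalid UTF-8 can still
 * yield a value; the only guarantee is that no byte past s[len - 1] is read.
 */
static bool
decode_utf8_bounded(const unsigned char *s, size_t len,
                    uint32_t *cp, size_t *consumed)
{
    size_t   need;
    uint32_t value;

    if (len == 0)
        return false;

    if (s[0] < 0x80)
    {
        need = 1;
        value = s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        need = 2;
        value = s[0] & 0x1F;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        need = 3;
        value = s[0] & 0x0F;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        need = 4;
        value = s[0] & 0x07;
    }
    else
        return false;           /* no sequence length can be derived */

    if (need > len)
        return false;           /* truncated multibyte sequence */

    /* Fold in the low six bits of each following byte, unchecked. */
    for (size_t i = 1; i < need; i++)
        value = (value << 6) | (s[i] & 0x3F);

    *cp = value;
    *consumed = need;
    return true;
}
```

With this sketch, the lone byte \xC3 with length 1 is rejected as truncated, while \xC3\x20 with length 2 is accepted and yields 0xE0 even though \x20 is not a continuation byte. The result always fits in 21 bits, so the caller only needs to cope with arbitrary values in that range.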

Another thing to consider is an embedded NUL character, which is valid
UTF8 but not valid in a TEXT value.
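
As a small illustration of why the NUL case is separate from encoding validity: U+0000 is encoded as the single byte 0x00, so a validator that only checks UTF-8 well-formedness would accept it. A length-counted buffer needs its own check. The helper name has_embedded_nul below is hypothetical and is not taken from the patch series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only.
 *
 * Report whether a length-counted buffer contains a zero byte anywhere in
 * its first `len` bytes. Such a byte is well-formed UTF-8 (U+0000) but
 * would end the string early for any code that treats it as a C string.
 */
static bool
has_embedded_nul(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

For the five-byte buffer "ab\0cd", this returns true, whereas strlen() on the same buffer reports 2 and silently ignores the rest.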

Regards,
Jeff Davis

In response to

Re: Small patch to improve safety of utf8_to_unicode(). at 2026-06-25 04:10:26 from Chao Li

Responses

Re: Small patch to improve safety of utf8_to_unicode(). at 2026-06-26 04:38:49 from Chao Li
