| From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Small patch to improve safety of utf8_to_unicode(). |
| Date: | 2026-06-25 17:38:40 |
| Message-ID: | 0cedb517aee8b79b8e31a4e42885ed88c2a67c5f.camel@j-davis.com |
| Lists: | pgsql-hackers |
On Thu, 2026-06-25 at 12:10 +0800, Chao Li wrote:
> This uses the second byte, \x20, without validating. So it looks like
> the patch prevents reading past the end of the string, but it may not
> fully defend against invalid UTF-8 sequences.
Correct. We don't do full UTF8 validation until the last patch in the
series, which is not being backported.
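To illustrate the distinction, here is a minimal sketch of a length-bounded decoder. It is not the code from the patch or PostgreSQL's utf8_to_unicode(); the names `sketch_utf8_seq_len` and `sketch_utf8_decode_bounded` are hypothetical. It refuses to read past the end of the buffer, but it does not check that continuation bytes have the form 10xxxxxx, so a sequence like \xC3\x20 still decodes to some value rather than being rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence length implied by the lead byte, 0 if invalid. */
static int
sketch_utf8_seq_len(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    if ((first & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Hypothetical sketch: decode one code point from s, reading at most len
 * bytes. Returns the number of bytes consumed, or 0 if the lead byte is
 * invalid or the sequence is truncated. Continuation bytes are deliberately
 * NOT validated, so this is a bounds check only, not UTF-8 validation.
 */
static size_t
sketch_utf8_decode_bounded(const unsigned char *s, size_t len, uint32_t *out)
{
    int n;

    if (len == 0)
        return 0;
    n = sketch_utf8_seq_len(s[0]);
    if (n == 0 || (size_t) n > len)
        return 0;               /* invalid lead byte or truncated sequence */

    switch (n)
    {
        case 1:
            *out = s[0];
            break;
        case 2:
            *out = ((uint32_t) (s[0] & 0x1F) << 6) |
                   (uint32_t) (s[1] & 0x3F);
            break;
        case 3:
            *out = ((uint32_t) (s[0] & 0x0F) << 12) |
                   ((uint32_t) (s[1] & 0x3F) << 6) |
                   (uint32_t) (s[2] & 0x3F);
            break;
        default:
            *out = ((uint32_t) (s[0] & 0x07) << 18) |
                   ((uint32_t) (s[1] & 0x3F) << 12) |
                   ((uint32_t) (s[2] & 0x3F) << 6) |
                   (uint32_t) (s[3] & 0x3F);
            break;
    }
    return (size_t) n;
}
```

With this sketch, a lone \xC3 at the end of the buffer is reported as truncated, while \xC3\x20 is consumed as a two-byte sequence even though \x20 is not a continuation byte.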
Trying to do full validation in the back branches seems more likely to
cause problems than prevent them. We aren't expecting invalid UTF8, but
in the event it got there somehow (perhaps from an old upgraded
instance), throwing errors after a minor release is probably not
helpful.
Even in master, I am not 100% sure we want to detect other kinds of
validation errors while processing the UTF8. By the time we are using
the value, maybe truncated multibyte sequences are the only thing we
care about, and we just need to be sure the code can handle anything
that fits in a char32_t.
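As a rough sketch of what "handle anything that fits in a char32_t" could look like on the consuming side (hypothetical names, with uint32_t standing in for char32_t, and an ASCII-only case fold standing in for whatever the real caller does): values outside the Unicode scalar range are passed through unchanged rather than raising an error or being used as a table index.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true if cp is a Unicode scalar value. */
static bool
sketch_is_unicode_scalar(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;           /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;           /* UTF-16 surrogate range */
    return true;
}

/*
 * Hypothetical consumer: lowercases ASCII letters and returns every other
 * input unchanged, including values that are not valid code points. No
 * 32-bit input can cause an out-of-range access or an error.
 */
static uint32_t
sketch_fold_ascii(uint32_t cp)
{
    if (!sketch_is_unicode_scalar(cp))
        return cp;
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    return cp;
}
```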
Another thing to consider is an embedded NUL character, which is valid
UTF8 but not valid in a TEXT value.
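For a length-counted buffer, an embedded NUL can be detected independently of any UTF-8 decoding. A minimal sketch, assuming the caller knows the byte length (the function name is hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: report whether the first len bytes of buf contain
 * a NUL byte. In UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
 * byte search is sufficient.
 */
static bool
sketch_has_embedded_nul(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```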
Regards,
Jeff Davis