| From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
|---|---|
| To: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Small patch to improve safety of utf8_to_unicode(). |
| Date: | 2025-12-17 19:37:59 |
| Message-ID: | c4b27c82decf5b85523ad01f1df3785d8999f333.camel@j-davis.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
>
> V2 LGTM.
On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8().
Looking at the callers, I think we want to have signatures something
like:
/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
...
}
/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
...
}
That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.
We could also do more validation, but of course then the callers would
need to do something if they encounter a failure. We could also try to
catch NUL terminators in the middle of a sequence, which might be
useful.
Regards,
Jeff Davis
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Corey Huinker | 2025-12-17 19:38:27 | Re: pg_dump: Remove trivial usage of PQExpBuffer |
| Previous Message | Andres Freund | 2025-12-17 19:30:15 | Re: index prefetching |