Re: Small patch to improve safety of utf8_to_unicode().

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Small patch to improve safety of utf8_to_unicode().
Date:	2025-12-17 19:37:59
Message-ID:	c4b27c82decf5b85523ad01f1df3785d8999f333.camel@j-davis.com
Lists:	pgsql-hackers

On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
>
> V2 LGTM.

On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8().

Looking at the callers, I think we want to have signatures something
like:

/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
...
}

/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
...
}

That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.
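
To make the proposal concrete, here is a minimal sketch of how the two bodies might look with those signatures. It is only an illustration, not the patch: it assumes char32_t comes from <uchar.h> and ssize_t from <sys/types.h>, it performs bounds checks and lead-byte classification only, and it does not validate continuation bytes, overlong forms, or surrogates.

```c
#include <stddef.h>
#include <sys/types.h>			/* ssize_t (assumed available) */
#include <uchar.h>				/* char32_t (assumed available) */

/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
	size_t		len;

	if (srclen == 0)
		return -1;

	/* Derive the sequence length from the lead byte. */
	if ((src[0] & 0x80) == 0)
		len = 1;
	else if ((src[0] & 0xe0) == 0xc0)
		len = 2;
	else if ((src[0] & 0xf0) == 0xe0)
		len = 3;
	else if ((src[0] & 0xf8) == 0xf0)
		len = 4;
	else
		return -1;				/* not a valid lead byte */

	if (srclen < len)
		return -1;				/* sequence would run past the input */

	switch (len)
	{
		case 1:
			*cp = src[0];
			break;
		case 2:
			*cp = ((char32_t) (src[0] & 0x1f) << 6) |
				(char32_t) (src[1] & 0x3f);
			break;
		case 3:
			*cp = ((char32_t) (src[0] & 0x0f) << 12) |
				((char32_t) (src[1] & 0x3f) << 6) |
				(char32_t) (src[2] & 0x3f);
			break;
		default:
			*cp = ((char32_t) (src[0] & 0x07) << 18) |
				((char32_t) (src[1] & 0x3f) << 12) |
				((char32_t) (src[2] & 0x3f) << 6) |
				(char32_t) (src[3] & 0x3f);
			break;
	}
	return (ssize_t) len;
}

/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
	size_t		len;

	if (cp <= 0x7f)
		len = 1;
	else if (cp <= 0x7ff)
		len = 2;
	else if (cp <= 0xffff)
		len = 3;
	else if (cp <= 0x10ffff)
		len = 4;
	else
		return -1;				/* beyond the Unicode code space */

	if (dstsize < len)
		return -1;				/* destination too small */

	switch (len)
	{
		case 1:
			dst[0] = (unsigned char) cp;
			break;
		case 2:
			dst[0] = (unsigned char) (0xc0 | (cp >> 6));
			dst[1] = (unsigned char) (0x80 | (cp & 0x3f));
			break;
		case 3:
			dst[0] = (unsigned char) (0xe0 | (cp >> 12));
			dst[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
			dst[2] = (unsigned char) (0x80 | (cp & 0x3f));
			break;
		default:
			dst[0] = (unsigned char) (0xf0 | (cp >> 18));
			dst[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
			dst[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
			dst[3] = (unsigned char) (0x80 | (cp & 0x3f));
			break;
	}
	return (ssize_t) len;
}
```

With this shape, a caller can advance its pointer by the return value and treat -1 as "stop or report an error", without computing the length up front.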

We could also do more validation, but of course then the callers would
need to do something if they encounter a failure. We could also try to
catch NUL terminators in the middle of a sequence, which might be
useful.
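
As one possible form of that extra validation, a check that every trailing byte is a continuation byte (10xxxxxx) would also catch a NUL in the middle of a sequence, since 0x00 does not match that pattern. The helper below is hypothetical (the name utf8_continuation_ok is made up for illustration) and does not attempt to detect overlong encodings or surrogates.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if bytes src[1] .. src[len - 1] are all
 * UTF-8 continuation bytes.  The caller is assumed to have already derived
 * len from the lead byte and confirmed that len bytes are readable.
 */
static inline bool
utf8_continuation_ok(const unsigned char *src, size_t len)
{
	for (size_t i = 1; i < len; i++)
	{
		/* A NUL byte (0x00) fails this test, as does any non-10xxxxxx byte. */
		if ((src[i] & 0xc0) != 0x80)
			return false;
	}
	return true;
}
```

A decoder could call this after its length check and return -1 on failure, which leaves the "what should the caller do" question to each call site.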

Regards,
Jeff Davis

In response to

Re: Small patch to improve safety of utf8_to_unicode(). at 2025-12-15 23:34:41 from Chao Li
