Re: Small patch to improve safety of utf8_to_unicode().

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2025-12-17 19:37:59
Message-ID: c4b27c82decf5b85523ad01f1df3785d8999f333.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
>
> V2 LGTM.

On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8().

Looking at the callers, I think we want to have signatures something
like:

/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
...
}

/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
...
}

That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.
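
For illustration, here is a minimal sketch of how signatures of that shape might be filled in. It is not the patch under discussion: the sketch_ prefix, the sketch_utf8_seqlen() helper, and the use of <sys/types.h> for ssize_t and <uchar.h> for char32_t are all assumptions made for this example. It performs only the length checks and does not validate continuation bytes.

```c
#include <stddef.h>
#include <sys/types.h>			/* ssize_t (POSIX); assumed available */
#include <uchar.h>				/* char32_t (C11); assumed available */

/* Hypothetical helper: sequence length implied by the lead byte, or 0. */
static inline size_t
sketch_utf8_seqlen(unsigned char lead)
{
	if (lead < 0x80)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
	return 0;					/* continuation byte or invalid lead byte */
}

/* returns number of bytes consumed, or -1 */
static inline ssize_t
sketch_utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
	size_t		len;

	if (srclen == 0)
		return -1;
	len = sketch_utf8_seqlen(src[0]);
	if (len == 0 || len > srclen)
		return -1;				/* never read past the end of the input */

	switch (len)
	{
		case 1:
			*cp = src[0];
			break;
		case 2:
			*cp = ((char32_t) (src[0] & 0x1F) << 6) |
				(char32_t) (src[1] & 0x3F);
			break;
		case 3:
			*cp = ((char32_t) (src[0] & 0x0F) << 12) |
				((char32_t) (src[1] & 0x3F) << 6) |
				(char32_t) (src[2] & 0x3F);
			break;
		default:
			*cp = ((char32_t) (src[0] & 0x07) << 18) |
				((char32_t) (src[1] & 0x3F) << 12) |
				((char32_t) (src[2] & 0x3F) << 6) |
				(char32_t) (src[3] & 0x3F);
			break;
	}
	return (ssize_t) len;
}

/* returns number of bytes written, or -1 */
static inline ssize_t
sketch_unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
	size_t		len;

	if (cp <= 0x7F)
		len = 1;
	else if (cp <= 0x7FF)
		len = 2;
	else if (cp <= 0xFFFF)
		len = 3;
	else if (cp <= 0x10FFFF)
		len = 4;
	else
		return -1;				/* beyond the Unicode code point range */

	if (len > dstsize)
		return -1;				/* never write past the end of the buffer */

	if (len == 1)
		dst[0] = (unsigned char) cp;
	else
	{
		/* lead byte markers for 2-, 3- and 4-byte sequences */
		static const unsigned char lead[5] = {0, 0, 0xC0, 0xE0, 0xF0};
		size_t		i;

		/* fill continuation bytes from the end, six bits at a time */
		for (i = len - 1; i > 0; i--)
		{
			dst[i] = (unsigned char) (0x80 | (cp & 0x3F));
			cp >>= 6;
		}
		dst[0] = (unsigned char) (lead[len] | cp);
	}
	return (ssize_t) len;
}
```

With this shape, a caller can advance through a buffer using only the return value and treat -1 as "stop or report an error", without a separate length lookup before each call.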

We could also do more validation, but of course then the callers would
need to do something if they encounter a failure. We could also try to
catch NUL terminators in the middle of a sequence, which might be
useful.
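
As a rough illustration of what "more validation" could look like, the sketch below is a stricter decoder with the same proposed signature. The name sketch_utf8_to_unicode_strict() and the particular set of checks (continuation bytes, overlong forms, surrogates, range) are assumptions for this example, not something settled in the thread. Checking that every trailing byte has the form 10xxxxxx is what rejects a NUL byte in the middle of a sequence; a standalone NUL byte still decodes as U+0000 here, and whether that should be allowed is left to the caller.

```c
#include <stddef.h>
#include <sys/types.h>			/* ssize_t (POSIX); assumed available */
#include <uchar.h>				/* char32_t (C11); assumed available */

/* returns number of bytes consumed, or -1 on any malformed input */
static inline ssize_t
sketch_utf8_to_unicode_strict(char32_t *cp, const unsigned char *src,
							  size_t srclen)
{
	/* smallest code point that legitimately needs a sequence of length N */
	static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
	size_t		len;
	size_t		i;
	char32_t	result;

	if (srclen == 0)
		return -1;

	if (src[0] < 0x80)
	{
		len = 1;
		result = src[0];
	}
	else if ((src[0] & 0xE0) == 0xC0)
	{
		len = 2;
		result = src[0] & 0x1F;
	}
	else if ((src[0] & 0xF0) == 0xE0)
	{
		len = 3;
		result = src[0] & 0x0F;
	}
	else if ((src[0] & 0xF8) == 0xF0)
	{
		len = 4;
		result = src[0] & 0x07;
	}
	else
		return -1;				/* stray continuation or invalid lead byte */

	if (len > srclen)
		return -1;				/* truncated sequence */

	for (i = 1; i < len; i++)
	{
		/* rejects NUL (and any other non-continuation byte) mid-sequence */
		if ((src[i] & 0xC0) != 0x80)
			return -1;
		result = (result << 6) | (char32_t) (src[i] & 0x3F);
	}

	if (result < min_cp[len])
		return -1;				/* overlong encoding */
	if (result > 0x10FFFF)
		return -1;				/* outside the Unicode range */
	if (result >= 0xD800 && result <= 0xDFFF)
		return -1;				/* UTF-16 surrogate, not a scalar value */

	*cp = result;
	return (ssize_t) len;
}
```

On failure the output parameter is left untouched, so a caller that gets -1 has to decide for itself whether to raise an error, skip a byte, or substitute a replacement character.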

Regards,
Jeff Davis
