Re: Small patch to improve safety of utf8_to_unicode().

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Small patch to improve safety of utf8_to_unicode().
Date: 2026-06-19 23:22:08
Message-ID: fbcb039a9ab3ba834f34174915254732fdcfae86.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2025-12-17 at 11:37 -0800, Jeff Davis wrote:
> On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > > <v2-0001-Make-utf8_to_unicode-safer.patch>
> >
> > V2 LGTM.
>
> On second thought, if we're going to change something here, we should
> probably have a more flexible API for both utf8_to_unicode() and
> unicode_to_utf8().

New series:

0001: Validates UTF8 before calling into unicode_case.c. This is extra
defense and simple to backport, but it regresses performance of the
functions that call into it. It might also raise errors where none were
raised before, if invalid UTF8 is somehow present.

0002: Refactors to create an error path from unicode_case.c into
pg_locale_builtin.c, where a proper error can be thrown. This wins back
the performance lost in 0001. It is perhaps backportable, but
technically it changes an exported function signature, so it carries
some very low risk.

0003: Adds utf8encode() and utf8decode(), which are iteration-friendly
and inlinable, and which fully validate UTF8 (e.g. they reject
surrogate halves). This is an enhancement, so it should not be
backported.

0004: Makes unicode_case.c use the new API.
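
To illustrate what "iteration-friendly" and "fully validating" could
mean for 0003, here is a minimal sketch. It is not the code from the
patch: the actual signatures of utf8encode() and utf8decode() are not
shown in this message, so the names sketch_utf8decode(),
sketch_utf8encode() and sketch_count_codepoints(), their parameters,
and the "return 0 on invalid input" convention are all assumptions made
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch (not the patch's API): decode one code point from at
 * most len bytes at s. Returns the number of bytes consumed (1-4) and stores
 * the code point in *cp, or returns 0 if the input is invalid or truncated.
 */
static inline int
sketch_utf8decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c, min;
    int n, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len < (size_t) n)
        return 0;               /* truncated sequence */
    for (i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min)
        return 0;               /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;               /* surrogate half */
    if (c > 0x10FFFF)
        return 0;               /* beyond the Unicode range */
    *cp = c;
    return n;
}

/*
 * Hypothetical sketch: encode cp into out (capacity outlen). Returns the
 * number of bytes written, or 0 if cp is not encodable or out is too small.
 */
static inline int
sketch_utf8encode(uint32_t cp, unsigned char *out, size_t outlen)
{
    int n, i;

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    n = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
    if (outlen < (size_t) n)
        return 0;
    if (n == 1)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    /* Fill continuation bytes from the end, then the lead byte. */
    for (i = n - 1; i > 0; i--)
    {
        out[i] = (unsigned char) (0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char) ((0xF0 << (4 - n)) | cp);
    return n;
}

/* Iteration example: count code points, or return -1 on invalid input. */
static long
sketch_count_codepoints(const unsigned char *s, size_t len)
{
    long count = 0;

    while (len > 0)
    {
        uint32_t cp;
        int n = sketch_utf8decode(s, len, &cp);

        if (n == 0)
            return -1;
        s += n;
        len -= (size_t) n;
        count++;
    }
    return count;
}
```

Returning the consumed length lets a caller advance through a buffer
without a separate length lookup, and a zero return gives the caller a
place to raise a proper error, which is roughly the shape of error path
that 0002 describes.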

Regards,
Jeff Davis

Attachment                                                        Content-Type  Size
v3-0001-unicode_case.c-ensure-valid-UTF8.patch                    text/x-patch  1.8 KB
v3-0002-Move-UTF8-checks-into-unicode_case.c.patch                text/x-patch  15.0 KB
v3-0003-Validating-iterator-friendly-UTF8-encoder-decoder.patch   text/x-patch  5.3 KB
v3-0004-unicode_case.c-use-new-utf8encode-utf8decode-APIs.patch   text/x-patch  6.4 KB
