Re: Improve the performance of Unicode Normalization Forms.

From:	Nico Williams <nico(at)cryptonector(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve the performance of Unicode Normalization Forms.
Date:	2025-06-20 18:45:22
Message-ID:	aFWsQsTmiG44+z8P@ubby
Lists:	pgsql-hackers

On Fri, Jun 20, 2025 at 10:15:47AM -0700, Jeff Davis wrote:
> On Fri, 2025-06-20 at 11:31 -0500, Nico Williams wrote:
> > In the slow path you only normalize the _current character_, so you
> > only need enough buffer space for that.
>
> That's a clear win for UTF8 data. Also, if there are no changes, then
> you can just return the input buffer and not bother allocating an
> output buffer.

The latter is not relevant to string comparison or hashing, but, yeah,
if you have to produce a normalized string you can optimistically assume
it is already normalized and defer allocation until you know it isn't
normalized.
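
To make the deferred allocation concrete, here is a minimal Python sketch
(not PostgreSQL code). It assumes Python 3.8+ for
unicodedata.is_normalized(), and the name normalize_lazily is made up for
illustration:

```python
import unicodedata


def normalize_lazily(s: str, form: str = "NFC") -> str:
    """Return s itself when it is already normalized; build a new
    string only when the check says normalization would change it."""
    # Optimistic path: verify without producing any output.
    if unicodedata.is_normalized(form, s):
        return s  # same object handed back, nothing allocated
    # Slow path: only now pay for a normalized copy.
    return unicodedata.normalize(form, s)
```

The caller can tell whether anything changed by checking whether it got
the same object back, which is roughly the "just return the input buffer"
case.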

> Postgres is already form-preserving; it does not auto-normalize. (I
> have suggested that we might want to offer something like that, but
> that would be a user choice.)

Excellent, then I would advise looking into adding form-insensitive
string comparison and hashing to get form-insensitive/form-preserving
(f-i/f-p) behavior.
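
A rough Python sketch of what I mean, under some assumptions: it
normalizes to NFD on the fly (decomposition works one character at a
time, so the buffer only ever holds the current combining sequence), and
the names nfd_stream, fi_equal and fi_hash are hypothetical. The hash is
an FNV-1a-style mix over code points, picked only because it is short.

```python
import unicodedata
from itertools import zip_longest


def nfd_stream(s):
    """Yield the NFD code points of s, buffering one combining
    sequence (a starter plus its marks) at a time."""
    pending = []
    for ch in s:
        # Fast path: ASCII is always a starter and never decomposes.
        pieces = ch if ch < "\x80" else unicodedata.normalize("NFD", ch)
        for p in pieces:
            if unicodedata.combining(p) == 0 and pending:
                # A new starter closes the previous sequence:
                # put its marks in canonical order and emit it.
                pending.sort(key=unicodedata.combining)  # stable sort
                yield from pending
                pending = []
            pending.append(p)
    pending.sort(key=unicodedata.combining)
    yield from pending


def fi_equal(a, b):
    """Form-insensitive equality; neither input is modified or copied."""
    if a == b:  # identical code points, no normalization needed
        return True
    return all(x == y for x, y in zip_longest(nfd_stream(a), nfd_stream(b)))


def fi_hash(s):
    """Form-insensitive 64-bit hash over the normalized code points."""
    h = 0xCBF29CE484222325
    for cp in nfd_stream(s):
        h = ((h ^ ord(cp)) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h
```

The stored strings stay exactly as they were written (form-preserving),
while lookups through fi_equal() and fi_hash() treat canonically
equivalent spellings as the same key (form-insensitive).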

> Currently, the non-deterministic collations (which offer form-
> insensitivity) are not available at the database level, so you have to
> explicitly specify the COLLATE clause on a column or query. In other
> words, Postgres is not form-insensitive by default, though there is
> work to make that possible.

TIL. Thanks.

> Databases have similar concerns as a filesystem in this respect.

I figured :)

In response to

Re: Improve the performance of Unicode Normalization Forms. at 2025-06-20 17:15:47 from Jeff Davis
