Re: Speed up ICU case conversion by using ucasemap_utf8To*()

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, zengman <zengman(at)halodbtech(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up ICU case conversion by using ucasemap_utf8To*()
Date: 2026-04-01 00:46:23
Message-ID: dcea4840-18d0-4b5f-af16-1baefc563a3d@proxel.se
Lists: pgsql-hackers

On 3/12/26 5:00 AM, Alexander Lakhin wrote:
> I've discovered that starting from c4ff35f10, the following query:
> CREATE COLLATION c (provider = icu, locale = 'icu_something');
>
> makes asan detect (maybe dubious, but still..) stack-buffer-overflow:
> ==21963==ERROR: AddressSanitizer: stack-buffer-overflow on address
> 0x7ffd386d4e63 at pc 0x650cd7972a76 bp 0x7ffd386d4e00 sp 0x7ffd386d45a8
> ...
> Address 0x7ffd386d4e63 is located in stack of thread T0 at offset 67 in
> frame
>     #0 0x650cd86962ef in foldcase_options (.../usr/local/pgsql/bin/
> postgres+0x12322ef) (BuildId: e441a9634858193e7358e5901e7948606ff5b1b1)
>
>   This frame has 2 object(s):
>     [48, 52) 'status' (line 993)
>     [64, 67) 'lang' (line 992) <== Memory access at offset 67 overflows
> this variable
>
> I use a build made with:
> CC=gcc-13 CPPFLAGS="-fsanitize=address" LDFLAGS="-fsanitize=address -static-libasan" ./configure --with-icu ...
>
> Could you please have a look?
Thanks for finding this!

Interestingly, this bug seems to have been there even before my patch,
but maybe something I did when moving code around made it possible or
easier to trigger. As far as I can tell the issue is that

uloc_getLanguage(locale, lang, 3, &status);

will populate lang with a string which is not zero terminated if the
language is 3 or more characters, e.g. "und". And for some reason which
I am not entirely sure about, strcmp("tr", {'u','n','d'}) can cause an
overflow. Maybe due to some optimization?
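
To illustrate the termination problem without depending on ICU, here is a minimal sketch. copy_fixed() is a hypothetical stand-in that only mimics the behaviour described above (it writes a NUL only when there is room for one); it is not the ICU implementation, and is_terminated() is likewise just a helper invented for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a "fill the caller's buffer" call such as
 * uloc_getLanguage(). Copies at most `capacity` bytes of `src` into `dest`
 * and appends a NUL only if there is room for it. Returns the full length
 * of `src`, whether or not it fit.
 */
size_t
copy_fixed(const char *src, char *dest, size_t capacity)
{
    size_t len = strlen(src);
    size_t n = len < capacity ? len : capacity;

    memcpy(dest, src, n);
    if (len < capacity)
        dest[len] = '\0';
    return len;
}

/* Returns nonzero if a NUL byte occurs within the first `capacity` bytes. */
int
is_terminated(const char *buf, size_t capacity)
{
    return memchr(buf, '\0', capacity) != NULL;
}
```

With a 3-byte buffer, copy_fixed("tr", lang, 3) leaves lang terminated, while copy_fixed("und", lang, 3) fills all three bytes and leaves no terminator, so any unbounded string function called on lang afterwards may read past the end of the array.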

My proposed fix is that we allocate a ULOC_LANG_CAPACITY buffer for the
language like we do in fix_icu_locale_str() instead of trying to be
clever. An alternative would be to use strncmp("tr", lang, 3) but that
seems too clever for my taste in something which is not performance
critical. A third option would be to check for
U_STRING_NOT_TERMINATED_WARNING but I think that would just be
unnecessarily convoluted code.
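
As a rough sketch of the first two options (this is not the attached patch): fill_language() is again a hypothetical stand-in for uloc_getLanguage(), LANG_CAPACITY is a local name assumed to mirror ICU's ULOC_LANG_CAPACITY, and the two is_turkish_*() helpers are invented for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Local name assumed to mirror ICU's ULOC_LANG_CAPACITY (hypothetical). */
#define LANG_CAPACITY 12

/* Hypothetical stand-in for uloc_getLanguage(): adds a NUL only if it fits. */
size_t
fill_language(const char *src, char *dest, size_t capacity)
{
    size_t len = strlen(src);
    size_t n = len < capacity ? len : capacity;

    memcpy(dest, src, n);
    if (len < capacity)
        dest[len] = '\0';
    return len;
}

/* Option 1: a roomy buffer, so ordinary language codes are terminated. */
int
is_turkish_big_buffer(const char *language)
{
    char lang[LANG_CAPACITY];

    /* Defensive guard for this sketch: treat an overlong code as no match. */
    if (fill_language(language, lang, sizeof(lang)) >= sizeof(lang))
        return 0;
    return strcmp(lang, "tr") == 0;
}

/* Option 2: keep the 3-byte buffer but bound the comparison to its size. */
int
is_turkish_bounded(const char *language)
{
    char lang[3];

    fill_language(language, lang, sizeof(lang));
    /* strncmp() examines at most sizeof(lang) bytes, so it stays in bounds. */
    return strncmp("tr", lang, sizeof(lang)) == 0;
}
```

Both helpers give the same answer for "tr" and "und". The difference is that the bounded version relies on the reader noticing why the length argument matters, which is the "too clever" concern above.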

I have attached my proposed fix.

Andreas

Attachment Content-Type Size
v1-0001-Fix-overrun-when-comparing-with-unterminated-ICU-.patch text/x-patch 1.3 KB