Re: dict_synonym.c: fix truncation of multibyte sequence

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tristan Partin <tristan(at)partin(dot)io>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dict_synonym.c: fix truncation of multibyte sequence
Date: 2026-06-05 17:37:03
Message-ID: 8cf296c265a367e08bf221781c4ba6c3f3726fda.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2026-06-05 at 15:57 +0000, Tristan Partin wrote:
> > In any case, the input comes from a trusted
> > source (dictionary configuration), so it's not very serious.
>
> The fix looks and sounds good. Do we have any way to test this, so it
> doesn't regress in the future?

-- 'Ⱥ' (U+023A) is 2 bytes in UTF-8; its lowercase 'ⱥ' (U+2C65) is 3 bytes
$ echo "foo barȺ" > /path/to/postgres/share/tsearch_data/mbtest.syn

CREATE TEXT SEARCH DICTIONARY mb_syn (
    TEMPLATE = synonym,
    SYNONYMS = mbtest);

SELECT ts_lexize('mb_syn', 'foo');

=# SELECT ts_lexize('mb_syn', 'foo'); -- before patch
ts_lexize
-----------
{bar}
(1 row)

=# SELECT ts_lexize('mb_syn', 'foo'); -- after patch
ts_lexize
-----------
{barⱥ}
(1 row)

It requires a specially crafted synonym file, and I'm not sure it's
worth much effort to add a test for this specific path. If we see
similar bugs, it's more likely to be somewhere else that makes the same
faulty assumption.
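
For illustration only, here is a toy model of that kind of assumption: that
a case-folded string occupies the same number of bytes as the original. It
is not the dict_synonym.c code. The helper names (copy_n_bytes,
utf8_seq_len, has_incomplete_sequence) are made up for this sketch, and the
byte strings are hard-coded UTF-8 rather than produced by a real
lowercasing routine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hard-coded UTF-8 bytes, so the source file's own encoding is irrelevant. */
static const char ORIGINAL[] = "bar\xC8\xBA";     /* "barȺ": 3 + 2 = 5 bytes */
static const char LOWERED[]  = "bar\xE2\xB1\xA5"; /* "barⱥ": 3 + 3 = 6 bytes */

/* Copy the first len bytes of src into a fresh NUL-terminated buffer. */
static char *
copy_n_bytes(const char *src, size_t len)
{
    char *dst;

    assert(len <= strlen(src));
    dst = malloc(len + 1);
    if (dst == NULL)
        abort();
    memcpy(dst, src, len);
    dst[len] = '\0';
    return dst;
}

/* Expected sequence length for a UTF-8 lead byte; 0 if not a lead byte. */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Return 1 if s contains a sequence that runs past the end of the string
 * or starts with a byte that is not a valid lead byte, else 0. This is a
 * length check only, not a full UTF-8 validator.
 */
static int
has_incomplete_sequence(const char *s)
{
    size_t len = strlen(s);
    size_t i = 0;

    while (i < len)
    {
        int n = utf8_seq_len((unsigned char) s[i]);

        if (n == 0 || i + (size_t) n > len)
            return 1;
        i += (size_t) n;
    }
    return 0;
}
```

Copying LOWERED with strlen(ORIGINAL) as the length keeps only 5 of its 6
bytes, so the result ends partway through the 3-byte sequence and
has_incomplete_sequence() reports it. Copying with strlen(LOWERED) keeps
the whole string.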

If you do think we should add tests, we should probably add a set of
dictionary-related files (.syn, .dict, .ths, etc.) that contain a
variety of adversarial Unicode cases.

I'd be inclined to just commit this fix for now. It needs backpatching,
and I don't think we want to backpatch a large set of tests with it.

Regards,
Jeff Davis
