Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: ranvis(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Date: 2026-02-14 19:07:22
Message-ID: CA+hUKGKy7uqJv9HDMPjDR6O8bv9FSvdg2X5k8z3J1k6QhnJoCg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > [mblen-valgrind-after-report-v1.patch]
>
> LGTM. The new valgrind check should clearly be after the new non-local exit.
>
> Studying the other patch...

/*
- * Total slice size in bytes can't be any
longer than the start
- * position plus substring length times the
encoding max length.
- * If that overflows, we can just use -1.
+ * Total slice size in bytes can't be any
longer than the
+ * inclusive end position times the encoding
max length. If that
+ * overflows, we can just use -1.
*/
- if (pg_mul_s32_overflow(E, eml, &slice_size))
+ if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
slice_size = -1;

Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?

/* Now we can get the actual length of the slice in MB
characters */
- slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-
slice_len);
+ slice_strlen =
+ (slice_size == -1 ?
+ pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+ pg_mbcharcliplen_chars(VARDATA_ANY(slice),
slice_len, E - 1));

Comment presumably needs adjustment to say that we only count as far
as we need to, and why.

There is something a bit strange about all this, though.
pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
that really exist past 2^29 (~500 million), you must get an empty
string, right? That's hard to reach, pre-existing and out of scope
for the immediate problem report, except ... now we're contorting the
code even further to keep it.

The outline I had come up with before seeing your patch was: let's
just delete it. The position search can check bounds incrementally,
following our general approach. This avoids the reported problem by
ditching the pre-flight scan through the slice (up to 4x more
pg_mblen_XXX calls and memory access than we strictly need), and also
the special cases for empty strings since they already fall out of the
general behaviour (am I missing something?), not leaving much code
behind. As far as I can see so far, the only user-visible side-effect
requires corruption: substring() moves from the
internal-NUL-is-terminator category to internal-NUL-is-character
category, but that's an implementation detail.

When I saw your patch yesterday, I initially abandoned the thought,
thinking that your idea looked more conservative, but after sleeping
on it and reflecting again on these oddities, I have merged my draft
implementation with your tests, ancient detoasting fence post
observation and commit message, just to see if you think this approach
might be worth considering further.

Attachment Content-Type Size
v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch application/x-patch 13.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Noah Misch 2026-02-14 19:33:44 Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16
Previous Message Andrey Borodin 2026-02-14 17:41:43 Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"