| From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
|---|---|
| To: | Noah Misch <noah(at)leadboat(dot)com> |
| Cc: | ranvis(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
| Subject: | Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 |
| Date: | 2026-02-14 19:07:22 |
| Message-ID: | CA+hUKGKy7uqJv9HDMPjDR6O8bv9FSvdg2X5k8z3J1k6QhnJoCg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-bugs |
On Sat, Feb 14, 2026 at 9:15 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Sat, Feb 14, 2026 at 6:38 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > [mblen-valgrind-after-report-v1.patch]
>
> LGTM. The new valgrind check should clearly be after the new non-local exit.
>
> Studying the other patch...
/*
- * Total slice size in bytes can't be any
longer than the start
- * position plus substring length times the
encoding max length.
- * If that overflows, we can just use -1.
+ * Total slice size in bytes can't be any
longer than the
+ * inclusive end position times the encoding
max length. If that
+ * overflows, we can just use -1.
*/
- if (pg_mul_s32_overflow(E, eml, &slice_size))
+ if (pg_mul_s32_overflow(E - 1, eml, &slice_size))
slice_size = -1;
Isn't it still conceptually "exclusive", but adjusted to be zero-indexed?
/* Now we can get the actual length of the slice in MB
characters */
- slice_strlen = pg_mbstrlen_with_len(VARDATA_ANY(slice),
-
slice_len);
+ slice_strlen =
+ (slice_size == -1 ?
+ pg_mbstrlen_with_len(VARDATA_ANY(slice), slice_len) :
+ pg_mbcharcliplen_chars(VARDATA_ANY(slice),
slice_len, E - 1));
Comment presumably needs adjustment to say that we only count as far
as we need to, and why.
There is something a bit strange about all this, though.
pg_mbstrlen_with_len(..., -1) returns 0, so if you ask for characters
that really exist past 2^29 (~500 million), you must get an empty
string, right? That's hard to reach, pre-existing and out of scope
for the immediate problem report, except ... now we're contorting the
code even further to keep it.
The outline I had come up with before seeing your patch was: let's
just delete it. The position search can check bounds incrementally,
following our general approach. This avoids the reported problem by
ditching the pre-flight scan through the slice (up to 4x more
pg_mblen_XXX calls and memory access than we strictly need), and also
the special cases for empty strings since they already fall out of the
general behaviour (am I missing something?), not leaving much code
behind. As far as I can see so far, the only user-visible side-effect
requires corruption: substring() moves from the
internal-NUL-is-terminator category to internal-NUL-is-character
category, but that's an implementation detail.
When I saw your patch yesterday, I initially abandoned the thought,
thinking that your idea looked more conservative, but after sleeping
on it and reflecting again on these oddities, I have merged my draft
implementation with your tests, ancient detoasting fence post
observation and commit message, just to see if you think this approach
might be worth considering further.
| Attachment | Content-Type | Size |
|---|---|---|
| v2tm-0001-Fix-SUBSTRING-for-toasted-multibyte-characters.patch | application/x-patch | 13.0 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Noah Misch | 2026-02-14 19:33:44 | Re: BUG #19406: substring(text) fails on valid UTF-8 toasted value in PostgreSQL 15.16 |
| Previous Message | Andrey Borodin | 2026-02-14 17:41:43 | Re: 17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction" |