some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')

From: Hannu Krosing <hannuk(at)google(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}')
Date: 2026-05-29 11:44:26
Message-ID: CAMT0RQRzbs7sBc3eN--gF6NaVJOh5og+T3q0oCqPxrYo84f+MQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers

I was loading our mailing list into a database and noticed that some
text results int substring not working.

Specifically calling substring with some specific values fails

badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION: report_invalid_encoding_int, mbutils.c:1847

Asking one byte longer substring works ok

badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi ev
(1 row)

as do other ways of getting the same 4 bytes

badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)

badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)

is this expected behaviour and I just have to always noirmalize when
loading exotic UTF8 strings ?

If you want to replicate this use attached python script to load data

I tried to come up with pure SQL to load the data but that always
produced values that did not fail.

Even shortening the body by removing some "normal" characters between
the \xc2\x... sequences froduces values which do not fail
substring(txt, a, b)

I would have shared the original code to load a mailbox, but this
fails to decode that specifioc email on some machines. I did not yet
go into why that happens

Attachment Content-Type Size
breaking_utf8.py text/x-python 4.2 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shlok Kyal 2026-05-29 11:59:11 Re: Support EXCEPT for ALL SEQUENCES publications
Previous Message Jakub Wartak 2026-05-29 11:38:17 Re: Heads Up: cirrus-ci is shutting down June 1st