| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}') |
| Date: | 2026-05-29 11:44:26 |
| Message-ID: | CAMT0RQRzbs7sBc3eN--gF6NaVJOh5og+T3q0oCqPxrYo84f+MQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi hackers
I was loading our mailing list into a database and noticed that some
text results int substring not working.
Specifically calling substring with some specific values fails
badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION: report_invalid_encoding_int, mbutils.c:1847
Asking one byte longer substring works ok
badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi ev
(1 row)
as do other ways of getting the same 4 bytes
badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)
badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi e
(1 row)
is this expected behaviour and I just have to always noirmalize when
loading exotic UTF8 strings ?
If you want to replicate this use attached python script to load data
I tried to come up with pure SQL to load the data but that always
produced values that did not fail.
Even shortening the body by removing some "normal" characters between
the \xc2\x... sequences froduces values which do not fail
substring(txt, a, b)
I would have shared the original code to load a mailbox, but this
fails to decode that specifioc email on some machines. I did not yet
go into why that happens
| Attachment | Content-Type | Size |
|---|---|---|
| breaking_utf8.py | text/x-python | 4.2 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Shlok Kyal | 2026-05-29 11:59:11 | Re: Support EXCEPT for ALL SEQUENCES publications |
| Previous Message | Jakub Wartak | 2026-05-29 11:38:17 | Re: Heads Up: cirrus-ci is shutting down June 1st |