| From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
|---|---|
| To: | Hannu Krosing <hannuk(at)google(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}') |
| Date: | 2026-05-29 13:29:46 |
| Message-ID: | 63b5f986-d543-42e2-9ec5-b1fb44e16c00@iki.fi |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On 29/05/2026 14:44, Hannu Krosing wrote:
> Hi hackers
>
> I was loading our mailing list into a database and noticed that some
> text results int substring not working.
>
> Specifically calling substring with some specific values fails
>
> badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
> LOCATION: report_invalid_encoding_int, mbutils.c:1847
>
> Asking one byte longer substring works ok
>
> badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
> ctid │ id │ substring
> ───────┼────┼───────────
> (0,1) │ 1 │ Hi ev
> (1 row)
>
> as do other ways of getting the same 4 bytes
>
> badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
> ctid │ id │ substring
> ───────┼────┼───────────
> (0,1) │ 1 │ Hi e
> (1 row)
>
> badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
> ctid │ id │ substring
> ───────┼────┼───────────
> (0,1) │ 1 │ Hi e
> (1 row)
>
> is this expected behaviour and I just have to always noirmalize when
> loading exotic UTF8 strings ?
Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
able to load it into the database in the first place. If it is valid,
then the substring() should work.
> If you want to replicate this use attached python script to load data
I could not reproduce this. That substring() query after running your
script works fine for me.
badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ctid | id | substring
-------+----+-----------
(0,1) | 1 | Hi e
(1 row)
Which version did you use? What is the database's encoding and what is
the client encoding? I used 'master', with UTF-8 as server and client
encoding.
- Heikki
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Fujii Masao | 2026-05-29 13:31:31 | Re: Fix race in ReplicationSlotRelease for ephemeral slots |
| Previous Message | Jakub Wartak | 2026-05-29 13:22:50 | Re: Heads Up: cirrus-ci is shutting down June 1st |