| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}') |
| Date: | 2026-05-29 15:16:13 |
| Message-ID: | CAMT0RQQe4s=OsFWKsnZjbmP-N2_PxGzvs8vQXJcwapmZ022JRw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Body is valid, can be selected in full and the the body can be also
fed to other functions, including version of substring using regexes.
badutf8=# select ctid, id, length(body), length(normalize(body)),
body=normalize(body), octet_length(body),
octet_length(normalize(body)) from bademail;
ctid │ id │ length │ length │ ?column? │ octet_length │ octet_length
───────┼────┼────────┼────────┼──────────┼──────────────┼──────────────
(0,1) │ 1 │ 2314 │ 2314 │ t │ 2323 │ 2323
(1 row)
The most confusing thing is that the byte it complains about in case
of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be
present in the original string at all and definitely not within the
first few characters I ams asking to extract
badutf8=# select ctid, id, substring(body, 1, 2) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi
(1 row)
Time: 0.527 ms
badutf8=# select ctid, id, substring(body, 1, 3) from bademail;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc3
LOCATION: report_invalid_encoding_int, mbutils.c:1847
Time: 0.638 ms
badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION: report_invalid_encoding_int, mbutils.c:1847
Time: 0.555 ms
badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
ctid │ id │ substring
───────┼────┼───────────
(0,1) │ 1 │ Hi ev
(1 row)
On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 29/05/2026 14:44, Hannu Krosing wrote:
> > Hi hackers
> >
> > I was loading our mailing list into a database and noticed that some
> > text results int substring not working.
> >
> > Specifically calling substring with some specific values fails
> >
> > badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
> > LOCATION: report_invalid_encoding_int, mbutils.c:1847
> >
> > Asking one byte longer substring works ok
> >
> > badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
> > ctid │ id │ substring
> > ───────┼────┼───────────
> > (0,1) │ 1 │ Hi ev
> > (1 row)
> >
> > as do other ways of getting the same 4 bytes
> >
> > badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
> > ctid │ id │ substring
> > ───────┼────┼───────────
> > (0,1) │ 1 │ Hi e
> > (1 row)
> >
> > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
> > ctid │ id │ substring
> > ───────┼────┼───────────
> > (0,1) │ 1 │ Hi e
> > (1 row)
> >
> > is this expected behaviour and I just have to always noirmalize when
> > loading exotic UTF8 strings ?
>
> Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
> able to load it into the database in the first place. If it is valid,
> then the substring() should work.
>
> > If you want to replicate this use attached python script to load data
>
> I could not reproduce this. That substring() query after running your
> script works fine for me.
>
> badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> ctid | id | substring
> -------+----+-----------
> (0,1) | 1 | Hi e
> (1 row)
>
> Which version did you use? What is the database's encoding and what is
> the client encoding? I used 'master', with UTF-8 as server and client
> encoding.
>
> - Heikki
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Lucas Jeffrey | 2026-05-29 15:32:06 | [PATCH] Fix segmentation fault caused by reentrancy in RI_Fkey_cascade_del (ri_triggers.c) |
| Previous Message | Nathan Bossart | 2026-05-29 15:14:29 | Re: future of PQfn() |