| From: | Hannu Krosing <hannuk(at)google(dot)com> |
|---|---|
| To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: some utf8 breaking substring(txt,1,3) but not substring(txt from '^.{4}') |
| Date: | 2026-05-29 15:34:46 |
| Message-ID: | CAMT0RQRQ-+OA+Jn+BZwar07XtPBchanynhfrT5Fe-wz=VPuYKQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
But looks like I can not reproduce it on any other computer (I though
I had verified it on vanill Ubuntu 22.04 as well), so no need to do
any more investigation until I can reproduce it elsewhere
On Fri, May 29, 2026 at 5:16 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
> Body is valid, can be selected in full and the the body can be also
> fed to other functions, including version of substring using regexes.
>
> badutf8=# select ctid, id, length(body), length(normalize(body)),
> body=normalize(body), octet_length(body),
> octet_length(normalize(body)) from bademail;
> ctid │ id │ length │ length │ ?column? │ octet_length │ octet_length
> ───────┼────┼────────┼────────┼──────────┼──────────────┼──────────────
> (0,1) │ 1 │ 2314 │ 2314 │ t │ 2323 │ 2323
> (1 row)
>
> The most confusing thing is that the byte it complains about in case
> of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be
> present in the original string at all and definitely not within the
> first few characters I ams asking to extract
>
>
> badutf8=# select ctid, id, substring(body, 1, 2) from bademail;
> ctid │ id │ substring
> ───────┼────┼───────────
> (0,1) │ 1 │ Hi
> (1 row)
>
> Time: 0.527 ms
> badutf8=# select ctid, id, substring(body, 1, 3) from bademail;
> ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc3
> LOCATION: report_invalid_encoding_int, mbutils.c:1847
> Time: 0.638 ms
> badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
> LOCATION: report_invalid_encoding_int, mbutils.c:1847
> Time: 0.555 ms
> badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
> ctid │ id │ substring
> ───────┼────┼───────────
> (0,1) │ 1 │ Hi ev
> (1 row)
>
> On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > On 29/05/2026 14:44, Hannu Krosing wrote:
> > > Hi hackers
> > >
> > > I was loading our mailing list into a database and noticed that some
> > > text results int substring not working.
> > >
> > > Specifically calling substring with some specific values fails
> > >
> > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> > > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2
> > > LOCATION: report_invalid_encoding_int, mbutils.c:1847
> > >
> > > Asking one byte longer substring works ok
> > >
> > > badutf8=# select ctid, id, substring(body, 1, 5) from bademail;
> > > ctid │ id │ substring
> > > ───────┼────┼───────────
> > > (0,1) │ 1 │ Hi ev
> > > (1 row)
> > >
> > > as do other ways of getting the same 4 bytes
> > >
> > > badutf8=# select ctid, id, substring(body from '^.{4}') from bademail;
> > > ctid │ id │ substring
> > > ───────┼────┼───────────
> > > (0,1) │ 1 │ Hi e
> > > (1 row)
> > >
> > > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from bademail;
> > > ctid │ id │ substring
> > > ───────┼────┼───────────
> > > (0,1) │ 1 │ Hi e
> > > (1 row)
> > >
> > > is this expected behaviour and I just have to always noirmalize when
> > > loading exotic UTF8 strings ?
> >
> > Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
> > able to load it into the database in the first place. If it is valid,
> > then the substring() should work.
> >
> > > If you want to replicate this use attached python script to load data
> >
> > I could not reproduce this. That substring() query after running your
> > script works fine for me.
> >
> > badutf8=# select ctid, id, substring(body, 1, 4) from bademail;
> > ctid | id | substring
> > -------+----+-----------
> > (0,1) | 1 | Hi e
> > (1 row)
> >
> > Which version did you use? What is the database's encoding and what is
> > the client encoding? I used 'master', with UTF-8 as server and client
> > encoding.
> >
> > - Heikki
> >
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amit Kapila | 2026-05-29 15:39:18 | Re: Adding REPACK [concurrently] |
| Previous Message | Lucas Jeffrey | 2026-05-29 15:32:06 | [PATCH] Fix segmentation fault caused by reentrancy in RI_Fkey_cascade_del (ri_triggers.c) |