Re: speed up verifying UTF-8

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speed up verifying UTF-8
Date: 2021-06-09 11:02:02
Message-ID: e7729297-53e8-6e17-7334-7227043ce716@iki.fi
Lists: pgsql-hackers

On 07/06/2021 15:39, John Naylor wrote:
> On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > On 03/06/2021 21:58, John Naylor wrote:
> > > The microbenchmark is the same one you attached to [1], which I extended
> > > with a 95% multibyte case.
> >
> > Could you share the exact test you're using? I'd like to test this on my
> > old raspberry pi, out of curiosity.
>
> Sure, attached.
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com
>
Results from chipmunk, my first generation Raspberry Pi:

Master:

 chinese | mixed | ascii
---------+-------+-------
   25392 | 16287 | 10295
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii
---------+-------+-------
   17739 | 10854 |  4121
(1 row)

So that's good.

What is the worst-case scenario for this algorithm? It would be an input
where the new fast ASCII check never helps, but which the old code handles
as fast as possible. For that, I added a repeating pattern of
'123456789012345ä' to the test set (these results are from my Intel laptop,
not the Raspberry Pi):
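
To illustrate why that pattern is a worst case, here is a rough sketch (not
part of the benchmark). It assumes the fast path inspects 16 bytes at a time,
and that 'ä' is encoded as the two UTF-8 bytes 0xC3 0xA4, which makes the
repeating unit 17 bytes long. The names fill_pattern and count_ascii_windows
are made up for this example. It counts the 16-byte windows, at any offset,
that are pure ASCII:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 16				/* assumed width of the fast ASCII check */

/* Fill buf with whole copies of the 17-byte unit; returns bytes written. */
static size_t
fill_pattern(unsigned char *buf, size_t cap)
{
	/* 15 ASCII digits followed by U+00E4 (ä) as UTF-8 */
	static const char unit[] = "123456789012345\xc3\xa4";
	const size_t unit_len = sizeof(unit) - 1;
	size_t		n = 0;

	assert(buf != NULL);
	while (n + unit_len <= cap)
	{
		memcpy(buf + n, unit, unit_len);
		n += unit_len;
	}
	return n;
}

/* Count the start offsets whose next WINDOW bytes all have the high bit clear. */
static size_t
count_ascii_windows(const unsigned char *buf, size_t len)
{
	size_t		count = 0;

	for (size_t start = 0; start + WINDOW <= len; start++)
	{
		int			all_ascii = 1;

		for (size_t i = 0; i < WINDOW; i++)
		{
			if (buf[start + i] & 0x80)
			{
				all_ascii = 0;
				break;
			}
		}
		count += all_ascii;
	}
	return count;
}
```

Any 16-byte window leaves out only one byte of each 17-byte period, so it
always contains at least one of the two non-ASCII bytes. Whatever offset the
verifier resumes from, a 16-byte ASCII check on this input cannot succeed.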

Master:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
    1333 |   757 |   410 |    573
(1 row)

v11-0001-Rewrite-pg_utf8_verifystr-for-speed.patch:

 chinese | mixed | ascii | mixed2
---------+-------+-------+--------
     942 |   470 |    66 |   1249
(1 row)

So there's a regression with that input: mixed2 goes from 573 to 1249,
roughly 2.2x slower. Maybe that's acceptable; this is the worst case, after
all. Or you could tweak check_ascii for a different performance tradeoff, by
checking the two 64-bit words separately and returning "8" if the failure
happens in the second word. I haven't tried the SSE patch yet; maybe that
compensates for this.
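
To make the check_ascii idea concrete, here is a rough sketch, not the code
from the patch. The names chunk_is_ascii and check_ascii_partial are
hypothetical, and I'm assuming the check should also treat a zero byte as a
failure. The function returns how many leading bytes were verified as ASCII:
16, 8, or 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if all 8 bytes at s are in the range 0x01..0x7F. */
static inline int
chunk_is_ascii(const unsigned char *s)
{
	const uint64_t high_bits = UINT64_C(0x8080808080808080);
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));	/* avoids unaligned access */

	/* Any byte with the high bit set is not ASCII. */
	if (chunk & high_bits)
		return 0;

	/*
	 * All bytes are now <= 0x7F, so adding 0x7F to each one cannot carry
	 * into the next byte. The sum has its high bit set in every byte
	 * unless that byte was zero.
	 */
	return ((chunk + UINT64_C(0x7f7f7f7f7f7f7f7f)) & high_bits) == high_bits;
}

/* Check the two 64-bit words separately; report partial success as 8. */
static inline int
check_ascii_partial(const unsigned char *s, int len)
{
	assert(s != NULL);

	if (len < 16)
		return 0;				/* too short for the fast path */
	if (!chunk_is_ascii(s))
		return 0;				/* failure in the first word */
	if (!chunk_is_ascii(s + 8))
		return 8;				/* first word was fine, second was not */
	return 16;
}
```

With the '123456789012345ä' input, a check starting at the beginning of the
pattern would return 8 instead of 0, so the caller could skip those eight
bytes before falling back to the slow path. Whether that makes up for the
extra branch is something the benchmark would have to show.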

- Heikki
