Re: speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Vladimir Sitnikov <sitnikov(dot)vladimir(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: speed up verifying UTF-8
Date: 2021-07-30 01:12:33
Message-ID: CAFBsxsHDXCROQe-UC1nZOdcdaCO90rihiYhBYrLHrf_sLKUY=g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 26, 2021 at 8:56 AM John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
wrote:
>
> > Does that (and "len >= 32" condition) mean the patch does not improve
> > validation of the shorter strings (the ones less than 32 bytes)?
>
> Right. Also, the 32 byte threshold was just a temporary need for testing
> 32-byte stride -- testing different thresholds wouldn't hurt. I'm not
> terribly concerned about short strings, though, as long as we don't
> regress.

I put together the attached quick test to try to rationalize the fast-path
threshold. (In case it isn't obvious, it must be at least 16 on all builds,
since wchar.c doesn't know which implementation it's calling, and SSE
register width sets the lower bound.) I changed the threshold first to 16,
and then 100000, which will force using the byte-at-a-time code.
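
As a rough illustration of what the threshold controls, here is a minimal
sketch in C. It is not the patch: the names FAST_PATH_THRESHOLD,
verify_utf8, fast_chunk and ascii_chunk16 are hypothetical, and a plain
16-byte ASCII check stands in for the SSE code. The only points it tries to
show are that the fast path is reached through a function pointer and is
only attempted while at least 16 bytes remain. Unlike PostgreSQL's verifier,
this sketch accepts NUL bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: minimum remaining length for trying the fast path. */
#define FAST_PATH_THRESHOLD 16

/*
 * Byte-at-a-time check of the single UTF-8 sequence starting at s
 * (len >= 1). Returns its length in bytes, or -1 if it is invalid,
 * overlong, a surrogate, above U+10FFFF, or truncated.
 */
static int
verify_char(const unsigned char *s, size_t len)
{
    unsigned char b0 = s[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
    int           l;

    if (b0 < 0x80)
        return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        l = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        l = 3;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;      /* reject surrogates */
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        l = 4;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;      /* reject > U+10FFFF */
    }
    else
        return -1;

    if ((size_t) l > len || s[1] < lo || s[1] > hi)
        return -1;
    for (int i = 2; i < l; i++)
        if ((s[i] & 0xC0) != 0x80)
            return -1;
    return l;
}

/* Stand-in for a SIMD routine: 16 if the next 16 bytes are ASCII, else 0. */
static size_t
ascii_chunk16(const unsigned char *s)
{
    uint64_t a, b;

    memcpy(&a, s, 8);
    memcpy(&b, s + 8, 8);
    return ((a | b) & UINT64_C(0x8080808080808080)) == 0 ? 16 : 0;
}

/* The caller does not know which implementation sits behind the pointer. */
static size_t (*fast_chunk) (const unsigned char *) = ascii_chunk16;

/* Returns the number of leading bytes of s that are valid UTF-8. */
size_t
verify_utf8(const unsigned char *s, size_t len)
{
    size_t done = 0;

    assert(s != NULL || len == 0);

    while (done < len)
    {
        int l;

        /* Fast path only while a full 16-byte chunk remains. */
        if (len - done >= FAST_PATH_THRESHOLD)
        {
            size_t n = fast_chunk(s + done);

            if (n > 0)
            {
                done += n;
                continue;
            }
        }

        /* Otherwise verify one character byte-at-a-time. */
        l = verify_char(s + done, len - done);
        if (l < 0)
            break;
        done += (size_t) l;
    }
    return done;
}
```

Raising FAST_PATH_THRESHOLD to something like 100000 in this sketch has the
same effect as in the test: every input goes through verify_char only.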

If we have only 16 bytes in the input, it still seems to be faster to use
SSE, even though it's called through a function pointer on x86. I didn't
test the DFA path, but I don't think the conclusion would be different.
I'll include the 16 threshold next time I need to update the patch.

Macbook x86, clang 12:

master + use 16:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   270 |   279 |   282 |  291 |  296 |  304

force byte-at-a-time:
 asc16 | asc32 | asc64 | mb16 | mb32 | mb64
-------+-------+-------+------+------+------
   277 |   292 |   310 |  296 |  317 |  362

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment: mbverifystr-threshold.sql (application/octet-stream, 1.3 KB)
