Re: [POC] verifying UTF-8 using SIMD instructions

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-20 21:10:58
Message-ID: CAFBsxsFgKt3ktbnghM_5LyTXEov5+XNx5cJ+E6AbL+3Rh-XKcw@mail.gmail.com
Lists: pgsql-hackers

I made some substantial improvements in v5 and have taken care of all the
TODOs quoted below. I split the ASCII fast path for non-UTF-8 encodings into
a separate patch, since it's somewhat off-topic and it's not yet clear that
it's always the best thing to do.

> - It takes almost no recognizable code from simdjson, but it does take
> the magic constants lookup tables almost verbatim. The main body of the
> code has no intrinsics at all (I think). They're all hidden inside static
> inline helper functions. I reused some cryptic variable names from
> simdjson. It's a bit messy but not terrible.

In v5, the lookup tables and their comments are cleaned up and modified to
play nice with pgindent.

> - It diffs against the noError conversion patch and adds additional tests.

I wanted to get some cfbot testing, so I went ahead and prepended v4 of
Heikki's noError patch so it would apply against master.

> - There is no ascii fast-path yet. With this algorithm we have to be a
> bit more careful since a valid ascii chunk could be preceded by an
> incomplete sequence at the end of the previous chunk. Not too hard, just a
> bit more work.

v5 adds an ascii fast path.
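
To illustrate the caveat quoted above, here is a minimal scalar sketch in C. It
is not code from the patch: the function names, the 16-byte chunk size, and the
use of plain integer operations instead of SSE intrinsics are all assumptions
made for illustration. The idea is that an all-ASCII chunk can only be skipped
if the previous chunk did not end in the middle of a multibyte sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 16			/* assumed chunk width, one SSE register */

/* True if no byte in the 16-byte chunk has its high bit set. */
static inline bool
chunk_is_ascii(const unsigned char *s)
{
	uint64_t	lo;
	uint64_t	hi;

	memcpy(&lo, s, 8);
	memcpy(&hi, s + 8, 8);
	return ((lo | hi) & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * True if the chunk's last bytes start a multibyte sequence that cannot
 * finish inside the chunk: a 2-, 3- or 4-byte lead in the last position,
 * a 3- or 4-byte lead in the second-to-last position, or a 4-byte lead
 * in the third-to-last position.
 */
static inline bool
chunk_ends_incomplete(const unsigned char *s)
{
	return s[CHUNK_SIZE - 1] >= 0xC0 ||
		s[CHUNK_SIZE - 2] >= 0xE0 ||
		s[CHUNK_SIZE - 3] >= 0xF0;
}

/*
 * Hypothetical helper: may the current chunk skip full verification?
 * prev is the preceding chunk, or NULL at the start of the input.
 */
static inline bool
ascii_fast_path_ok(const unsigned char *prev, const unsigned char *cur)
{
	assert(cur != NULL);
	if (prev != NULL && chunk_ends_incomplete(prev))
		return false;			/* ASCII here would truncate a sequence */
	return chunk_is_ascii(cur);
}
```

When `ascii_fast_path_ok` returns false for an all-ASCII chunk, a real
validator would still have to run its normal checks (or report an error),
since the dangling lead byte from the previous chunk makes the input invalid.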

> - I had to add a large number of casts to get rid of warnings in the
> magic constants macros. That needs some polish.

This is much nicer now: only one cast is really necessary.

I'm pretty pleased with how it is now, but it could use some thorough
testing for correctness. I'll work on that a bit later.

On my laptop, Clang 10:

master:

 chinese | mixed | ascii
---------+-------+-------
    1081 |   761 |   366

v5:

 chinese | mixed | ascii
---------+-------+-------
     136 |    93 |    54

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v4-0001-Add-noError-argument-to-encoding-conversion-funct.patch application/octet-stream 230.6 KB
v5-0002-Use-SSE-4-for-verifying-UTF-8-text.patch application/octet-stream 49.8 KB
v5-0003-Add-an-ASCII-fast-path-to-non-UTF-8-encoding-veri.patch application/octet-stream 3.9 KB
