From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] verifying UTF-8 using SIMD instructions
Date: 2021-02-15 13:18:09
Message-ID: adf8e27e-4729-007c-2e10-852202128ac9@iki.fi
Lists: pgsql-hackers

On 13/02/2021 03:31, John Naylor wrote:
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi
> <mailto:hlinnaka(at)iki(dot)fi>> wrote:
> >
> > I also tested the fallback implementation from the simdjson library
> > (included in the patch, if you uncomment it in simdjson-glue.c):
> >
> >   mixed | ascii
> > -------+-------
> >     447 |    46
> > (1 row)
> >
> > I think we should at least try to adopt that. At a high level, it looks
> > pretty similar to your patch: you load the data 8 bytes at a time and
> > check whether they are all ASCII. If there are any non-ASCII chars, you
> > check the bytes one by one; otherwise you load the next 8 bytes. Your
> > patch should be able to achieve the same performance, if done right. I
> > don't think the simdjson code forbids \0 bytes, so that will add a few
> > cycles, but still.
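For illustration, a rough sketch (hypothetical, not taken from either patch or from simdjson) of such an 8-bytes-at-a-time ASCII check in C. The mask tests the high bit of each byte; memcpy is used for the load so the sketch is safe even on platforms without unaligned access:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: return true if the 8 bytes starting at s are
 * all ASCII.  A byte is non-ASCII iff its high bit is set, so OR-ing
 * the chunk against a mask of the per-byte high bits tests all 8 at
 * once.
 */
static bool
chunk_is_ascii(const char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));	/* alignment-safe load */
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}
```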
>
> Attached is a patch that does roughly what simdjson fallback did, except
> I use straight tests on the bytes and only calculate code points in
> assertion builds. In the course of doing this, I found that my earlier
> concerns about putting the ascii check in a static inline function were
> due to my suboptimal loop implementation. I had assumed that if the
> chunked ascii check failed, it had to check all those bytes one at a
> time. As it turns out, that's a waste of the branch predictor. In the v2
> patch, we do the chunked ascii check every time we loop. With that, I
> can also confirm the claim in the Lemire paper that it's better to do
> the check on 16-byte chunks:
>
> (macOS, Clang 10)
>
> master:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>     1081 |   761 |   366
>
> v2 patch, with 16-byte stride:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      806 |   474 |    83
>
> patch but with 8-byte stride:
>
>  chinese | mixed | ascii
> ---------+-------+-------
>      792 |   490 |   105
>
> I also included the fast path in all other multibyte encodings, and that
> is also pretty good performance-wise.
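The loop restructuring described above (re-trying the chunked ASCII check on every iteration rather than falling back to byte-at-a-time for the rest of the chunk) might look roughly like this sketch. `utf8_charlen()` here is a simplified, hypothetical stand-in for the real per-character verifier, which is stricter (it also rejects overlong encodings and surrogate code points):

```c
#include <stdint.h>
#include <string.h>

/*
 * Simplified stand-in for a per-character UTF-8 verifier: returns the
 * length of the valid character starting at s, or -1 if it is invalid
 * or truncated.  (The real verifier is stricter: it also rejects
 * overlong encodings and surrogates.)
 */
static int
utf8_charlen(const unsigned char *s, int remaining)
{
	int			len;

	if (s[0] < 0x80)
		len = 1;
	else if ((s[0] & 0xE0) == 0xC0)
		len = 2;
	else if ((s[0] & 0xF0) == 0xE0)
		len = 3;
	else if ((s[0] & 0xF8) == 0xF0)
		len = 4;
	else
		return -1;

	if (len > remaining)
		return -1;
	for (int i = 1; i < len; i++)
	{
		if ((s[i] & 0xC0) != 0x80)	/* must be a continuation byte */
			return -1;
	}
	return len;
}

/*
 * Returns the number of leading bytes of str that form valid UTF-8.
 * On every loop iteration we first re-try the 16-byte ASCII fast path,
 * and only verify a single (possibly multibyte) character when the
 * chunk contains a non-ASCII byte.
 */
static int
verify_utf8(const char *str, int len)
{
	const unsigned char *s = (const unsigned char *) str;
	int			remaining = len;

	while (remaining > 0)
	{
		if (remaining >= 16)
		{
			uint64_t	a,
						b;

			memcpy(&a, s, 8);
			memcpy(&b, s + 8, 8);
			if (((a | b) & UINT64_C(0x8080808080808080)) == 0)
			{
				s += 16;
				remaining -= 16;
				continue;
			}
		}

		/* slow path: one character, then back to the fast path */
		int			l = utf8_charlen(s, remaining);

		if (l < 0)
			break;
		s += l;
		remaining -= l;
	}
	return len - remaining;
}
```

On pure ASCII input this loop takes only the fast path; on mixed input it bounces between the two paths, which is where the branch-predictor behavior John mentions comes into play.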

Cool.

> It regresses from master on pure
> multibyte input, but that case is still faster than PG13, which I
> simulated by reverting 6c5576075b0f9 and b80e10638e3:

I thought the "chinese" numbers above are pure multibyte input, and it
seems to do well on that. Where does it regress? In multibyte encodings
other than UTF-8? How bad is the regression?

I tested this on my first generation Raspberry Pi (chipmunk). I had to
tweak it a bit to make it compile, since the SSE autodetection code was
not finished yet. And I used generate_series(1, 1000) instead of
generate_series(1, 10000) in the test script (mbverifystr-speed.sql)
because this system is so slow.

master:

 mixed | ascii
-------+-------
  1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

 mixed | ascii
-------+-------
  2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is
expensive on this platform.
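If unaligned loads are indeed the culprit, one conceivable shape (purely illustrative, not what either patch does) is to consume bytes one at a time until the pointer is aligned, so the chunked loop only ever does aligned loads; the memcpy then compiles to a single aligned load on typical compilers:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical sketch: count the leading ASCII bytes of s, advancing
 * byte-by-byte until the pointer is 8-byte aligned so that the chunked
 * loop never performs an unaligned load.
 */
static int
count_ascii_prefix(const char *s, int len)
{
	const char *p = s;
	const char *end = s + len;

	/* leading bytes, until p is 8-byte aligned */
	while (p < end && ((uintptr_t) p % sizeof(uint64_t)) != 0)
	{
		if ((unsigned char) *p >= 0x80)
			return p - s;
		p++;
	}

	/* aligned 8-byte chunks */
	while (end - p >= (ptrdiff_t) sizeof(uint64_t))
	{
		uint64_t	chunk;

		/* p is aligned here, so this is a single aligned load */
		memcpy(&chunk, p, sizeof(chunk));
		if (chunk & UINT64_C(0x8080808080808080))
			break;
		p += sizeof(uint64_t);
	}

	/* trailing bytes */
	while (p < end && (unsigned char) *p < 0x80)
		p++;

	return p - s;
}
```

Whether the extra alignment preamble pays off would of course have to be measured on the affected hardware.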

- Heikki
