Re: Perform COPY FROM encoding conversions in larger chunks

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Perform COPY FROM encoding conversions in larger chunks
Date: 2021-01-28 11:36:04
Message-ID: 06d45421-61b8-86dd-e765-f1ce527a5a2f@iki.fi
Lists: pgsql-hackers

On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed
> now if you like.

Thanks for the review!

I did some more rigorous microbenchmarking of patches 1 and 2. I used the
attached test script, which calls the convert_from() function to perform
UTF-8 verification on two large strings, about 60 kB each. One of the
strings is pure ASCII, and the other is an HTML page that contains a mix
of ASCII and multibyte characters.

Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)

           | mixed | ascii
-----------+-------+-------
 master    |  1866 |  1250
 patch 1   |   959 |   507
 patch 1+2 |  1396 |   987

So, the first patch,
0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge
difference, even with pure-ASCII input. That's very surprising, because
there is already a fast-path for pure-ASCII input in pg_verify_mbstr_len().
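
To illustrate what an ASCII fast path in a verification loop can look like,
here is a minimal standalone sketch. It is not the code from the patches or
from pg_verify_mbstr_len(); the names sketch_chunk_is_ascii,
sketch_utf8_seq_len and sketch_utf8_verify are hypothetical, and the 8-byte
chunk size is an assumption made for the example. Unlike PostgreSQL's
verification, it does not treat embedded NUL bytes specially.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if all 8 bytes at p have the high bit clear. */
static bool
sketch_chunk_is_ascii(const unsigned char *p)
{
    uint64_t chunk;

    memcpy(&chunk, p, sizeof(chunk));   /* avoids unaligned access */
    return (chunk & UINT64_C(0x8080808080808080)) == 0;
}

static bool
sketch_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: length (1..4) of the valid UTF-8 sequence starting
 * at s, or 0 if the sequence is invalid or truncated.
 */
static int
sketch_utf8_seq_len(const unsigned char *s, size_t remaining)
{
    unsigned char b0 = s[0];

    if (b0 < 0x80)
        return 1;
    if (b0 >= 0xC2 && b0 <= 0xDF)
        return (remaining >= 2 && sketch_is_cont(s[1])) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        if (remaining < 3 || !sketch_is_cont(s[1]) || !sketch_is_cont(s[2]))
            return 0;
        if (b0 == 0xE0 && s[1] < 0xA0)
            return 0;           /* overlong encoding */
        if (b0 == 0xED && s[1] > 0x9F)
            return 0;           /* UTF-16 surrogate range */
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        if (remaining < 4 || !sketch_is_cont(s[1]) ||
            !sketch_is_cont(s[2]) || !sketch_is_cont(s[3]))
            return 0;
        if (b0 == 0xF0 && s[1] < 0x90)
            return 0;           /* overlong encoding */
        if (b0 == 0xF4 && s[1] > 0x8F)
            return 0;           /* above U+10FFFF */
        return 4;
    }
    return 0;                   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
}

/* Returns the number of leading bytes of s that form valid UTF-8. */
size_t
sketch_utf8_verify(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        int l;

        /* Fast path: skip 8 bytes at once when they are all ASCII. */
        if (len - i >= 8 && sketch_chunk_is_ascii(s + i))
        {
            i += 8;
            continue;
        }

        /* Slow path: verify one character at a time. */
        l = sketch_utf8_seq_len(s + i, len - i);
        if (l == 0)
            break;
        i += (size_t) l;
    }
    return i;
}
```

A caller would compare the return value with the input length: equal means
the whole string verified, anything smaller is the offset of the first
invalid byte.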

Even more surprising was that the second patch
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
actually made things worse again. I thought it would give a modest gain,
but nope.

It seems to me that GCC is not doing a good job of optimizing the loop in
pg_verify_mbstr(). The first patch fixes that, but the second patch
somehow trips up GCC again.

So I also tried this with "gcc -O3" and clang:

Compiled with "gcc -O3"

           | mixed | ascii
-----------+-------+-------
 master    |  1522 |  1225
 patch 1   |   753 |   507
 patch 1+2 |   868 |   507

Compiled with "clang -O2", Debian clang version 11.0.1-2

           | mixed | ascii
-----------+-------+-------
 master    |  1257 |   520
 patch 1   |   899 |   507
 patch 1+2 |   884 |   508

With gcc -O3, the results are better, but the second patch still seems
harmful. With clang, I got the result I expected: almost no difference
with pure-ASCII input, because there's already a fast-path for that, and
a nice speedup with multibyte characters. Still, I was surprised how big
the speedup from the first patch was, and how little additional gain the
second patch gives.

Based on these results, I'm going to commit the first patch, but not the
second one. There are much faster UTF-8 verification routines out there,
using SIMD instructions and whatnot, and we should consider adopting one
of those, but that's future work.

- Heikki

Attachment Content-Type Size
mbverifystr-speed.sql application/sql 942 bytes
