speed up verifying UTF-8

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: speed up verifying UTF-8
Date: 2021-06-02 16:26:41
Message-ID: CAFBsxsHii1-wbwN7vEbpzK03VJJL=EXegJSz6RSXbXZeaUB2jA@mail.gmail.com
Lists: pgsql-hackers

For v10, I've split the patch up into two parts. 0001 uses pure C
everywhere. This is much smaller and easier to review, and gets us the most
bang for the buck.

One concern Heikki raised upthread is that platforms with poor
unaligned-memory access will see a regression. We could easily add an
#ifdef to take care of that, but I haven't done so here.
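For illustration only, such a guard might look like the sketch below. The macro EXAMPLE_SLOW_UNALIGNED_ACCESS and the function example_is_ascii() are hypothetical names, not taken from the patch or from PostgreSQL's headers. The sketch only tests the high bit of each byte and does not reject zero bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical switch: a build would define EXAMPLE_SLOW_UNALIGNED_ACCESS
 * on platforms where unaligned 8-byte loads are expensive.
 */
#ifdef EXAMPLE_SLOW_UNALIGNED_ACCESS
#define EXAMPLE_STRIDE 1
#else
#define EXAMPLE_STRIDE 8
#endif

/* Returns true if none of the len bytes at s has its high bit set. */
static bool
example_is_ascii(const unsigned char *s, size_t len)
{
#if EXAMPLE_STRIDE == 8
	/* Chunked path: test the high bit of 8 bytes with a single mask. */
	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, s, sizeof(chunk));
		if (chunk & UINT64_C(0x8080808080808080))
			return false;
		s += 8;
		len -= 8;
	}
#endif

	/* Byte-at-a-time path: the tail, or everything when the stride is 1. */
	while (len > 0)
	{
		if (*s & 0x80)
			return false;
		s++;
		len--;
	}
	return true;
}
```

With the macro defined, the chunked loop is compiled out and only the byte-wise loop remains, so such platforms would keep their current behavior.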

To recap: On ASCII-only input with storage taken out of the picture,
profiles of COPY FROM show a reduction from nearly 10% down to just over 1%.
In microbenchmarks found earlier in this thread, this works out to about 7
times faster. On multibyte/mixed input, 0001 is a bit faster, but not
really enough to make a difference in copy performance.
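To make the ASCII-only case concrete, here is a minimal sketch of a fast path that examines 8 bytes at a time and falls back to single bytes for the remainder. It is not the code in the attached patch. The names chunk_is_ascii() and ascii_prefix_len() are made up for this example, and treating a zero byte as the end of the valid prefix is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the 8 bytes at s are all nonzero ASCII. */
static bool
chunk_is_ascii(const unsigned char *s)
{
	const uint64_t highbits = UINT64_C(0x8080808080808080);
	uint64_t	chunk;

	/* memcpy avoids dereferencing a possibly misaligned pointer. */
	memcpy(&chunk, s, sizeof(chunk));

	/* Any high bit set means a non-ASCII byte is present. */
	if (chunk & highbits)
		return false;

	/*
	 * All bytes are now known to be <= 0x7f, so adding 0x7f to each cannot
	 * carry into the next byte. The high bit of a byte ends up set exactly
	 * when that byte was nonzero.
	 */
	return ((chunk + UINT64_C(0x7f7f7f7f7f7f7f7f)) & highbits) == highbits;
}

/* Hypothetical helper: length of the leading run of nonzero ASCII bytes. */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	/* Fast path: skip whole 8-byte chunks of nonzero ASCII. */
	while (len - i >= 8 && chunk_is_ascii(s + i))
		i += 8;

	/* Slow path: finish the tail, or locate the first offending byte. */
	while (i < len && s[i] != 0 && s[i] < 0x80)
		i++;

	return i;
}
```

A full verifier would hand the input over to a multibyte check at the point where this prefix ends, which would fit with the smaller gain on mixed input.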

0002 adds the SSE4 implementation on x86-64, and is equally fast on all
input, at the cost of greater complexity.

To reflect the split, I've changed the thread subject and the commitfest
title.
--
John Naylor
EDB: http://www.enterprisedb.com

Attachment                                                        Content-Type              Size
v10-0001-Rewrite-pg_utf8_verifystr-for-speed.patch                application/octet-stream  9.7 KB
v10-0002-Use-SSE-instructions-for-pg_utf8_verifystr-where.patch   application/octet-stream  46.1 KB
