Re: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Hans Buschmann <buschmann(at)nidsa(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "ranier(dot)vf(at)gmail(dot)com" <ranier(dot)vf(at)gmail(dot)com>
Subject: Re: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode
Date: 2022-01-03 18:34:03
Message-ID: CAFBsxsG4OWHBbSDM=sSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschmann(at)nidsa(dot)net> wrote:

> Inspired by the effort to integrate JIT for executor acceleration I thought selected simple algorithms working with array-oriented data should be drastically accelerated by using SIMD instructions on modern hardware.

Hi Hans,

I experimented with SIMD within Postgres last year, so I have
some idea of the benefits and difficulties. I do think we can profit
from SIMD more, but we must be very careful to manage complexity and
maximize usefulness. Hopefully I can offer some advice.

> - restrict on 64-bit architectures
> These are the dominant server architectures, have the necessary data formats and corresponding registers and operating instructions
> - start with Intel x86-64 SIMD instructions:
> This is the vastly most used platform, available for development and in practical use
> - don’t restrict the concept to only Intel x86-64, so that later people with more experience on other architectures can jump in and implement comparable algorithms
> - fallback to the established implementation in postgres in non appropriate cases or on user request (GUC)

These are all reasonable goals, except GUCs are the wrong place to
choose hardware implementations -- it should Just Work.
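
As a rough illustration of "Just Work": the implementation can be
picked once at run time from what the CPU reports, with no
user-visible setting. The sketch below is hypothetical. None of these
function names exist in Postgres, the "SIMD" variant is only a
placeholder that calls the portable code, and the
__builtin_cpu_supports() check assumes GCC or Clang on x86; anywhere
else it simply falls back to the portable path.

```c
#include <stddef.h>
#include <stdint.h>

typedef size_t (*hex_encode_fn) (const uint8_t *src, size_t len, char *dst);

/* Portable byte-at-a-time encoder; always available. */
static size_t
hex_encode_portable(const uint8_t *src, size_t len, char *dst)
{
	static const char hextbl[] = "0123456789abcdef";

	for (size_t i = 0; i < len; i++)
	{
		dst[2 * i] = hextbl[src[i] >> 4];
		dst[2 * i + 1] = hextbl[src[i] & 0xF];
	}
	return len * 2;
}

/* Placeholder for a vectorized variant (assumption: same contract). */
static size_t
hex_encode_simd_placeholder(const uint8_t *src, size_t len, char *dst)
{
	return hex_encode_portable(src, len, dst);
}

/* Ask the CPU what it supports; returns 0 on unknown compilers/platforms. */
static int
cpu_has_fast_path(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("ssse3") != 0;
#else
	return 0;
#endif
}

static size_t hex_encode_choose(const uint8_t *src, size_t len, char *dst);

/* Starts out pointing at the chooser, which replaces itself on first use. */
static hex_encode_fn hex_encode_impl = hex_encode_choose;

static size_t
hex_encode_choose(const uint8_t *src, size_t len, char *dst)
{
	hex_encode_impl = cpu_has_fast_path() ?
		hex_encode_simd_placeholder : hex_encode_portable;
	return hex_encode_impl(src, len, dst);
}

/* Callers only ever see this entry point. */
static size_t
hex_encode_dispatch(const uint8_t *src, size_t len, char *dst)
{
	return hex_encode_impl(src, len, dst);
}
```

The point of the design is that callers never learn which variant
ran, and there is nothing for a user to configure.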

> - coding for maximum hardware usage instead of elegant programming
> Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard implementation in C

-1

Maintaining good programming style is a key goal of the project. There
are certainly non-elegant parts in the code, but that has a cost and
we must consider tradeoffs carefully. I have read some of the
optimized code in glibc and it is not fun. They at least know they are
targeting one OS and one compiler -- we don't have that luxury.

> - focus optimization for the most advanced SIMD instruction set: AVX512
> This provides the most advanced instructions and quite a lot of large registers to aid in latency avoiding

-1

AVX512 is a hodge-podge of different instruction subsets and is
entirely lacking on some recent Intel server hardware. It is also
available from only a single chipmaker thus far.

> - The loops implementing the algorithm are written in NASM assembler:
> NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented and is fast.

> - The loops are mostly independent of operating systems, so all OS’s basing on a NASM obj output format are supported:
> This includes Linux and Windows as the most important ones

> - The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
> The assembly implementation gives you the full control over the registers (unlike intrinsics)

On the other hand, intrinsics are easy to integrate into a C codebase
and relieve us from thinking about object formats. A performance
feature that happens to work only on common OSes is probably fine from
the user point of view, but if we have to add a lot of extra stuff to
make it work at all, that's not a good tradeoff. "Mostly independent"
of the OS is not acceptable -- we shouldn't have to think about the OS
at all when the coding does not involve OS facilities (I/O, processes,
etc).

> As an example I think of pg_dump to dump a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (but naturally not here in my first implementation!).

Seems like a reasonable area to work on, but I've never measured.

> The best result I could achieve was roughly 95 seconds for 1 Million dumps of 1718 KB on a Tigerlake laptop using AVX512. This gives about 18 GB/s source-hexdumping rate on a single core!
>
> In another run with postgres the time to hexdump about half a million tuples with a bytea column yielding about 6 GB of output reduced the time from about 68 seconds to 60 seconds, which clearly shows the postgres overhead for executing the copy command on such a data set.

I don't quite follow -- is this patched vs. unpatched Postgres? I'm
not sure what's been demonstrated.

> The assembler routines should work on most x86-64 operating systems, but for the moment only elf64 and WIN64 output formats are supported.
>
> The standard calling convention is followed mostly in the LINUX style, on Windows the parameters are moved around accordingly. The same assembler-source-code can be used on both platforms.

> I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these based on configure.
>
> I also have no knowledge on other operating systems (MAC-OS etc.)
>
> The calling conventions can be easily adopted if they differ but somebody else should jump in for testing.

As I implied earlier, this is way too low-level. If we have to worry
about obj formats and calling conventions, we'd better be getting
something *really* amazing in return.

> But I really need help by an expert to integrate it in the perl building process.

> I would much appreciate if someone else could jump in for a patch to configure-integration and another patch for .vcxobj integration.

It's a bit presumptuous to enlist others for specific help without
general agreement on the design, especially on the most tedious parts.
Also, here's a general engineering tip: If the non-fun part is too
complex for you to figure out, that might indicate the fun part is too
ambitious. I suggest starting with a simple patch with SSE2 (always
present on x86-64) intrinsics, one that anyone can apply and test
without any additional work. Then we can evaluate if the speed-up in
the hex encoding case is worth some additional complexity. As part of
that work, it might be good to see if some portable improved algorithm
is already available somewhere.
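
To give an idea of the scale I have in mind, here is a minimal sketch
of hex encoding with SSE2 intrinsics. It is an illustration under
stated assumptions, not a proposed patch: the function names are made
up, it has not been benchmarked, and on non-x86 builds it just uses
the scalar loop. Each iteration splits 16 input bytes into high and
low nibbles, maps them to ASCII with a compare-and-add, and
interleaves the two halves into 32 output characters.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HEX_USE_SSE2 1
#endif

/* Scalar reference; also handles the tail of the SIMD loop. */
static size_t
hex_encode_scalar(const uint8_t *src, size_t len, char *dst)
{
	static const char hextbl[] = "0123456789abcdef";

	for (size_t i = 0; i < len; i++)
	{
		dst[2 * i] = hextbl[src[i] >> 4];
		dst[2 * i + 1] = hextbl[src[i] & 0xF];
	}
	return len * 2;
}

#ifdef HEX_USE_SSE2
/* Map bytes holding 0..15 to '0'..'9' / 'a'..'f'. */
static inline __m128i
nibbles_to_ascii(__m128i n)
{
	/* 0xFF in every lane whose nibble is 10..15, else 0 */
	__m128i		gt9 = _mm_cmpgt_epi8(n, _mm_set1_epi8(9));
	/* extra offset that moves 10..15 from past '9' up to 'a'..'f' */
	__m128i		adj = _mm_and_si128(gt9, _mm_set1_epi8('a' - '0' - 10));

	return _mm_add_epi8(_mm_add_epi8(n, _mm_set1_epi8('0')), adj);
}
#endif

/* Hypothetical entry point; dst must have room for 2 * len bytes. */
static size_t
hex_encode(const uint8_t *src, size_t len, char *dst)
{
	size_t		i = 0;

#ifdef HEX_USE_SSE2
	const __m128i mask = _mm_set1_epi8(0x0F);

	for (; i + 16 <= len; i += 16)
	{
		__m128i		in = _mm_loadu_si128((const __m128i *) (src + i));
		/* the 16-bit shift leaks bits across bytes; the mask removes them */
		__m128i		hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask);
		__m128i		lo = _mm_and_si128(in, mask);

		hi = nibbles_to_ascii(hi);
		lo = nibbles_to_ascii(lo);

		/* interleave so each byte's high-nibble digit comes first */
		_mm_storeu_si128((__m128i *) (dst + 2 * i),
						 _mm_unpacklo_epi8(hi, lo));
		_mm_storeu_si128((__m128i *) (dst + 2 * i + 16),
						 _mm_unpackhi_epi8(hi, lo));
	}
#endif
	/* remaining 0..15 bytes (or everything, without SSE2) */
	hex_encode_scalar(src + i, len - i, dst + 2 * i);
	return len * 2;
}
```

Something of this size needs no assembler, no new build steps, and
can be compared directly against the existing loop.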

> There is much room for other implementations (checksum verification/setting, aggregation, numeric datatype, merging, generate_series, integer and floating point output …) which could be addressed later on.

Float output has already been pretty well optimized. CRC checksums
already have a hardware implementation on x86 and Arm. I don't know of
any practical workload where generate_series() is too slow.
Aggregation is an interesting case, but I'm not sure what the current
bottlenecks are.

--
John Naylor
EDB: http://www.enterprisedb.com
