|From:||Hans Buschmann <buschmann(at)nidsa(dot)net>|
|Cc:||"michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "ranier(dot)vf(at)gmail(dot)com" <ranier(dot)vf(at)gmail(dot)com>, Hans Buschmann <buschmann(at)nidsa(dot)net>|
|Subject:||Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode|
Inspired by the effort to integrate JIT for executor acceleration, I thought that selected simple algorithms working on array-oriented data could be drastically accelerated by using SIMD instructions on modern hardware.
I want to introduce this style of programming with the example of hex_encode:
- operates on arrays (bytea)
- simple algorithm
- in some situations it partially limits performance (e.g. pg_dump)
The main goal is to accelerate common cases on the most common hardware by exploiting all the resources the hardware delivers.
The following guidelines took me to a first implementation:
- restrict to 64-bit architectures
These are the dominant server architectures and have the necessary data formats, registers and operating instructions
- start with Intel x86-64 SIMD instructions:
This is by far the most widely used platform, available for development and in practical use
- don’t restrict the concept to Intel x86-64 only, so that people with more experience on other architectures can later jump in and implement comparable algorithms
- fall back to the established Postgres implementation in unsuitable cases or on user request (GUC)
- implementation of leaf functions/procedures in assembly language
These consist mostly of a central loop without calling subroutines or doing additional branching
- coding for maximum hardware usage instead of elegant programming
Once tested, the simple algorithm works as advertised and replaces most of the execution of the standard C implementation
- isolated footprint by integrating it only in the specific subroutine (here hex_encode)
This ensures that the requirements for fast execution are met (e.g. buffer sizes) and no repeated checks are needed as in a library use case.
- trying to keep both vector execution ports busy with useful work at all times by avoiding latency stalls
- trying to access memory in a linear fashion (reading from the input buffer, writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512
This provides the most advanced instructions and quite a lot of large registers to aid in latency hiding
- if possible provide fallback implementations on older SIMD standards (e.g. AVX2 or SSE2)
This is useful on many older server and client processors, but with their smaller number of registers, latency hiding and full execution queues cannot be fully achieved.
- The loops implementing the algorithm are written in NASM assembler:
NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented and is fast.
- The loops are mostly independent of the operating system, so every OS for which NASM has an object output format is supported:
This includes Linux and Windows as the most important ones
- The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
The assembly implementation gives you the full control over the registers (unlike intrinsics)
- Multiple dependency chains work interleaved to minimize latencies:
The code is often interleaved and uses almost all available registers.
- Some instructions (Moves, zeroing) are executed outside the processor execution ports:
These don’t consume execution cycles on a port, but their latency still has to be considered.
- Some vector instructions (multiply add) have latencies of 5 for example:
This means that after the instruction is issued, the processor has to wait 5 cycles until the result can be used in the same dependency chain. To avoid this and keep both vector execution ports (p0 and p5) busy, you need about 9 other instructions in between working on other streams of the algorithm to maximize hardware usage and overall performance.
- All loops are implemented as separate C-callable functions (according to the OS calling convention):
They are all leaf functions, calling no other subroutines.
- The decision which implementation is chosen is made on the caller side by a special dispatcher routine:
The caller handles the architectural capabilities (available instruction sets) and knows the required work: there is usually a minimum amount of work below which calling a provided implementation is not efficient.
- Loops should run at least 2-4 times to compensate for the initialization overhead:
This implies a certain minimum work count for each specific SIMD implementation
- The loops terminate after detecting an error (e.g. wrong input data) and return the successfully completed amount of work:
The standard linear implementation takes over with the already established error-handling.
- The loops work optimally with some extra output buffer space at the end to be able to overshoot in the last round:
Nonetheless, the correct amount of work is returned to the caller, and one vector size of output buffer following the real result is zeroed out (currently disabled!)
- the loop may preload some data after the input buffer but ensures that the following page boundary is never crossed, to avoid any access violation:
This does no harm to the memory system because the output buffer has supplemental space at the end, but it could be changed to leave the tail handling to the standard implementation if deemed unsupportable (as for now).
(to be continued...)