AW: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

From: Hans Buschmann <buschmann(at)nidsa(dot)net>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "ranier(dot)vf(at)gmail(dot)com" <ranier(dot)vf(at)gmail(dot)com>, Hans Buschmann <buschmann(at)nidsa(dot)net>
Subject: AW: Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode
Date: 2021-12-31 15:34:55
Message-ID: 1640964901207.32208@nidsa.net
Lists: pgsql-hackers

(continued)

PERFORMANCE

First Level: execution units (focused on AVX512)

Every modern processor has at least two vector execution units (ports p1 and p5 on Intel) which execute different sets of instructions in a pipelined fashion. Some simple classes of instructions (logical, arithmetic) can be executed on both ports. The result of a short operation is available in the next cycle for another instruction; operations feeding each other this way form a dependency chain.
Longer operations deliver their results only after several more clock cycles, so the latency increases from a minimum of 1 cycle to higher numbers.

This constellation implies that a single dependency chain can never exhaust the full processor capabilities. To hide these latencies, multiple interleaved dependency chains should be used.
Instructions with long latencies (e.g. memory accesses) should be issued well in advance of using their results.
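
As an illustration, a minimal C sketch (mine, not from the patch): summing an array in a single accumulator forms one long dependency chain, while splitting the work over several independent accumulators lets the pipelined execution units overlap the additions.

#include <stddef.h>
#include <stdint.h>

/* One dependency chain: every addition waits for the previous one. */
uint64_t sum_single(const uint64_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four interleaved chains: the adds are independent and can be
 * pipelined; the chains are only joined once at the end.
 * (Assumes n is a multiple of 4 for brevity.) */
uint64_t sum_interleaved(const uint64_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4)
    {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

On pipelined hardware the second version can approach the per-port throughput limit, while the first is bound by the add latency.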

In most cases only the two vector execution ports are the ultimate bottleneck, since the processor can execute memory reads, memory writes, scalar instructions and branches on other specialized units, or avoid them entirely (register zeroing).

The hex_encode algorithm executes 5 instructions (uops, to be precise) on p5, 3 on p1 (or either port), plus 1 load and 2 store uops per vector.

Assuming for simplicity a processor running at 2.5 GHz, the 5 p5-uops per vector limit throughput to 2.5 GHz / 5 = 0.5 billion vectors per second, which gives 64 bytes * 0.5 billion = 32 GB of source processed per second.
In normal database units this is a really HUGE number (and it uses only ONE core!).
But in this case twice the amount of data compared to the source has to be written back to memory (64 GB/sec), which exceeds the memory bandwidth of normal desktop processors.

As another example take the checksum algorithm, which is only read-intensive and whose bottleneck path is 3 uops on p5 per vector.

On a 3 GHz processor, checksum can therefore process 3 GHz / 3 = 1 billion vectors, i.e. 64 GB, per second and core.

It will be interesting to check the performance levels of the upcoming XEON generation (Sapphire Rapids in 2022), which will have much higher memory bandwidth (8 channels of DDR5, up to 12 on AMD Genoa), will offer special models with HBM2 memory stacks, and has 2 execution units for stores to match the read capacity of likewise 2 loads per cycle.

Second Level: alignment, caches and memory

Older processor generations paid a VERY big penalty for page-split accesses to memory, which occur on vector data that is not naturally aligned.

Given future developments I would even consider 128 byte or 256 byte alignment, since 1024 or 2048 bit vectors may become possible (already specified in the ARM architecture).

On the level of caches one must consider "cache thrashing" when accesses exceed the associativity of a cache. In some algorithms (very highly parallel checksum calculations with a copy function) too many parallel accesses could overload a single cache line address. In these cases the algorithm (on fixed-size blocks) can be started slightly delayed, so that some algorithm chains access vector n while others access vector n+1, interleaved in the execution loop.
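
A small C sketch of this staggering idea (purely illustrative; the XOR checksum and all names are mine): each parallel chain starts at a different vector offset, so the chains never hit the same line address of their respective pages in the same iteration.

#include <stddef.h>
#include <stdint.h>

#define NCHAINS   4    /* pages checksummed in parallel */
#define NVECTORS  128  /* 64-byte vectors per 8 KB block */

/* Staggered loop: chain k starts at offset k, so in every iteration
 * the chains touch different cache line addresses instead of all
 * hitting the same offset within their respective pages. */
void checksum_staggered(const uint64_t pages[NCHAINS][NVECTORS * 8],
                        uint64_t sums[NCHAINS])
{
    for (size_t i = 0; i < NVECTORS; i++)
        for (size_t k = 0; k < NCHAINS; k++)
        {
            size_t v = (i + k) % NVECTORS;     /* staggered index */
            for (size_t j = 0; j < 8; j++)     /* one 64-byte vector */
                sums[k] ^= pages[k][v * 8 + j];
        }
}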

Memory should be accessed in natural order to maximize the use of processor cache prefetching.

All accesses should be optimized to keep data in registers where possible: long memory-access latencies and some initial instructions can be hidden by issuing them early, with the results used only much later in time.

The memory latencies lead to data preloading, where the data for the next round of the loop is loaded at the first possible moment, as soon as the target registers are available. This is crucial for fighting latency in many algorithms.
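
In C terms the preloading pattern looks roughly like this (a sketch under hypothetical names, with a trivial stand-in for the real work):

#include <stddef.h>
#include <stdint.h>

/* Preloading loop: the data for iteration i+1 is requested before
 * the work on iteration i begins, so the (long-latency) load of the
 * next block overlaps with the current processing. */
void process_preloaded(const uint64_t *src, uint64_t *dst, size_t nblocks)
{
    if (nblocks == 0)
        return;

    uint64_t cur = src[0];                   /* preload first block */
    for (size_t i = 0; i < nblocks; i++)
    {
        uint64_t next = (i + 1 < nblocks) ? src[i + 1] : 0;  /* issue early */
        dst[i] = cur * 0x0101010101010101ULL;  /* stand-in for real work */
        cur = next;                          /* consume the early load */
    }
}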

Third Level: Internal data structures

Vector operations work best with array-oriented structures (here a bytea datatype or a shared buffer block for checksum calculation).
Gathering individual scalar data (32/64 bit scalars) into vectors is much slower and really stresses the memory subsystem.

This implies a shift in focus towards "struct of arrays" rather than "array of structs", which seems difficult in postgres due to its established structures and long heritage.
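
For readers unfamiliar with the distinction, a generic C illustration (nothing postgres-specific):

#include <stddef.h>
#include <stdint.h>

#define N 1024

/* Array of structs: the values of one field are 16 bytes apart, so
 * a vector load cannot fetch 8 or 16 of them in one go. */
struct item_aos { int32_t key; int32_t val; int64_t extra; };
struct item_aos items[N];

/* Struct of arrays: each field is a dense array; vals[0..N-1] can be
 * processed with full-width vector loads and stores. */
struct items_soa {
    int32_t keys[N];
    int32_t vals[N];
    int64_t extras[N];
} items2;

int64_t sum_vals_aos(void)
{
    int64_t s = 0;
    /* Strided access: each cache line brings in mostly unneeded bytes. */
    for (size_t i = 0; i < N; i++)
        s += items[i].val;
    return s;
}

int64_t sum_vals_soa(void)
{
    int64_t s = 0;
    /* Dense access: a compiler (or hand-written SIMD) vectorizes this
     * trivially and uses the full memory bandwidth. */
    for (size_t i = 0; i < N; i++)
        s += items2.vals[i];
    return s;
}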

By exploring the code more deeply (deeper than my knowledge reaches so far) it should be easy to identify many more places with simple algorithms working on array structures.

Fourth Level: Vertical integration

The basis of most algorithms is loading the data into registers, doing some algorithmic calculations and writing the results out.
Subsequent steps are coded in another layer (e.g. copying to storage, trimming the data for output etc.). This often requires reading the data again and doing some other transformations.

Vertical integration combines some simple steps for better memory utilization.

As an example, think of pg_dump dumping a huge amount of bytea data (not uncommon in real applications). Most of these data live in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (though naturally not yet in my first implementation!).
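
Schematically, the gain comes from touching the source bytes exactly once (a hypothetical scalar sketch, not the patch code):

#include <stddef.h>
#include <stdint.h>

static const char hexdig[] = "0123456789abcdef";

/* Unfused: copy the datum out of the page, then encode the copy.
 * Fused: encode each byte straight from the page into the output
 * buffer, so the source is read exactly once. */
size_t hex_encode_fused(const uint8_t *page_data, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
    {
        out[2 * i]     = hexdig[page_data[i] >> 4];
        out[2 * i + 1] = hexdig[page_data[i] & 0x0F];
    }
    return 2 * len;   /* bytes written to the output buffer */
}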

Fifth Level: Pooling

Some algorithms are so fast that they need to work on multiple data streams at once to fully utilize a processor core. One example is checksum calculation.
To saturate the processor capabilities with large vectors you have to checksum multiple pages in parallel (e.g. 2, 4 or 8).
This occurs often in real life (loading shared buffers into memory, flushing shared buffers to disk, precaching the shared buffers etc.).

Some pooling (collecting up to 16 shared buffer blocks in a pool) allows fast checksumming of blocks that are currently processed in a serial fashion.
This requires some adaptation in a few isolated parts of the postgres code base and turns a serial procedure into parallel processing of objects treated in the same fashion at (nearly) the same time.
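
A hypothetical C sketch of such a pool (all names are mine; the scalar XOR routine merely stands in for the real vector code):

#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 16
#define BLCKSZ    8192

typedef struct
{
    const uint8_t *pages[POOL_SIZE];
    uint32_t      *results[POOL_SIZE];
    int            count;
} checksum_pool;

/* Stand-in for the vector routine that checksums n pages at once. */
static void checksum_blocks(const uint8_t **pages, uint32_t **results, int n)
{
    for (int k = 0; k < n; k++)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < BLCKSZ; i++)
            sum ^= (uint32_t) pages[k][i] << ((i & 3) * 8);
        *results[k] = sum;
    }
}

/* Callers add blocks one by one; the pool triggers one batched
 * checksum call once POOL_SIZE blocks have been collected. */
void pool_add(checksum_pool *pool, const uint8_t *page, uint32_t *result)
{
    pool->pages[pool->count] = page;
    pool->results[pool->count] = result;
    if (++pool->count == POOL_SIZE)
    {
        checksum_blocks(pool->pages, pool->results, pool->count);
        pool->count = 0;
    }
}

void pool_flush(checksum_pool *pool)   /* checksum any remaining blocks */
{
    if (pool->count > 0)
        checksum_blocks(pool->pages, pool->results, pool->count);
    pool->count = 0;
}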

BENCHMARKS:

I have included a little benchmark program. It is not very sophisticated or fancy, but it allows estimating the performance of commonly used processors.

It requires nasm to be installed/downloaded (on Linux or Windows).

It executes the hexdump algorithm one million times on the nasm binary (current version 2.15.05).

The benchmark simply runs (for about 1000 sec); the user has to time it himself.
The nasm binary (used as the benchmark source data) is included in

HEX_BENCH_DATA_1300KB.asm

(please adjust the location where you downloaded nasm.exe on Windows).

The binary (per architecture) has a size of 1356 KB on Windows and 1718 KB on Linux.

The commands to build the binary are (also found in hex_bench.asm):

on Windows:

:: commands to build on Windows (nasm and golink in the path)
nasm -f WIN64 -g hex_bench.asm -l hex_bench.lis
nasm -f WIN64 -g hex_x86_64.asm -l hex_x86_64.lis
nasm -f WIN64 -g HEX_BENCH_DATA_1300KB.asm
golink /console hex_bench.obj hex_x86_64.obj HEX_BENCH_DATA_1300KB.obj

Golink is a small utility linker on Windows, found at:

http://www.godevtool.com/

on Linux:

# commands to build on LINUX
nasm -f elf64 -g hex_bench.asm -l hex_bench.lis
nasm -f elf64 -g hex_x86_64.asm -l hex_x86_64.lis
nasm -f elf64 -g HEX_BENCH_DATA_1300KB.asm
ld -o hex_bench hex_bench.o hex_x86_64.o HEX_BENCH_DATA_1300KB.o

The selected hex_encode routine is hardcoded to hex_encode_avx512bw (on processors not supporting AVX512, please choose another implementation by changing the comments in hex_bench.asm).

The best result I could achieve was roughly 95 seconds for 1 million dumps of 1718 KB on a Tigerlake laptop using AVX512. This gives a source-hexdumping rate of about 18 GB/s on a single core!

In another run with postgres, hexdumping about half a million tuples with a bytea column and yielding about 6 GB of output took 60 seconds instead of the previous 68 seconds, which clearly shows the postgres overhead of executing the COPY command on such a data set.

SQL> COPY my_table (col_bytearray) TO 'N:/ZZ_SAV/my_hexdump.sql';

(This was on a customer's dataset, not reproduced here.)

POSTGRES INTEGRATION (HELP NEEDED)

The architecture-dependent introduction of vector routines requires some integration effort in Postgres.

I have designed a concept for easy integration and extensibility, but some concrete steps need support from others due to my restricted knowledge of the whole system.

(For now this global configuration sits at the top of encode.c, but it certainly must be moved to a more adequate place for initialization.)

The main concept tries to match the CPU capabilities with the requirements of a certain implementation. This applies not only to hex_encode but to an arbitrary number of algorithms implemented in an accelerated version (here SIMD vectors, but others may be possible too).

We have a global array called valid_impl_id_arr indicating, for each algorithm, the highest implementation capable of running on the current CPU.

An implementor defines an algorithm and gets an invariant ID (here ALGORITHM_ID_HEX_ENCODE; these should be kept in a global header).

These IDs are valid for all architectures, even if no accelerated version exists yet.

In internal arrays (see hex_x86_64.asm) all possible implementations are stored together with their requirements (CPU features, minimum length etc.).

In the initialization phase of the running executable (backend, or frontend in the future) the current cpu_capabilities are checked once and the maximum valid implementation index is stored in the globally visible valid_impl_id_arr.

The highest requirements have the highest index, so the capabilities are checked in decreasing index order.

For example (hex_encode): we have 4 implementations, but on an AVX2-only machine valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] is only set to 3, because the requirements of AVX512BW are not met. Index zero always means that the algorithm has no valid implementation or that the CPU lacks sufficient capabilities.
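
In C the dispatch could look roughly like this (my sketch of the concept; valid_impl_id_arr, ALGORITHM_ID_HEX_ENCODE and hex_encode_avx512bw are names from the proposal, the rest is hypothetical; a real version would also honor the per-implementation minimum length):

#include <stddef.h>
#include <stdint.h>

#define ALGORITHM_ID_HEX_ENCODE 1     /* invariant algorithm ID */

extern int valid_impl_id_arr[];       /* filled once at initialization */

/* Hypothetical implementation indices; a higher index means higher
 * CPU requirements. Index 0 means "no accelerated version". */
enum { HEX_IMPL_NONE = 0, HEX_IMPL_SSE2 = 1, HEX_IMPL_SSSE3 = 2,
       HEX_IMPL_AVX2 = 3, HEX_IMPL_AVX512BW = 4 };

extern uint64_t hex_encode_scalar(const char *src, size_t len, char *dst);
extern uint64_t hex_encode_avx2(const char *src, size_t len, char *dst);
extern uint64_t hex_encode_avx512bw(const char *src, size_t len, char *dst);

uint64_t hex_encode(const char *src, size_t len, char *dst)
{
    switch (valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE])
    {
        case HEX_IMPL_AVX512BW:
            return hex_encode_avx512bw(src, len, dst);
        case HEX_IMPL_AVX2:
            return hex_encode_avx2(src, len, dst);
        /* ... lower levels ... */
        default:                      /* index 0: plain C fallback */
            return hex_encode_scalar(src, len, dst);
    }
}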

To disable an algorithm from being accelerated altogether, masking by an algorithm_disable_mask is provided, which is normally all zero but can be set to disable certain algorithms by ORing in (1 << ALGORITHM_ID_xxx) constants. This emergency disablement should be kept in a GUC and applied only at image initialization time.

The CPU capabilities are determined by cpuid instructions (on x86-64) and defined in cpu_capabilities_x86_64.asm.

But this scheme is not restricted to the Intel ISA. Other hardware architectures (most probably ARM, POWER or RISC-V) are identified by different CPU_IS_ARCH_xxx constants (numbers from 1-7), and implementers determine the specific CPU capabilities in their own fashion, which may be totally different from the Intel ISA.

So every CPU has its cpu_capabilities_unmasked value as a unique int64 value.
This value is normally copied 1:1 to the global cpu_capabilities, but for testing or in an emergency it is masked by a configuration mask simulating a certain CPU. This allows a developer to test the implementations for lower-class CPUs without needing the specific hardware.
This cpu_capabilities_mask defaults to -1 (all bits 1) and should also be derived from a GUC.
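
Put together, the initialization could be sketched as follows (my illustration; cpu_capabilities, cpu_capabilities_mask and algorithm_disable_mask are the names described above, the helper function is hypothetical):

#include <stdint.h>

extern int64_t cpu_capabilities_unmasked;   /* from cpuid, set once */
extern int64_t cpu_capabilities;            /* globally visible copy */
extern int     valid_impl_id_arr[];         /* one slot per algorithm */

/* GUC-backed masks, defaulting to "no restriction". */
int64_t cpu_capabilities_mask  = -1;        /* all bits 1 */
int64_t algorithm_disable_mask = 0;         /* no algorithm disabled */

/* Returns the highest implementation index whose CPU-feature
 * requirements are met (0 if none); details omitted here. */
extern int highest_supported_impl(int algorithm_id, int64_t capabilities);

void pgva_initialize(int n_algorithms)
{
    cpu_capabilities = cpu_capabilities_unmasked & cpu_capabilities_mask;

    for (int alg = 0; alg < n_algorithms; alg++)
    {
        if (algorithm_disable_mask & ((int64_t) 1 << alg))
            valid_impl_id_arr[alg] = 0;     /* emergency disablement */
        else
            valid_impl_id_arr[alg] =
                highest_supported_impl(alg, cpu_capabilities);
    }
}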

For up to 63 algorithms we thus need 2 int64 GUC values to selectively disable certain parts of the accelerated implementations.

Help is greatly appreciated to code these concepts with GUCs and to put the globals and their initialization in the right place.

TOOL CHAIN (HELP NEEDED)

On x86-64 I use nasm (the Netwide Assembler) because it is well maintained, fast, instruction-complete and covers multiple object formats.

The assembler routines should work on most x86-64 operating systems, but for the moment only elf64 and WIN64 output formats are supported.

The standard calling convention is followed, mostly in the LINUX style; on Windows the parameters are moved around accordingly. The same assembler source code can be used on both platforms.

Website for downloading the Windows binary / rpm repo:

https://nasm.us/

I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these depend on configure.

I also have no knowledge of other operating systems (macOS etc.).

The calling conventions can easily be adapted if they differ, but somebody else would have to jump in for testing.

If absolutely needed, nasm allows cross-assembling for a different platform, so the objects could be provided in a library for these cases.

For Windows the nasm support must be integrated into the generation of the *.vcxproj for Visual Studio.

I found the VSNASM project on github, which explains how to integrate nasm into Visual Studio:

https://github.com/ShiftMediaProject/VSNASM

But I really need help from an expert to integrate it into the perl build process.

My internal development on Windows uses manual assembling/linking so far.

I would much appreciate it if someone else could jump in with a patch for configure integration and another patch for .vcxproj integration.

OUTLOOK

Once the toolchain and global postgres integration are done (these are totally new for me), this kind of vector (or perhaps matrix) acceleration is quite easy.

By identifying simple algorithms and using some architecture knowledge of the chosen platform, a new implementation is easily coded and debugged because the complexity is often limited (performance optimization may be a challenge).

The integration into postgres remains quite local and is not very invasive.

The acceleration of a specific algorithm is really huge, even though it shifts the focus to other bottlenecks in the current code base. It makes the base algorithms almost disappear in CPU usage and extends the scale to the dimension of terabytes.

The whole architecture is thereby not limited to the Intel ISA (even if this is certainly the most common real-world use case) and can easily be adapted to other hardware architectures.

I have some other algorithms already in the pipeline, foremost hex_decode (which must still be debugged and checked for error handling); during its implementation I also stumbled over base64_encode/decode, whose implementations are already coded as well.

I only want to start with a first project (hex_encode/hex_decode), targeting PG15 if possible and approved by the community. Then I'll try to polish/debug/document the whole project to bring it to a committable state.

There is much room for other implementations (checksum verification/setting, aggregation, numeric datatype, merging, generate_series, integer and floating point output …) which could be addressed later on.

Due to my different background (not really a C hacker) I need some help from experts in specific areas. In return, for coding Intel vector assembly in this project I can provide help with tips and reviews.

I have CC'ed some people from the project who offered help or were already involved in this coding area.

Thank you all very much for your patience with this new project

Hans Buschmann

Attachment Content-Type Size
0001_hex_encode.patch application/octet-stream 121.0 KB
hex_bench.tar application/x-tar 110.0 KB
