| From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> | 
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> | 
| Cc: | Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Smith <smithpb2250(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: define pg_structiszero(addr, s, r) | 
| Date: | 2024-11-06 11:38:30 | 
| Message-ID: | ZytVNvIbc1vc0qZz@ip-10-97-1-34.eu-west-3.compute.internal | 
| Lists: | pgsql-hackers | 
Hi,
On Wed, Nov 06, 2024 at 12:16:33PM +1300, David Rowley wrote:
> On Wed, 6 Nov 2024 at 04:03, Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > Another option could be to use SIMD instructions to check multiple bytes
> > is zero in a single operation. Maybe just an idea to keep in mind and experiment
> > if we feel the need later on.
> 
> Could do. I just wrote it that way to give the compiler flexibility to
> do SIMD implicitly.
ohhh, great, thanks!
> That seemed easier than messing around with SIMD
> intrinsics.
I had in mind to use SIMD intrinsics actually when posting the SIMD idea but...
> I guess the compiler won't use SIMD with the single
> size_t-at-a-time version as it can't be certain it's ok to access the
> memory beyond the first zero word. Because I wrote the "if" condition
> using bitwise-OR, there's no boolean short-circuiting, so the compiler
> sees it must be safe to access all the memory for the loop iteration.
That's a better idea! Yeah, I think the compiler can now see that all the
comparisons can be done in parallel and combined with a single OR operation (so
it's a good candidate for SIMD optimization).
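To make the short-circuiting point concrete, here is a minimal sketch of the "size_t * 8" idea. It is not the code from allzeros.c or from the attached patch: the function name memory_is_all_zeros_sketch is hypothetical, the words are OR-ed together directly rather than OR-ing individual comparisons, and memcpy is used only to sidestep alignment handling, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (name and details are assumptions, not the patch).
 * Returns true if all 'len' bytes starting at 'ptr' are zero.
 */
static bool
memory_is_all_zeros_sketch(const void *ptr, size_t len)
{
    const unsigned char *p = (const unsigned char *) ptr;
    const unsigned char *end;
    const size_t chunk = sizeof(size_t) * 8;

    assert(ptr != NULL);
    end = p + len;

    /*
     * Process 8 words per iteration. All 8 words are loaded
     * unconditionally and combined with bitwise OR, so there is no
     * short-circuit that would stop the compiler from loading them
     * together (and possibly vectorizing the loop body).
     */
    while ((size_t) (end - p) >= chunk)
    {
        size_t w[8];

        /* memcpy avoids unaligned size_t loads; compilers inline it */
        memcpy(w, p, sizeof(w));

        if ((w[0] | w[1] | w[2] | w[3] |
             w[4] | w[5] | w[6] | w[7]) != 0)
            return false;

        p += chunk;
    }

    /* Check the remaining tail (fewer than 8 words) byte by byte */
    while (p < end)
    {
        if (*p++ != 0)
            return false;
    }

    return true;
}
```

Had the condition been written with || instead of |, the language would require each word to be tested in order and evaluation to stop at the first non-zero one, which is what keeps the compiler from assuming the later loads are safe. Whether SIMD instructions are actually emitted still depends on the compiler version and flags.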
> If I use -march=native or -march=znver2 on my Zen2 machine, gcc does
> use SIMD operators.  Clang uses some 128-bit registers without
> specifying -march:
> 
> drowley(at)amd3990x:~$ gcc -O2 allzeros.c -march=native -o allzeros &&
> for i in {1..3}; do ./allzeros; done
> char: done in 1940539 nanoseconds
> size_t: done in 261731 nanoseconds (7.41425 times faster than char)
> size_t * 4: done in 130415 nanoseconds (14.8797 times faster than char)
> size_t * 8: done in 70031 nanoseconds (27.7097 times faster than char)
> char: done in 3030132 nanoseconds
> size_t: done in 477044 nanoseconds (6.35189 times faster than char)
> size_t * 4: done in 123551 nanoseconds (24.5254 times faster than char)
> size_t * 8: done in 68549 nanoseconds (44.2039 times faster than char)
> char: done in 3214037 nanoseconds
> size_t: done in 256901 nanoseconds (12.5108 times faster than char)
> size_t * 4: done in 126017 nanoseconds (25.5048 times faster than char)
> size_t * 8: done in 73167 nanoseconds (43.9274 times faster than char)
> 
Thanks for the tests! Out of curiosity, I ran it with gcc 11.4.0 (which does not
generate SIMD instructions here) and got:
$ gcc -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2655385 nanoseconds
size_t: done in 476021 nanoseconds (5.57829 times faster than char)
size_t SIMD DAVID: done in 174816 nanoseconds (15.1896 times faster than char)
or
$ gcc -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 2681146 nanoseconds
size_t: done in 395041 nanoseconds (6.78701 times faster than char)
size_t SIMD DAVID: done in 175608 nanoseconds (15.2678 times faster than char)
=> It's faster than the size_t one.
But of course, it's even faster with SIMD:
$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5318674 nanoseconds
size_t: done in 443591 nanoseconds (11.99 times faster than char)
size_t SIMD DAVID: done in 179650 nanoseconds (29.6058 times faster than char)
or
$ /usr/local/gcc-14.1.0/bin/gcc-14.1.0 -march=native -O2 allzeros_simd.c -o allzeros_simd ; ./allzeros_simd
char: done in 5319534 nanoseconds
size_t: done in 426599 nanoseconds (12.4696 times faster than char)
size_t SIMD DAVID: done in 128687 nanoseconds (41.337 times faster than char)
So, I don't see any reason not to use this SIMD approach: please find v7
attached.
Regards,
-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size | 
|---|---|---|
| v7-0001-Optimize-pg_memory_is_all_zeros.patch | text/x-diff | 3.7 KB | 