Re: Improving spin-lock implementation on ARM.

From: Krunal Bauskar <krunalbauskar(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving spin-lock implementation on ARM.
Date: 2020-11-30 03:59:37
Message-ID: CAB10pyajgoCBSCoQ7MvX1_fmh5x8x2qhvHB96t18OSZ_U40NQw@mail.gmail.com
Lists: pgsql-hackers

On Sun, 29 Nov 2020 at 22:23, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> On Sat, Nov 28, 2020 at 1:31 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > I guess that might depend on the implementation of CAS and TAS. I bet
> > usage of CAS in spinlock gives advantage when ldxr/stxr are used, but
> > not when swpal/casa are used. I found out that I can force clang to
> > use swpal/casa by setting "-march=armv8-a+lse". I'm going to make
> > some experiments on a multicore AWS graviton2 instance with different
> > atomic implementation.
>
> I've made some benchmarks on c6gd.16xlarge ec2 instance with graviton2
> processor of 64 virtual CPUs (graphs and raw results are attached).
> I've analyzed two patches: spinlock using cas by Krunal Bauskar, and
> my implementation of lwlock using ldrex/strex. My arm lwlock patch
> has the same idea as my previous patch for power: we can put lwlock
> attempt logic between ldrex and strex. Unlike my previous power
> patch, the arm patch doesn't contain assembly: instead I've used
> C-wrappers over ldrex/strex.
>
> The first series of experiments I've made using standard compile
> options. So, LSE instructions from ARM v8.1 weren't used. Atomics
> were implemented using ldrex/strex pairs.
>
> In the read-only benchmark, both spinlock (cas-spinlock graph) and
> lwlock (ldrex-strex-lwlock graph) patches give observable performance
> gain of similar value. However, performance of the combination of these
> patches (ldrex-strex-lwlock-cas-spinlock graph) is close to
> performance of the unpatched version. That may be counterintuitive, but
> I've rechecked that multiple times.
>
> In the read-write benchmark, both spinlock and lwlock patches give
> more significant performance gain, and the lwlock patch gives more effect
> than the spinlock patch. Notably, the combination of patches now
> gives some cumulative effect instead of the counterintuitive slowdown.
>
> Then I've tried to compile postgres with LSE instructions using the
> "-march=armv8-a+lse" flag with clang (graphs with -lse suffix). The
> effect of LSE is HUGE!!! The unpatched version with LSE is several times
> faster than any version without LSE at high concurrency. In both
> read-only and read-write benchmarks the spinlock patch doesn't show any
> significant difference. The lwlock patch shows a great slowdown with
> LSE. Notably, in the read-write benchmark, the lwlock patch shows worse
> results than the unpatched version without LSE. Probably, combining
> different atomics implementations isn't a good idea.
>
> It seems that ARM Kunpeng 920 should support ARM v8.1. I wonder if
> the published benchmark results were made with LSE. I suspect that
> they were not. It would be nice to repeat the same benchmarks with LSE.
> I'd like to ask Krunal Bauskar and Amit Khandekar to repeat these
> benchmarks with LSE.
>
> My preliminary conclusions are as follows:
> 1) Since the effect of LSE is so huge, we should advise users of
> multicore ARM servers to compile PostgreSQL with LSE support. We
> probably should provide separate packaging for ARM v8.1 and higher
> (packages for ARM v8 are still needed for Raspberry Pi etc).
> 2) It seems that atomics in ARM v8.1 become very similar to x86
> atomics, and they don't need special optimizations. And I think ARM
> v8 processors don't have so many cores and aren't so heavily used in
> high-concurrency environments. So, special optimizations for ARM v8
> probably aren't worth it.
>

Thanks for the detailed results.

1. The results we shared were obtained without LSE enabled, so they use the
traditional load/store-exclusive approach.

2. As you pointed out, LSE is available only starting with ARM v8.1, and not
all machines tagged aarch64 are ARM v8.1 compatible.
This means we would need a separate package. A better option would
be to compile pgsql with gcc-9.4 (or gcc-10.x, where it is on by default) with
-moutline-atomics, which emits both the traditional and the LSE code and
selects between them dynamically depending on the target machine.
(I have blogged about it in the MySQL context:
https://mysqlonarm.github.io/ARM-LSE-and-MySQL/)

3. The problem with the GCC approach is that many distros still don't ship
gcc 9.4 as their default compiler.
To use this approach:
* PGSQL will have to build its packages with gcc-9.4+ only, so that
they are compatible with all aarch64 machines
* but this still does not help the other users who build pgsql
with the standard distro-provided compiler (unless they upgrade the compiler).
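
To make the run-time selection in point 2 concrete, here is a minimal C
sketch of the kind of check involved. It is an illustration only, not code
from GCC or from any pgsql patch: the helper names demo_cpu_has_lse() and
demo_atomics_path() are hypothetical. It assumes Linux on aarch64, where the
kernel reports LSE support through the HWCAP_ATOMICS bit of AT_HWCAP; on any
other platform the sketch simply reports "no LSE".

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>           /* getauxval(), AT_HWCAP */
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1UL << 8) /* arm64 hwcap bit for LSE atomics */
#endif
#endif

/* Hypothetical helper: true if the running CPU advertises LSE atomics. */
static bool
demo_cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;               /* assumption: not aarch64 Linux, no LSE */
#endif
}

/* Hypothetical helper: names the atomics flavour a dispatcher would pick. */
static const char *
demo_atomics_path(void)
{
    return demo_cpu_has_lse() ? "lse" : "ll/sc";
}
```

With -moutline-atomics the equivalent decision is made inside the compiler's
runtime helpers, so the application itself does not need such a check; the
sketch only shows why a single binary can serve both ARM v8 and ARM v8.1
machines.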

--------------------

So given all the permutations and combinations, I think we could approach
the problem as follows:

* Enable use of CAS, as it is known to perform better than TAS.

* Even with LSE enabled, CAS continues to perform on par with or marginally
better than TAS.

* Add a patch to compile pgsql with -moutline-atomics if the GCC in use
supports it, so that code compatible with both variants is emitted and
selected dynamically.
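
For readers following the CAS vs TAS distinction, here is a minimal sketch
of the two lock-attempt styles using the GCC/clang __atomic builtins. It is
not the code from either patch discussed in this thread: demo_slock_t and
all demo_* functions are hypothetical names, and the memory orderings are an
assumption chosen for a plain acquire/release lock.

```c
#include <assert.h>

typedef int demo_slock_t;       /* hypothetical lock word: 0 = free, 1 = held */

/* TAS style: unconditionally swap in 1; acquired if the old value was 0. */
static inline int
demo_tas_try(volatile demo_slock_t *lock)
{
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) == 0;
}

/* CAS style: store 1 only if the lock word is currently 0. */
static inline int
demo_cas_try(volatile demo_slock_t *lock)
{
    demo_slock_t expected = 0;

    return __atomic_compare_exchange_n(lock, &expected, 1,
                                       0 /* strong CAS */,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Spin until the CAS succeeds, polling with plain loads in between. */
static inline void
demo_cas_lock(volatile demo_slock_t *lock)
{
    while (!demo_cas_try(lock))
    {
        while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0)
            ;                   /* wait until the lock looks free */
    }
}

/* Release: the caller must hold the lock. */
static inline void
demo_unlock(volatile demo_slock_t *lock)
{
    assert(__atomic_load_n(lock, __ATOMIC_RELAXED) == 1);
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

The difference that matters under contention is that a failed TAS attempt
still writes to the lock word, while a failed CAS attempt does not have to.
Which instructions the compiler emits for these builtins (exclusive
load/store pairs or LSE) depends on the -march and -moutline-atomics
settings discussed above.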

--------------------

Alexander,

We will surely benchmark using LSE on Kunpeng 920 and share the result.

I am a bit surprised to see things scale by 4-5x just by switching to
LSE.
(In my experience with LSE, in the MySQL context and in micro-benchmarks,
switching to LSE didn't show that great an improvement.)
Maybe some more hotspots (beyond s_lock) are getting addressed with the use
of LSE.

>
> Links
> 1.
> https://www.postgresql.org/message-id/CAB10pyamDkTFWU_BVGeEVmkc8%3DEhgCjr6QBk02SCdJtKpHkdFw%40mail.gmail.com
> 2.
> https://www.postgresql.org/message-id/CAPpHfdsKrh7c7P8-5eG-qW3VQobybbwqH%3DgL5Ck%2BdOES-gBbFg%40mail.gmail.com
>
> ------
> Regards,
> Alexander Korotkov
>

--
Regards,
Krunal Bauskar
