Re: Improving spin-lock implementation on ARM.

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Krunal Bauskar <krunalbauskar(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving spin-lock implementation on ARM.
Date: 2020-11-26 11:31:29
Message-ID: CAPpHfdvTd_VUWC+iK+qWtw14v9Hi6pTSfzaCXNdJoDoung7=GQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > ------------------------------------------------------------
> >
> > * Spin-Lock is known to have a significant effect on performance
> > with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> > use of TAS (test and swap)
> >
> > * TAS is implemented on ARM as load-store so even if the lock is not free,
> > store operation will execute to replace the same value.
> > This redundant operation (mainly store) is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check that means if the
> > lock is not free, check operation, post-load will cause the loop to
> > return there-by saving on costlier store operation. [1]
>
> Can you add some code comments to explain that why CAS is cheaper than
> TAS on ARM?
>
> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.

Let me add my 2 cents.

I've compared assembly output of gcc implementations of CAS and TAS.
The sample C-program is attached. I've compiled it on raspberry pi 4
using gcc 9.3.0.

The inner loop of CAS is as follows. So, if the value loaded by ldaxr
doesn't match expected value, then we immediately quit the loop.

.L3:
ldxr w3, [x0]
cmp w3, w1
bne .L4
stlxr w4, w2, [x0]
cbnz w4, .L3
.L4:

The inner loop of TAS is as follows. So it really does "stxr"
unconditionally. In principle, it could check if a new value matches
the observed value and there is nothing to do, but it doesn't.
Moreover, stxr might fail, then we can spend multiple loops of
ldxr/stxr due to concurrent memory access. AFAIK, those concurrent
accesses could reflect not only lock release, but also other
unsuccessful lock attempts. So, good chance for extra loops to be
useless.

.L7:
ldxr w2, [x0]
stxr w3, w1, [x0]
cbnz w3, .L7

I've also googled for spinlock implementation on arm and found a blog
post about spinlock implementation in linux kernel [1]. Surprisingly
it doesn't use the trick to skip stxr if the lock is busy. Instead,
they use some arm-specific power-saving option called WFE.

So, I think it's quite clear that switching from TAS to CAS on arm
would be a win. But there could be other options to do this even
better.

Links
1. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

------
Regards,
Alexander Korotkov

Attachment Content-Type Size
test.c application/octet-stream 319 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-11-26 11:36:19 Re: Multi Inserts in CREATE TABLE AS - revived patch
Previous Message Bharath Rupireddy 2020-11-26 11:06:00 Re: Multi Inserts in CREATE TABLE AS - revived patch