Re: Improving spin-lock implementation on ARM.

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Krunal Bauskar <krunalbauskar(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving spin-lock implementation on ARM.
Date: 2020-12-03 20:00:39
Message-ID: CAPpHfdtQmq1uLK0P3Bo0yErYvYA0gY_79Z0CR7PcmynLYyDViQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar <krunalbauskar(at)gmail(dot)com> wrote:
> 1. CAS patch (applied on the baseline)
> - Kunpeng: 10-45% improvement observed [1]
> - Graviton2: 30-50% improvement observed [2]

What does lower boundary of improvement mean? Does it mean minimal
improvement observed? Obviously not, because there is no improvement
with a low number of clients or read-only benchmark.

> - M1: Only select results are available cas continue to maintain a marginal gain but not significant. [3]

Read-only benchmark doesn't involve spinlocks (I've just rechecked
this). So, this difference is purely speculative.

Also, Tom observed the regression [1]. The benchmark is artificial,
but it may correspond to some real workload with heavily-loaded
spinlocks. And that might have an explanation. ldrex/strex
themselves don't work as memory barrier (this is why compiler adds
explicit memory barrier afterwards). And I bet ldrex unpaired with
strex could see an outdated value. On high-contended spinlocks that
may cause too pessimistic waits.

> 2. Let's ignore CAS for a sec and just think of LSE independently
> - Kunpeng: regression observed

Yeah, it's sad that there are ARM chips, where making the same action
in a single instruction is slower than the same actions in multiple
instructions.

> - Graviton2: gain observed

Have to say it's 5.5x improvement, which is dramatically more than
CAS-loop patch can give.

> - M1: regression observed
> [while lse probably is default explicitly enabling it with +lse causes regression on the head itself [4].
> client=2/4: 1816/714 ---- vs ---- 892/610]

This is plain false. LSE was enabled in all the versions tested in M1
[2]. So not LSE instructions or +lse option caused the regression,
but lack of other options enabled by default in Apple's clang.

> Let me know what do you think about this analysis and any specific direction that we should consider to help move forward.

In order to move forward with ARM-optimized spinlocks I would
investigate other options (at least [3]).

Regarding CAS spinlock patch I can propose the following steps to
clarify the things.
1. Run Tom's spinlock test on ARM machines other than M1.
2. Try to construct a pgbench script, which produces heavily-loaded
spinlock without custom C-function, and also run it across various ARM
machines.

Links
1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us
2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us
3. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

------
Regards,
Alexander Korotkov

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2020-12-03 20:32:48 Re: Github Actions (CI)
Previous Message Peter Eisentraut 2020-12-03 19:46:09 Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly