From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: SpinLockAcquire and SpinLockRelease is broken on ARM/ARM64? (Re: sinvaladt.c: remove msgnumLock, use atomic operations on maxMsgNum)
Date: 2025-06-02 18:20:33
Message-ID: 450dafae-b620-4726-abc2-53347c419394@postgrespro.ru
Lists: pgsql-hackers
Good day, Andres.
Good day, hackers.
05.05.2025 10:30, Yura Sokolov wrote:
> 21.03.2025 19:33, Andres Freund wrote:
>> Hi,
>>
>> On 2025-03-21 14:35:16 +0300, Yura Sokolov wrote:
>>> From 080c9e0de5e6e10751347e1ff50b65df424744cb Mon Sep 17 00:00:00 2001
>>> From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
>>> Date: Mon, 3 Feb 2025 11:58:33 +0300
>>> Subject: [PATCH v2] sinvaladt.c: use atomic operations on maxMsgNum
>>>
>>> msgnumLock spinlock could be highly contended.
>>> Comment states it was used as memory barrier.
>>> Lets use atomic ops with memory barriers directly instead.
>>>
>>> Note: patch uses pg_read_barrier()/pg_write_barrier() instead of
>>> pg_atomic_read_membarrier_u32()/pg_atomic_write_membarrier_u32() since
>>> no full barrier semantic is required, and explicit read/write barriers
>>> are cheaper at least on x86_64.
>>
>> Is it actually true that full barriers aren't required? I think we might
>> actually rely on a stronger ordering.
>>
>>
>>> @@ -506,10 +493,9 @@ SIGetDataEntries(SharedInvalidationMessage *data, int datasize)
>>> */
>>> stateP->hasMessages = false;
>>>
>>> - /* Fetch current value of maxMsgNum using spinlock */
>>> - SpinLockAcquire(&segP->msgnumLock);
>>> - max = segP->maxMsgNum;
>>> - SpinLockRelease(&segP->msgnumLock);
>>> + /* Fetch current value of maxMsgNum using atomic with memory barrier */
>>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>>> + pg_read_barrier();
>>>
>>> if (stateP->resetState)
>>> {
>>> /*
>>> * Force reset. We can say we have dealt with any messages added
>>> * since the reset, as well; and that means we should clear the
>>> * signaled flag, too.
>>> */
>>> stateP->nextMsgNum = max;
>>> stateP->resetState = false;
>>> stateP->signaled = false;
>>> LWLockRelease(SInvalReadLock);
>>> return -1;
>>> }
>>
>> vs
>>
>>> @@ -410,17 +398,16 @@ SIInsertDataEntries(const SharedInvalidationMessage *data, int n)
>>> /*
>>> * Insert new message(s) into proper slot of circular buffer
>>> */
>>> - max = segP->maxMsgNum;
>>> + max = pg_atomic_read_u32(&segP->maxMsgNum);
>>> while (nthistime-- > 0)
>>> {
>>> segP->buffer[max % MAXNUMMESSAGES] = *data++;
>>> max++;
>>> }
>>>
>>> - /* Update current value of maxMsgNum using spinlock */
>>> - SpinLockAcquire(&segP->msgnumLock);
>>> - segP->maxMsgNum = max;
>>> - SpinLockRelease(&segP->msgnumLock);
>>> + /* Update current value of maxMsgNum using atomic with memory barrier */
>>> + pg_write_barrier();
>>> + pg_atomic_write_u32(&segP->maxMsgNum, max);
>>>
>>> /*
>>> * Now that the maxMsgNum change is globally visible, we give everyone
>>> * a swift kick to make sure they read the newly added messages.
>>> * Releasing SInvalWriteLock will enforce a full memory barrier, so
>>> * these (unlocked) changes will be committed to memory before we exit
>>> * the function.
>>> */
>>> for (i = 0; i < segP->numProcs; i++)
>>> {
>>> ProcState *stateP = &segP->procState[segP->pgprocnos[i]];
>>>
>>> stateP->hasMessages = true;
>>> }
>>
>> On a loosely ordered architecture, the hasMessage=false in SIGetDataEntries()
>> could be reordered with the read of maxMsgNum. Which, afaict, would lead to
>> missing messages. That's not prevented by the pg_write_barrier() in
>> SIInsertDataEntries(). I think there may be other similar dangers.
>>
>> This could be solved by adding full memory barriers in a few places.
>
> It is quite interesting remark, because SpinLockAcquire may be implemented
> as `__sync_lock_test_and_set(lock, 1)` [1] on ARM/ARM64, and
> `__sync_lock_test_and_set` has only "acquire barrier" semantic as GCC's
> documentation says [2] . More over, `__sync_lock_release` used in this case
> also provides only "release barrier" semantic.
>
> Therefore, concerning these facts, current code state also doesn't give
> guarantee `stateP->hasMessage = false` will not be reordered with `max =
> segP->maxMsgNum`. Nor following read of messages are guaranteed to happen
> after `max = segP->maxMsgNum`.
>
> Given your expectations for SpinLockAcquire and SpinLockRelease to be full
> memory barriers, and probably numerous usages of them as such, their
> current implementation on ARM/ARM64 looks to be completely broken, imho.
I recognize my mistake and apologize for the noise.
I'll try to describe how I understand it now.
There are 4 operations performed in two threads:
1m. write maxMsgNum
1h. set hasMessages
2h. unset hasMessages
2m. read maxMsgNum
Synchronization should forbid the following orders of operations:
- 1h, 2h, 2m, 1m
- 1h, 2m, 1m, 2h
- 1h, 2m, 2h, 1m
- 2m, 1m, 1h, 2h
- 2m, 1h, 1m, 2h
- 2m, 1h, 2h, 1m
In other words: if the read of maxMsgNum (2m) happens before the write of
maxMsgNum (1m), then the unset of hasMessages (2h) must not happen after the
set of hasMessages (1h).
Given that 1m and 2m are wrapped in the spin lock:
- while 1h could be reordered before 1m (since SpinLockRelease is just a
release barrier), it cannot be reordered before SpinLockAcquire (because that
is an acquire barrier);
- in the same way, while 2h could be reordered after 2m, it cannot be
reordered past SpinLockRelease.
Marking the spin lock operations explicitly (1a(...1r) for the writer's
acquire/release pair, 2a{...2r} for the reader's), the wrong orders above
look like:
- 1a( 1h, 2a{ 2h, 2m 2r}, 1m 1r)
- 1a( 1h, 2a{ 2m, 1m 1r), 2m 2r}
- 1a( 1h, 2a{ 2m, 2h 2r}, 1m 1r)
- 2a{ 2m, 1a( 1m, 1h 1r), 2h 2r}
- 2a{ 2m, 1a( 1h, 1m 1r), 2h 2r}
- 2a{ 2m, 1a( 1h, 2h 2r}, 1m 1r)
Therefore it is clear that all the wrong orders are forbidden by the spin
lock, even though SpinLockAcquire and SpinLockRelease provide only acquire
and release semantics respectively.
Excuse me once again for misunderstanding this the first time.
Still, the problem of spin lock contention remains. And we found another
place with a similar problem: numerous readers fight over a single slock_t,
slowing down both themselves and the writer. It is
XLogRecoveryCtlData->info_lck: when there are many logical/physical
walsenders, it can become a noticeable bottleneck.
Part of this bottleneck was due to an incorrect logical WAL senders target
and was fixed in commit 5231ed82 [1]. But there is still a measurable speedup
from mitigating spin lock contention in this place.
So I propose to introduce another spin lock type, capable of Exclusive and
Shared lock modes (i.e. Write/Read modes), and to use it in these two places.
I've tried to implement it with a bit of fairness: both reader and writer
attempters have a "waiter" state, which blocks the other side. I.e. if a
reader enters the "waiter" state, it blocks concurrent writers (both fresh
ones and those in the "waiter" state). But if a writer enters the "waiter"
state, it blocks fresh readers (while not blocking readers that have already
entered the "waiter" state).
The algorithm is self-made; I didn't find exactly the same algorithm in the
wild.
The patch is attached.
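To give an idea of the shape of such a lock (this is NOT the attached patch,
just a minimal sketch of mine on top of port/atomics.h; the names are
illustrative, only the writer-side "waiter" bit is shown, and the reader-side
"waiter" state described above is omitted):

#include "postgres.h"
#include "port/atomics.h"

#define RWSPIN_WRITER       (1u << 31)                 /* writer holds the lock */
#define RWSPIN_WRITER_WAIT  (1u << 30)                 /* a writer is waiting */
#define RWSPIN_READER_MASK  (RWSPIN_WRITER_WAIT - 1)   /* count of readers */

typedef struct RWSpinLock
{
    pg_atomic_uint32 state;
} RWSpinLock;

static void
RWSpinInit(RWSpinLock *lock)
{
    pg_atomic_init_u32(&lock->state, 0);
}

static void
RWSpinReadAcquire(RWSpinLock *lock)
{
    for (;;)
    {
        uint32      v = pg_atomic_read_u32(&lock->state);

        /* fresh readers back off while a writer holds or waits for the lock */
        if ((v & (RWSPIN_WRITER | RWSPIN_WRITER_WAIT)) == 0 &&
            pg_atomic_compare_exchange_u32(&lock->state, &v, v + 1))
            return;
        /* real code would use perform_spin_delay() here */
    }
}

static void
RWSpinReadRelease(RWSpinLock *lock)
{
    pg_atomic_fetch_sub_u32(&lock->state, 1);
}

static void
RWSpinWriteAcquire(RWSpinLock *lock)
{
    /* announce intent so that new readers stop entering */
    pg_atomic_fetch_or_u32(&lock->state, RWSPIN_WRITER_WAIT);

    for (;;)
    {
        uint32      v = pg_atomic_read_u32(&lock->state);

        /* wait until all readers drained and no other writer holds the lock */
        if ((v & (RWSPIN_WRITER | RWSPIN_READER_MASK)) == 0 &&
            pg_atomic_compare_exchange_u32(&lock->state, &v, RWSPIN_WRITER))
            return;
        /* note: with several waiting writers the wait bit needs more care;
         * the attached patch handles fairness differently */
    }
}

static void
RWSpinWriteRelease(RWSpinLock *lock)
{
    pg_atomic_fetch_and_u32(&lock->state, ~RWSPIN_WRITER);
}

All the atomic read-modify-write operations above are full barriers, so the
sketch keeps the SpinLock-like ordering guarantees discussed earlier.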
PS. Another possibility would be to use a versioned optimistic lock:
- the lock is implemented as a pg_atomic_uint64 "version" counter;
- the writer increments it both in LockWrite and in UnlockWrite, so the lock
is write-locked whenever the version is odd;
- the reader performs the following loop:
-- wait until the version is even and remember it,
-- perform the read operations,
-- if the version didn't change, the read succeeded; otherwise repeat the loop.
The benefit of this approach: the reader doesn't write to shared memory at all.
But it has to use memory barriers before and after the read section to
imitate the SpinLock semantics. And since we have no separate "acquire" and
"release" memory fences, those have to be pg_memory_barrier(). Or we should
introduce "acq/rel" fences.
Another gotcha is that only simple read operations are allowed, since the
data may change while it is being read (both use cases found so far satisfy
this condition). And the read operations are repeated in a loop until the
version doesn't change around them, so some kind of syntactic sugar should be
introduced.
This is a well-known approach, but I don't remember its exact name and could
not google it quickly.
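For illustration, a minimal sketch of that versioned scheme (again not a
patch, just my sketch; names are mine, writer-vs-writer exclusion is done by
the CAS loop, and the reader uses pg_memory_barrier() as described above):

#include "postgres.h"
#include "port/atomics.h"

typedef struct VersionLock
{
    pg_atomic_uint64 version;   /* odd => write in progress */
} VersionLock;

static void
VersionWriteLock(VersionLock *lock)
{
    for (;;)
    {
        uint64      v = pg_atomic_read_u64(&lock->version);

        /* grab the lock only while the version is even (not write-locked);
         * the CAS is a full barrier, so protected writes cannot float above it */
        if ((v & 1) == 0 &&
            pg_atomic_compare_exchange_u64(&lock->version, &v, v + 1))
            return;
        /* real code would use a spin delay here */
    }
}

static void
VersionWriteUnlock(VersionLock *lock)
{
    /* fetch_add is a full barrier: protected writes complete before the bump */
    pg_atomic_fetch_add_u64(&lock->version, 1);     /* version is even again */
}

static uint64
VersionReadBegin(VersionLock *lock)
{
    uint64      v;

    /* wait for an even (unlocked) version */
    while (((v = pg_atomic_read_u64(&lock->version)) & 1) != 0)
        ;                       /* real code would use a spin delay here */
    pg_memory_barrier();        /* protected reads must not float above this */
    return v;
}

static bool
VersionReadRetry(VersionLock *lock, uint64 v)
{
    pg_memory_barrier();        /* protected reads must finish before the re-check */
    return pg_atomic_read_u64(&lock->version) != v;
}

A reader would then do something like:

    uint64      v;

    do
    {
        v = VersionReadBegin(&lock);
        snapshot = protected_struct;    /* only plain reads of protected fields */
    } while (VersionReadRetry(&lock, v));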
--
regards
Yura Sokolov aka funny-falcon
Attachment: v4-0001-Exclusive-Shared-Read-Write-spin-lock.patch (text/x-patch, 17.8 KB)