Lockless queue of waiters in LWLock

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Lockless queue of waiters in LWLock
Date: 2022-10-31 10:38:23
Message-ID: CALT9ZEEz+=Nepc5eti6x531q64Z6+DxtP3h-h_8O5HDdtkJcPw@mail.gmail.com
Lists: pgsql-hackers

Hi, hackers!

When we take an LWLock, we already use an atomic CAS operation to modify the
lock state even in the presence of concurrent lock-takers. But if we cannot
take the lock immediately, we need to put the waiter on a wait list, and
currently this is done neither atomically nor simply:
- Parts of it are done under LWLockWaitListLock, which may involve a spin
delay to acquire.
- Also, we need a two-step procedure to immediately dequeue if, after adding
the current process to the wait queue, it turns out that we don't need to
(and cannot!) sleep because the lock is already free (see the paraphrase
after this list).
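
For context, the slow path of the current lwlock.c roughly follows the
pattern below. This is a heavily simplified paraphrase, not the actual code
(error handling, interrupt holdoff, and the extraWaits bookkeeping are
omitted):

/* Simplified paraphrase of the LWLockAcquire() slow path in lwlock.c */
for (;;)
{
    if (!LWLockAttemptLock(lock, mode))
        break;                       /* got the lock immediately */

    /* Enqueue ourselves; takes LWLockWaitListLock() internally. */
    LWLockQueueSelf(lock, mode);

    /* The lock might have been released in the meantime; retry once. */
    if (!LWLockAttemptLock(lock, mode))
    {
        /* Undo the enqueue; takes LWLockWaitListLock() again. */
        LWLockDequeueSelf(lock);
        break;
    }

    /* Otherwise sleep until the releaser wakes us, then loop again. */
    PGSemaphoreLock(MyProc->sem);
}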

If the lock state contains references to the queue head and tail, we can
implement a lockless queue of waiters for the LWLock. Adding new items to
the queue head or tail can be done with a single CAS operation (adding to
the tail also requires subsequently setting the link from the previous
tail). Given that there can be only one lock releaser, which wakes up the
waiters in the queue, we can handle all the concurrency issues with
reasonable complexity.

Removing the queue spinlock bit and the corresponding contention should
give a performance gain in highly concurrent LWLock usage scenarios.

Currently, the maximum size of the procarray is 2^18 (as stated in
buf_internals.h), so with a 64-bit LWLock state variable we can store the
procarray indexes for both the head and the tail of the queue, keep all the
existing state machinery, and even have some spare bits for future flags.
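
To illustrate, a hypothetical layout of such a 64-bit state word could look
like the following. This is my own sketch for illustration; the field
positions and names are assumptions, not necessarily what the attached
patch uses:

#include <stdint.h>

/* 18 bits are enough to address 2^18 procarray entries. */
#define LW_QUEUE_INDEX_BITS   18
#define LW_QUEUE_INDEX_MASK   ((UINT64_C(1) << LW_QUEUE_INDEX_BITS) - 1)
#define LW_QUEUE_EMPTY        LW_QUEUE_INDEX_MASK   /* sentinel: no waiter */

#define LW_QUEUE_HEAD_SHIFT   46    /* bits 46..63: procarray index of queue head */
#define LW_QUEUE_TAIL_SHIFT   28    /* bits 28..45: procarray index of queue tail */
/* bits 0..27 remain for the lock counters and flag bits */

static inline uint32_t
lw_queue_head(uint64_t state)
{
    return (uint32_t) ((state >> LW_QUEUE_HEAD_SHIFT) & LW_QUEUE_INDEX_MASK);
}

static inline uint32_t
lw_queue_tail(uint64_t state)
{
    return (uint32_t) ((state >> LW_QUEUE_TAIL_SHIFT) & LW_QUEUE_INDEX_MASK);
}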

Thus we almost entirely avoid spinlocks, replacing them with repeated
attempts to change the lock state whenever the CAS operation fails due to a
concurrent state modification.
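
With that layout, the enqueue path reduces to a single CAS in a retry loop.
The sketch below uses the hypothetical helpers from the layout above and
PostgreSQL's pg_atomic_compare_exchange_u64(), and it assumes lock->state
has been widened to pg_atomic_uint64 as the patch proposes. It illustrates
the pattern only (it is not code from the patch) and omits the follow-up
step of linking the previous tail to the new waiter:

/* Illustrative only: enqueue the current backend as the new queue tail. */
static void
lw_enqueue_self(LWLock *lock, uint32 my_proc_index)
{
    uint64  old_state = pg_atomic_read_u64(&lock->state);

    for (;;)
    {
        uint64  new_state = old_state;

        /* Install ourselves as the new tail. */
        new_state &= ~(LW_QUEUE_INDEX_MASK << LW_QUEUE_TAIL_SHIFT);
        new_state |= (uint64) my_proc_index << LW_QUEUE_TAIL_SHIFT;

        /* If the queue was empty, we also become the head. */
        if (lw_queue_head(old_state) == LW_QUEUE_EMPTY)
        {
            new_state &= ~(LW_QUEUE_INDEX_MASK << LW_QUEUE_HEAD_SHIFT);
            new_state |= (uint64) my_proc_index << LW_QUEUE_HEAD_SHIFT;
        }

        /* Publish atomically; no spinlock taken at any point. */
        if (pg_atomic_compare_exchange_u64(&lock->state, &old_state, new_state))
            break;

        /*
         * CAS failed: old_state has been refreshed with the value some
         * concurrent backend installed, so just recompute and retry.
         */
    }

    /*
     * If the queue was not empty, the previous tail's next-link must still
     * be set to point at us (the extra step mentioned above); omitted here.
     */
}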

The attached patch implements the described approach. It is based on
Alexander Korotkov's work in the OrioleDB engine; I adapted those ideas
further with his guidance.

We did some preliminary performance checks and saw that concurrent
insert-only and txid_current pgbench tests with ~1000 connections show a
significant performance increase. The scripts for testing and the results
are attached. I ran the tests on a 32-vcore x86-64 AWS machine using tmpfs
as storage for the database to avoid random delays related to disk IO.

Recently, though, Andres proposed a different patch, which merely avoids
O(N) removal from the queue of waiters but improves performance even more
[1]. The results of comparing the master branch, the lockless queue
(current) patch, and Andres' patch are below. Please take into account that
the horizontal axis uses a log scale.

---------------------------------------
cat insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid,
:delta);

pgbench -d postgres -i -s 1 --unlogged-tables
echo -e "max_connections=2500\nmax_wal_senders=0\nwal_level=minimal\nmax_wal_size = 10GB\nshared_buffers = 20000MB\nautovacuum = off\nfsync = off\nfull_page_writes = off\nmax_worker_processes = 1024\nmax_parallel_workers = 1024\n" > ./pgdata$v/postgresql.auto.conf
psql -dpostgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;

for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 \
43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300 \
330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500 \
1600 1800 2000

do
psql -dpostgres -c "truncate pgbench_accounts;"
psql -dpostgres -c "vacuum full pgbench_accounts;"
pgbench postgres -f insert.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns
--no-vacuum | grep tps
done

--------------------------------------------------

cat txid.sql
select txid_current();

for conns in 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 20 22 24 27 30 33 36 39 \
43 47 51 56 62 68 75 82 91 100 110 120 130 150 160 180 200 220 240 270 300 \
330 360 390 430 470 510 560 620 680 750 820 910 1000 1100 1200 1300 1500 \
1600 1800 2000

do
pgbench postgres -f txid.sql -s 1 -P20 -M prepared -T 10 -j 5 -c $conns
--no-vacuum | grep tps
done

-----------------------------------------------------
I cannot understand why the lockless queue patch shows a minor performance
regression in the region of around 20 connections, even when compared to
the current master branch.

Are there scenarios where the lockless queue approach is superior? I
expected there should be, at least in theory. Probably there is a way to
improve the attached patch further to achieve that superiority.

Best regards,
Pavel Borisov,
Supabase.

[1]
https://www.postgresql.org/message-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Attachment Content-Type Size
v1-0001-Lockless-queue-of-LWLock-waiters.patch application/octet-stream 42.1 KB
