Re: Lockless queue of waiters in LWLock

From: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Lockless queue of waiters in LWLock
Date: 2022-11-11 11:39:05
Message-ID: CALT9ZEHSX1Hpz5xjDA62yHAHtpinkA6hg8Zt-odyxqppmKbQFA@mail.gmail.com
Lists: pgsql-hackers

Hi, Andres, and Alexander!

I've done some more measurements to check the hypotheses regarding the
performance of the previous patch (v2) and to explain the test results
in [1].

The results below are the same (tps vs connections) plots as in [1],
and the test is identical to the insert test in this thread [2].
Additionally, in each case, there is a plot with results relative to
Andres Freund's patch [3]. Log plots are good for seeing details in
the range of 20-30 connections, but they somewhat hide the fact that
the effect in the range of 500+ connections is much more significant
overall, so I'd recommend looking at the linear plots as well.

The particular tests:

1. Whether the cost of a CAS operation, compared to an atomic
subtraction, affects performance.
The patch (see the attached
atomic-sub-instead-of-CAS-for-shared-lockers-without.patch) uses an
atomic subtraction to change the LWLock state variable when releasing a
shared lock that has no waiting queue.
The results are attached (see atomic-sub-inverted-queue-*.jpg).
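For illustration, here is a minimal standalone sketch (C11 atomics, with
a made-up bit layout and function names, not the patch's actual code) of
the difference being tested: releasing a shared lock with a CAS loop
versus with a single atomic subtraction:

#include <stdatomic.h>
#include <stdbool.h>

#define SHARED_ONE       1u            /* one shared-lock reference */
#define HAS_WAITERS_FLAG (1u << 30)    /* hypothetical "queue non-empty" bit */

/* CAS-loop release: re-read the state and retry until the swap succeeds. */
static void
release_shared_cas(atomic_uint *state)
{
    unsigned int old = atomic_load(state);

    /* on failure 'old' is refreshed with the current value, so we retry */
    while (!atomic_compare_exchange_weak(state, &old, old - SHARED_ONE))
        ;
}

/*
 * Atomic-sub release: when no waiters are queued, only the shared-reference
 * count has to drop, so one fetch-sub is enough and no retry loop is needed.
 * (A real implementation must still handle a waiter queueing concurrently;
 * this sketch glosses over that.)
 */
static bool
release_shared_sub(atomic_uint *state)
{
    if (atomic_load(state) & HAS_WAITERS_FLAG)
        return false;                  /* fall back to the CAS/slow path */

    atomic_fetch_sub(state, SHARED_ONE);
    return true;
}

The point of the comparison is that the fetch-sub never loops under
contention, while the CAS path can; as the plots show, this doesn't make
a visible difference here.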

2. Whether sending wakeup signals in inverted order affects performance.
In patch v2 the most recently queued shared lockers are woken up first
(though they all get their wakeup signals in the same wakeup pass anyway).
The patch (see the attached Invert-wakeup-queue-order.patch) inverts the
wakeup queue of lockers, so those that would otherwise be woken last now
come first.
The results are attached to the same plots as for the previous test (see
atomic-sub-inverted-queue-*.jpg).
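As a rough sketch of what inverting the wake order means (hypothetical
types, not the patch's code): the collected wake list is a singly linked
list, and reversing it in place before posting the semaphores flips the
order in which the waiters are signalled:

#include <stddef.h>

typedef struct Waiter
{
    struct Waiter *next;
    int            pgprocno;      /* stand-in for the waiting backend */
} Waiter;

/* Reverse the wake list in place and return the new head. */
static Waiter *
invert_wake_queue(Waiter *head)
{
    Waiter *prev = NULL;

    while (head != NULL)
    {
        Waiter *next = head->next;

        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

Either way, all the shared lockers get their wakeup signals in the same
pass, which is consistent with the wake order having no visible effect.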

It seems that 1 and 2 don't have a visible effect on performance.

3. In the original master patch, lock acquisition is attempted twice,
separated by a cheap spin delay, before waiting on the more expensive
semaphore. This might explain why the proposed lockless queue patch [1]
shows a small performance degradation around 20-30 connections. Although
we don't need two-step lock-taking anymore, I've tried adding it to see
whether it can improve performance. I've tried different initial spin
delays, from 1 microsecond to 100 milliseconds, and 1 or 2 retries in
case the lock is still busy (see the attached
Extra-spin-waits-for-the-lock-to-be-freed-by-concurr.patch).
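The shape of the change is roughly the following (a standalone sketch
with made-up helper names and an atomic_flag standing in for the LWLock
state; the real patch uses the LWLock fast path and semaphore wait):

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>                 /* usleep() as a stand-in for pg_usleep() */

#define SPIN_RETRIES   2            /* e.g. the "2 retries" variant */
#define SPIN_DELAY_US  500          /* e.g. the "500 microseconds" variant */

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* stand-in for the lock */

/* Cheap fast-path attempt: a single atomic test-and-set. */
static bool
try_lock_fast(void)
{
    return !atomic_flag_test_and_set(&lock_word);
}

/* Placeholder for the expensive path: queue up and sleep on a semaphore. */
static void
queue_and_sleep(void)
{
    while (!try_lock_fast())
        usleep(1000);
}

static void
acquire_with_spin_retries(void)
{
    if (try_lock_fast())
        return;

    /* retry the cheap attempt a few times before the expensive sleep */
    for (int i = 0; i < SPIN_RETRIES; i++)
    {
        usleep(SPIN_DELAY_US);
        if (try_lock_fast())
            return;
    }

    queue_and_sleep();
}

The 2 retries x 500 microseconds combination shown above is the one that
ended up in the attached v3 below.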

The results attached (see add-spin-delays-*.jpg) are interesting.
Indeed, a second attempt to take the lock after a spin delay increases
performance with any combination of delays and retries. Also, the delay
and the number of retries act in opposite directions in the 20-30
connection and 500+ connection regions. So we can choose either a more
even performance gain at any number of connections (e.g. 2 retries x 500
microseconds) or better performance at 500+ connections with the same
performance as Andres's patch around 20-30 connections (e.g. 1 retry x
100 milliseconds).

4. I've also collected some statistics on the overall (summed over all
backends) number and duration of spin delays in the same test on Andres
Freund's patch, gathered using a slightly modified LWLOCK_STATS
mechanism.

conns   total spin delay, s   number of spin delays   avg spin delay, ms
   20                     0                       0                    0
  100                   1.9                    1177                  1.6
  200                  21.9                   14833                  1.5
 1000                  12.9                   11909                  1.1
 2000                   3.2                    2297                  1.4
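The accounting behind these numbers is simple; roughly (hypothetical
names, the actual numbers come from a slightly modified LWLOCK_STATS):
each backend counts its spin delays and accumulates their duration, the
per-backend counters are summed at the end, and the average column is
just the total time divided by the count:

#include <stdint.h>
#include <time.h>

typedef struct SpinDelayStats
{
    uint64_t delay_count;        /* number of spin delays taken */
    uint64_t delay_total_ns;     /* time spent in spin delays */
} SpinDelayStats;

static SpinDelayStats spin_stats;   /* one per backend in the real setup */

static uint64_t
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
}

/* Bracket each spin delay with these two calls. */
static uint64_t delay_started_ns;

static void
spin_delay_begin(void)
{
    delay_started_ns = now_ns();
}

static void
spin_delay_end(void)
{
    spin_stats.delay_count++;
    spin_stats.delay_total_ns += now_ns() - delay_started_ns;
}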

I've attached v3 of the lockless queue patch with the 2 retries x 500
microseconds spin delay added, which makes the test results superior to
the existing patch, leaves no performance degradation, and gives a
steady performance gain over the whole range of connections. But it's
surely worth discussing which parameters we want to have in the
production patch.

I'm also planning to do the same tests on an ARM server when a free one
becomes available to me.
Thoughts?

Regards,
Pavel Borisov.

[1] https://www.postgresql.org/message-id/CALT9ZEF7q%2BSarz1MjrX-fM7OsoU7CK16%3DONoGCOkY3Efj%2BGrnw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALT9ZEEz%2B%3DNepc5eti6x531q64Z6%2BDxtP3h-h_8O5HDdtkJcPw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/20221031235114.ftjkife57zil7ryw%40awork3.anarazel.de

Attachment Content-Type Size
atomic-sub-inverted-queue-log.jpg image/jpeg 273.3 KB
add-spin-delays-log.jpg image/jpeg 318.7 KB
atomic-sub-inverted-queue-linear.jpg image/jpeg 213.9 KB
add-spin-delays-linear.jpg image/jpeg 343.1 KB
v3-0001-Lockless-queue-of-LWLock-waiters.patch application/octet-stream 44.2 KB
Extra-spin-waits-for-the-lock-to-be-freed-by-concurr.patch application/octet-stream 2.3 KB
Invert-wakeup-queue-order.patch application/octet-stream 3.2 KB
atomic-sub-instead-of-CAS-for-shared-lockers-without.patch application/octet-stream 1.7 KB
