Re: WAL Insertion Lock Improvements

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL Insertion Lock Improvements
Date: 2023-05-08 12:27:09
Message-ID: CALj2ACWGHGJ0pCsuVqJ4QkEfv8D_69rZL0FUK7TFfukaOcf7YQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 10, 2023 at 9:38 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> I have been trying to get some reproducible numbers, but I think that
> I am going to need a bigger maching than what I have been using for
> the last few days, up to 400 connections. It is worth noting that
> 00d1e02b may influence a bit the results, so we may want to have more
> numbers with that in place particularly with INSERTs, and one of the
> tests used upthread uses single row INSERTs.

I ran performance tests on the patch with different use-cases. The
patch clearly reduces the burden on the LWLock's waitlist lock
(evident from the perf reports [1]). However, to see a visible impact
in the output, the txns must be generating small amounts of WAL
(between 16 bytes and 2 KB) in a highly concurrent manner; see the
results below. FWIW, I've zipped and attached the perf images along
with the test setup for better illustration.

When the txns generate a small amount of WAL, i.e. between 16 bytes
and 2 KB, in a highly concurrent manner, the benefit is clearly
visible in the TPS: more than a 2.3X improvement. When the txns
generate more WAL, i.e. more than 2 KB, the gain from the reduced
burden on the waitlist lock is offset by the increased wait/release
work for the WAL insertion locks, and no visible benefit is seen. In
other words, as the amount of WAL each txn generates increases, the
benefit from the reduced waitlist lock contention is eaten up by the
longer waits for the WAL insertion locks.

Note that I've used pg_logical_emit_message() so that the amount of
WAL each txn generates is easy to control and reason about, but the
pattern is the same if the txns generate similar amounts of WAL with,
say, INSERTs.
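
For illustration, here's a minimal sketch of the kind of pgbench
custom script used for these runs (the exact scripts and settings are
in the attached TestArtifacts.zip; the file name and the :payload
variable below are only placeholders):

    -- emit-wal.sql: each txn emits one WAL record of roughly :payload bytes
    SELECT pg_logical_emit_message(true, 'test', repeat('a', :payload));

    pgbench -n -c 256 -j 256 -T 5 -D payload=16 -f emit-wal.sql postgres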

test-case 1: -T5, WAL ~16 bytes
clients HEAD PATCHED
1 1437 1352
2 1376 1419
4 2919 2774
8 5875 6371
16 11148 12242
32 22108 23532
64 41414 46478
128 85304 85235
256 83771 152901
512 61970 141021
768 56514 118899
1024 51784 110960
2048 39141 84150
4096 16901 45759

test-case 1: -t1000, WAL ~16 bytes
clients HEAD PATCHED
1 1417 1333
2 1363 1791
4 2978 2970
8 5954 6198
16 11179 11164
32 23742 24043
64 45537 44103
128 84683 91762
256 80369 146293
512 61080 132079
768 57236 118046
1024 53497 114574
2048 46423 93588
4096 42067 85790

test-case 2: -T5, WAL ~256 bytes
clients HEAD PATCHED
1 1521 1386
2 1647 1637
4 3088 3270
8 6011 5631
16 12778 10317
32 24117 20006
64 43966 38199
128 72660 67936
256 93096 121261
512 57247 142418
768 53782 126218
1024 50279 109153
2048 35109 91602
4096 21184 39848

test-case 2: -t1000, WAL ~256 bytes
clients HEAD PATCHED
1 1265 1389
2 1522 1258
4 2802 2775
8 5875 5422
16 11664 10853
32 21961 22145
64 44304 40851
128 73278 80494
256 91172 122287
512 60966 136734
768 56590 125050
1024 52481 124341
2048 47878 104760
4096 42838 94121

test-case 3: -T5, WAL 512 bytes
clients HEAD PATCHED
1 1464 1284
2 1520 1381
4 2985 2877
8 6237 5261
16 11296 10621
32 22257 20789
64 40548 37243
128 66507 59891
256 92516 97506
512 56404 119716
768 51127 112482
1024 48463 103484
2048 38079 81424
4096 18977 40942

test-case 3: -t1000, WAL 512 bytes
clients HEAD PATCHED
1 1452 1434
2 1604 1649
4 3051 2971
8 5967 5650
16 10471 10702
32 20257 20899
64 39412 36750
128 62767 61110
256 81050 89768
512 56888 122786
768 51238 114444
1024 48972 106867
2048 43451 98847
4096 40018 111079

test-case 4: -T5, WAL 1024 bytes
clients HEAD PATCHED
1 1405 1395
2 1638 1607
4 3176 3207
8 6271 6024
16 11653 11103
32 20530 20260
64 34313 32367
128 55939 52079
256 74355 76420
512 56506 90983
768 50088 100410
1024 44589 99025
2048 39640 90931
4096 20942 36035

test-case 4: -t1000, WAL 1024 bytes
clients HEAD PATCHED
1 1330 1304
2 1615 1366
4 3117 2667
8 6179 5390
16 10524 10426
32 19819 18620
64 34844 29731
128 52180 48869
256 73284 71396
512 55714 96014
768 49336 108100
1024 46113 102789
2048 44627 104721
4096 44979 106189

test-case 5: -T5, WAL 2048 bytes
clients HEAD PATCHED
1 1407 1377
2 1518 1559
4 2589 2870
8 4883 5493
16 9075 9201
32 15957 16295
64 27471 25029
128 37493 38642
256 46369 45787
512 61755 62836
768 59144 68419
1024 52495 68933
2048 48608 72500
4096 26463 61252

test-case 5: -t1000, WAL 2048 bytes
clients HEAD PATCHED
1 1289 1366
2 1489 1628
4 2960 3036
8 5536 5965
16 9248 10399
32 15770 18140
64 27626 27800
128 36817 39483
256 48533 52105
512 64453 64007
768 59146 64160
1024 57637 61756
2048 59063 62109
4096 58268 61206

test-case 6: -T5, WAL 4096 bytes
clients HEAD PATCHED
1 1322 1325
2 1504 1551
4 2811 2880
8 5330 5159
16 8625 8315
32 12820 13534
64 19737 19965
128 26298 24633
256 34630 29939
512 34382 36669
768 33421 33316
1024 33525 32821
2048 37053 37752
4096 37334 39114

test-case 6: -t1000, WAL 4096 bytes
clients HEAD PATCHED
1 1212 1371
2 1383 1566
4 2858 2967
8 5092 5035
16 8233 8486
32 13353 13678
64 19052 20072
128 24803 24726
256 34065 33139
512 31590 32029
768 31432 31404
1024 31357 31366
2048 31465 31508
4096 32157 32180

test-case 7: -T5, WAL 8192 bytes
clients HEAD PATCHED
1 1287 1233
2 1552 1521
4 2658 2617
8 4680 4532
16 6732 7110
32 9649 9198
64 13276 12042
128 17100 17187
256 17408 17448
512 16595 16358
768 16599 16500
1024 16975 17300
2048 19073 19137
4096 21368 21735

test-case 7: -t1000, WAL 8192 bytes
clients HEAD PATCHED
1 1144 1190
2 1414 1395
4 2618 2438
8 4645 4485
16 6766 7001
32 9620 9804
64 12943 13023
128 15904 17148
256 16645 16035
512 15800 15796
768 15788 15810
1024 15814 15817
2048 17775 17771
4096 31715 31682

> Looking at the patch.. LWLockConflictsWithVar() and
> LWLockReleaseClearVar() are the trivial bits. These are OK.

Hm, the crux of the patch is avoiding the LWLock's waitlist lock for
reading/writing the lock variable, so these are essentially the
important bits.
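
To make that concrete, here's a rough before/after sketch of the
reader side (LWLockConflictsWithVar); this shows the shape of the
change, not the exact v6 hunk:

    /* HEAD: the variable is read under the LWLock's waitlist lock */
    LWLockWaitListLock(lock);
    value = *valptr;
    LWLockWaitListUnlock(lock);

    /* PATCHED: valptr is a pg_atomic_uint64, so it can be read lock-free */
    value = pg_atomic_read_u64(valptr);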

> + * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
> + * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
> + * a full barrier, we're guaranteed that the subsequent shared memory
> + * reads/writes, if any, happen after we reset the lock variable.
>
> This mentions that the subsequent read/write operations are safe, so
> this refers to anything happening after the variable is reset. As
> a full barrier, should be also mention that this is also ordered with
> respect to anything that the caller did before clearing the variable?
> From this perspective using pg_atomic_exchange_u64() makes sense to me
> in LWLockReleaseClearVar().

Wordsmithed that comment a bit.
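
For reference, with the patch the release path (LWLockReleaseClearVar)
boils down to roughly the following; the comment's point is the
ordering guarantee the exchange provides:

    /*
     * pg_atomic_exchange_u64 is a full barrier: stores made while holding
     * the lock cannot be reordered past the reset of the variable, and
     * subsequent shared memory reads/writes happen after the reset.
     */
    pg_atomic_exchange_u64(valptr, val);
    LWLockRelease(lock);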

> + * XXX: Use of a spinlock at the beginning of this function to read
> + * current insert position implies memory ordering. That means that
> + * the immediate loads and stores to shared memory (for instance,
> + * in LWLockUpdateVar called via LWLockWaitForVar) don't need an
> + * explicit memory barrier as far as the current usage is
> + * concerned. But that might not be safe in general.
> */
> What's the part where this is not safe? Based on what I see, this
> code path is safe because of the previous spinlock. This is the same
> comment as at the beginning of LWLockConflictsWithVar(). Is that
> something that we ought to document at the top of LWLockWaitForVar()
> as well? We have one caller of this function currently, but there may
> be more in the future.

'But that might not be safe in general' applies only to
LWLockWaitForVar, not to WaitXLogInsertionsToFinish. My bad.

If another caller of LWLockWaitForVar shows up that doesn't take a
spinlock beforehand, that's when LWLockWaitForVar needs an explicit
memory barrier.

Per a comment upthread
(https://www.postgresql.org/message-id/20221205183007.s72oygp63s43dqyz%40awork3.anarazel.de),
I had added a note in WaitXLogInsertionsToFinish before the
LWLockWaitForVar call. I have now modified that comment.
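
To spell that out, the only current caller reads the insert position
under a spinlock just before calling LWLockWaitForVar, roughly like
this (per WaitXLogInsertionsToFinish in xlog.c):

    /* Read the current insert position under the spinlock ... */
    SpinLockAcquire(&Insert->insertpos_lck);
    bytepos = Insert->CurrBytePos;
    SpinLockRelease(&Insert->insertpos_lck);
    reservedUpto = XLogBytePosToEndRecPtr(bytepos);

    /*
     * ... the spinlock acquire/release above already implies the memory
     * ordering, so the atomic read of insertingAt inside LWLockWaitForVar
     * needs no extra barrier for this caller.
     */
    LWLockWaitForVar(&WALInsertLocks[i].l.lock,
                     &WALInsertLocks[i].l.insertingAt,
                     insertingat, &insertingat);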

> - * you're about to write out.
> + * you're about to write out. Using an atomic variable for insertingAt avoids
> + * taking any explicit lock for reads and writes.
>
> Hmm. Not sure that we need to comment at all.

Removed. I was being verbose; anyone who understands pg_atomic_uint64
can get to that point easily.

> -LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
> +LWLockUpdateVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
> [...]
> Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
>
> - /* Update the lock's value */
> - *valptr = val;
>
> The sensitive change is in LWLockUpdateVar(). I am not completely
> sure to understand this removal, though. Does that influence the case
> where there are waiters?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

> Another thing I was wondering about: how much does the fast-path used
> in LWLockUpdateVar() influence the performance numbers? Am I right to
> guess that it counts for most of the gain seen?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

> Or could it be that
> the removal of the spin lock in
> LWLockConflictsWithVar()/LWLockWaitForVar() the point that has the
> highest effect?

I'll cover this in a follow-up email so as not to overload this
response with too much data.

I've addressed the above review comments and attached the v6 patch.

[1]
test-case 1: -T5, WAL ~16 bytes HEAD:
+ 81.52% 0.03% postgres [.] __vstrfmon_l_internal
+ 81.52% 0.00% postgres [.] startup_hacks
+ 81.52% 0.00% postgres [.] PostmasterMain
+ 63.95% 1.01% postgres [.] LWLockWaitListLock
+ 61.93% 0.02% postgres [.] WaitXLogInsertionsToFinish
+ 61.89% 0.05% postgres [.] LWLockWaitForVar
+ 48.83% 48.33% postgres [.] pg_atomic_read_u32_impl
+ 48.78% 0.40% postgres [.] pg_atomic_read_u32
+ 43.19% 0.12% postgres [.] LWLockConflictsWithVar
+ 19.81% 0.01% postgres [.] LWLockQueueSelf
+ 7.86% 2.46% postgres [.] perform_spin_delay
+ 6.14% 6.06% postgres [.] spin_delay
+ 5.82% 0.01% postgres [.] pg_atomic_fetch_or_u32
+ 5.81% 5.76% postgres [.] pg_atomic_fetch_or_u32_impl
+ 4.00% 0.01% postgres [.] XLogInsert
+ 3.93% 0.03% postgres [.] XLogInsertRecord
+ 2.13% 0.02% postgres [.] LWLockRelease
+ 2.10% 0.03% postgres [.] LWLockAcquire
+ 1.92% 0.00% postgres [.] LWLockDequeueSelf
+ 1.87% 0.01% postgres [.] WALInsertLockAcquire
+ 1.68% 0.04% postgres [.] LWLockAcquireOrWait
+ 1.64% 0.01% postgres [.] pg_analyze_and_rewrite_fixedparams
+ 1.62% 0.00% postgres [.] WALInsertLockRelease
+ 1.62% 0.00% postgres [.] LWLockReleaseClearVar
+ 1.55% 0.01% postgres [.] parse_analyze_fixedparams
+ 1.51% 0.00% postgres [.] transformTopLevelStmt
+ 1.50% 0.00% postgres [.] transformOptionalSelectInto
+ 1.50% 0.01% postgres [.] transformStmt
+ 1.47% 0.02% postgres [.] transformSelectStmt
+ 1.29% 0.01% postgres [.] XactLogCommitRecord

test-case 1: -T5, WAL ~16 bytes PATCHED:
+ 74.49% 0.04% postgres [.] __vstrfmon_l_internal
+ 74.49% 0.00% postgres [.] startup_hacks
+ 74.49% 0.00% postgres [.] PostmasterMain
+ 51.60% 0.01% postgres [.] finish_xact_command
+ 51.60% 0.02% postgres [.] CommitTransactionCommand
+ 51.37% 0.03% postgres [.] CommitTransaction
+ 49.43% 0.05% postgres [.] RecordTransactionCommit
+ 46.55% 0.05% postgres [.] XLogFlush
+ 46.37% 0.85% postgres [.] LWLockWaitListLock
+ 43.79% 0.02% postgres [.] LWLockQueueSelf
+ 38.87% 0.03% postgres [.] WaitXLogInsertionsToFinish
+ 38.79% 0.11% postgres [.] LWLockWaitForVar
+ 34.99% 34.49% postgres [.] pg_atomic_read_u32_impl
+ 34.93% 0.35% postgres [.] pg_atomic_read_u32
+ 6.99% 2.12% postgres [.] perform_spin_delay
+ 6.64% 0.01% postgres [.] XLogInsert
+ 6.54% 0.06% postgres [.] XLogInsertRecord
+ 6.26% 0.08% postgres [.] LWLockAcquireOrWait
+ 5.31% 5.22% postgres [.] spin_delay
+ 4.23% 0.04% postgres [.] LWLockRelease
+ 3.74% 0.01% postgres [.] pg_atomic_fetch_or_u32
+ 3.73% 3.68% postgres [.] pg_atomic_fetch_or_u32_impl
+ 3.33% 0.06% postgres [.] LWLockAcquire
+ 2.97% 0.01% postgres [.] pg_plan_queries
+ 2.95% 0.01% postgres [.] WALInsertLockAcquire
+ 2.94% 0.02% postgres [.] planner
+ 2.94% 0.01% postgres [.] pg_plan_query
+ 2.92% 0.01% postgres [.] LWLockDequeueSelf
+ 2.89% 0.04% postgres [.] standard_planner
+ 2.81% 0.00% postgres [.] WALInsertLockRelease
+ 2.80% 0.00% postgres [.] LWLockReleaseClearVar
+ 2.38% 0.07% postgres [.] subquery_planner
+ 2.35% 0.01% postgres [.] XactLogCommitRecord

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
TestArtifacts.zip application/zip 234.7 KB
v6-0001-Optimize-WAL-insertion-lock-acquisition-and-relea.patch application/x-patch 8.7 KB
