Re: WAL Insertion Lock Improvements

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL Insertion Lock Improvements
Date: 2023-05-10 12:04:38
Message-ID: ZFuIViUI6xy6VegI@paquier.xyz
Lists: pgsql-hackers

On Mon, May 08, 2023 at 05:57:09PM +0530, Bharath Rupireddy wrote:
> Note that I've used pg_logical_emit_message() for ease of
> understanding about the txns generating various amounts of WAL, but
> the pattern is the same if txns are generating various amounts of WAL
> say with inserts.

Sounds good to me to just rely on that for some comparison numbers.

+ * NB: LWLockConflictsWithVar (which is called from
+ * LWLockWaitForVar) relies on the spinlock used above in this
+ * function and doesn't use a memory barrier.

This patch adds the following comment in WaitXLogInsertionsToFinish()
because lwlock.c on HEAD mentions that:
/*
* Test first to see if it the slot is free right now.
*
* XXX: the caller uses a spinlock before this, so we don't need a memory
* barrier here as far as the current usage is concerned. But that might
* not be safe in general.
*/
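
For context, here is the caller's side, heavily simplified from
WaitXLogInsertionsToFinish() in xlog.c (the inner retry loop and the
tracking of the oldest position are elided); the spinlock taken before
the loop is what stands in for the missing barrier:

uint64		bytepos;
XLogRecPtr	reservedUpto;
int			i;

/*
 * Read the current insert position.  The spinlock doubles as the
 * memory barrier that LWLockConflictsWithVar() itself does not provide.
 */
SpinLockAcquire(&Insert->insertpos_lck);
bytepos = Insert->CurrBytePos;
SpinLockRelease(&Insert->insertpos_lck);
reservedUpto = XLogBytePosToEndRecPtr(bytepos);

for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
{
	XLogRecPtr	insertingat = InvalidXLogRecPtr;

	/* Reads insertingAt with no barrier of its own. */
	LWLockWaitForVar(&WALInsertLocks[i].l.lock,
					 &WALInsertLocks[i].l.insertingAt,
					 insertingat, &insertingat);
}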

Should we be more noisy about this at the top of LWLockWaitForVar()?
We don't want to add a memory barrier at the beginning of
LWLockConflictsWithVar(), still it strikes me that somebody aiming to
use LWLockWaitForVar() may miss this point, because LWLockWaitForVar()
is the routine published in lwlock.h, not LWLockConflictsWithVar().
This does not need to be really complicated, say a note at the top of
LWLockWaitForVar() along the lines of:
"Be careful that LWLockConflictsWithVar() does not include a memory
barrier, hence the caller of this function may want to rely on an
explicit barrier or a spinlock to avoid memory ordering issues."
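
Said differently, something like the following at the top of the
routine (just a sketch of the wording, with the argument as changed to
pg_atomic_uint64 by the patch):

/*
 * LWLockWaitForVar - Wait until lock is free, or a variable is updated
 * by the lock holder.
 *
 * NB: LWLockConflictsWithVar(), which performs the first check of the
 * variable, does not include a memory barrier.  Hence, a caller may
 * want to rely on an explicit barrier or a spinlock taken beforehand
 * to avoid memory ordering issues.
 */
bool
LWLockWaitForVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 oldval,
				 uint64 *newval)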

>> + * NB: Note the use of pg_atomic_exchange_u64 as opposed to just
>> + * pg_atomic_write_u64 to update the value. Since pg_atomic_exchange_u64 is
>> + * a full barrier, we're guaranteed that the subsequent shared memory
>> + * reads/writes, if any, happen after we reset the lock variable.
>>
>> This mentions that the subsequent read/write operations are safe, so
>> this refers to anything happening after the variable is reset. As
>> a full barrier, should we also mention that this is also ordered with
>> respect to anything that the caller did before clearing the variable?
>> From this perspective using pg_atomic_exchange_u64() makes sense to me
>> in LWLockReleaseClearVar().
>
> Wordsmithed that comment a bit.

- * Set the variable's value before releasing the lock, that prevents race
- * a race condition wherein a new locker acquires the lock, but hasn't yet
- * set the variables value.
[...]
+ * NB: pg_atomic_exchange_u64 is used here as opposed to just
+ * pg_atomic_write_u64 to update the variable. Since pg_atomic_exchange_u64
+ * is a full barrier, we're guaranteed that all loads and stores issued
+ * prior to setting the variable are completed before any loads or stores
+ * issued after setting the variable.

This is the same explanation as in LWLockUpdateVar(), except that we
lose the detail explaining why the update is done before releasing the
lock.
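
For reference, keeping both points, the whole function could look like
this (a sketch based on v6):

void
LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
	/*
	 * Update the variable before releasing the lock: a new locker could
	 * otherwise acquire the lock before the value is set.  Note that
	 * pg_atomic_exchange_u64() is a full barrier, so all loads and
	 * stores issued before clearing the variable are completed before
	 * any loads or stores issued afterwards.
	 */
	pg_atomic_exchange_u64(valptr, val);

	LWLockRelease(lock);
}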

It took me some time, but I have been able to deploy a big box to see
the effect of this patch at a rather large scale (64 vCPUs, 512GB of
memory), with the following test characteristics for HEAD and v6:
- TPS comparison with pgbench and pg_logical_emit_message() (script
sketched below).
- Record sizes of 16, 64, 256, 1k, 4k and 16k bytes.
- Clients and jobs equal, at 4, 16, 64, 256, 512, 1024, 2048 and 4096.
- Runs of 3 minutes for each of the 48 combinations per build, meaning
96 runs in total.
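
For the record, each run used a one-line custom script doing one
pg_logical_emit_message() call per transaction, with the length given
to repeat() adjusted for each record size (the script name and its
exact shape are mine, for illustration):

$ cat message.sql
-- Emit one WAL record with a payload of the wanted size.
SELECT pg_logical_emit_message(true, 'test', repeat('a', 1024));
$ pgbench -n -c 256 -j 256 -T 180 -f message.sql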

And here are the results I got:
message_size_b    |     16 |     64 |    256 |   1024 |  4096 | 16384
------------------|--------|--------|--------|--------|-------|-------
head_4_clients    |   3026 |   2965 |   2846 |   2880 |  2778 |  2412
head_16_clients   |  12087 |  11287 |  11670 |  11100 |  9003 |  5608
head_64_clients   |  42995 |  44005 |  43592 |  35437 | 21533 | 11273
head_256_clients  | 106775 | 109233 | 104201 |  80759 | 42118 | 16254
head_512_clients  | 153849 | 156950 | 142915 |  99288 | 57714 | 16198
head_1024_clients | 122102 | 123895 | 114248 | 117317 | 62270 | 16261
head_2048_clients | 126730 | 115594 | 109671 | 119564 | 62454 | 16298
head_4096_clients | 111564 | 111697 | 119164 | 123483 | 62430 | 16140
v6_4_clients      |   2893 |   2917 |   3087 |   2904 |  2786 |  2262
v6_16_clients     |  12097 |  11387 |  11579 |  11242 |  9228 |  5661
v6_64_clients     |  45124 |  46533 |  42275 |  36124 | 21696 | 11386
v6_256_clients    | 121500 | 125732 | 104328 |  78989 | 41949 | 16254
v6_512_clients    | 164120 | 174743 | 146677 |  98110 | 60228 | 16171
v6_1024_clients   | 168990 | 180710 | 149894 | 117431 | 62271 | 16259
v6_2048_clients   | 165426 | 162893 | 146322 | 132476 | 62468 | 16274
v6_4096_clients   | 161283 | 158732 | 162474 | 135636 | 62461 | 16030

These tests are not showing me any degradation, and there is a
correlation between the record size and the number of clients at which
the TPS begins to show a difference between HEAD and v6 of the patch.
In short, the shorter the record, the lower the client count at which
the improvement shows up, still this required at least 256~512 clients
even with messages of 16 bytes. At the end I'm cool with that.
--
Michael
