heavily contended lwlocks with long wait queues scale badly

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: heavily contended lwlocks with long wait queues scale badly
Date: 2022-10-27 16:59:14
Message-ID: 20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I am working on posting a patch series making relation extension more
scalable. As part of that I was running some benchmarks for workloads that I
thought should not or just positively impacted - but I was wrong, there was
some very significant degradation at very high client counts. After pulling my
hair out for quite a while to try to understand that behaviour, I figured out
that it's just a side-effect of *removing* some other contention. This
morning, turns out sleeping helps, I managed to reproduce it in an unmodified
postgres.

$ cat ~/tmp/txid.sql
SELECT txid_current();
$ for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";pgbench -n -M prepared -f ~/tmp/txid.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
1 60174
2 116169
4 208119
8 373685
16 515247
32 554726
64 497508
128 415097
256 334923
512 243679
768 192959
1024 157734
2048 82904
4096 32007

(I didn't properly round TPS, but that doesn't matter here)

Performance completely falls off a cliff starting at ~256 clients. There's
actually plenty CPU available here, so this isn't a case of running out of
CPU time.

Rather, the problem is very bad contention on the "spinlock" for the lwlock
wait list. I realized that something in that direction was off when trying to
investigate why I was seeing spin delays of substantial duration (>100ms).

The problem isn't a fundamental issue with lwlocks, it's that
LWLockDequeueSelf() does this:

LWLockWaitListLock(lock);

/*
* Can't just remove ourselves from the list, but we need to iterate over
* all entries as somebody else could have dequeued us.
*/
proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
{
if (iter.cur == MyProc->pgprocno)
{
found = true;
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
break;
}
}

I.e. it iterates over the whole waitlist to "find itself". The longer the
waitlist gets, the longer this takes. And the longer it takes for
LWLockWakeup() to actually wake up all waiters, the more likely it becomes
that LWLockDequeueSelf() needs to be called.

We can't make the trivial optimization and use proclist_contains(), because
PGPROC->lwWaitLink is also used for the list of processes to wake up in
LWLockWakeup().

But I think we can solve that fairly reasonably nonetheless. We can change
PGPROC->lwWaiting to not just be a boolean, but have three states:
0: not waiting
1: waiting in waitlist
2: waiting to be woken up

which we then can use in LWLockDequeueSelf() to only remove ourselves from the
list if we're on it. As removal from that list is protected by the wait list
lock, there's no race to worry about.

client patched HEAD
1 60109 60174
2 112694 116169
4 214287 208119
8 377459 373685
16 524132 515247
32 565772 554726
64 587716 497508
128 581297 415097
256 550296 334923
512 486207 243679
768 449673 192959
1024 410836 157734
2048 326224 82904
4096 250252 32007

Not perfect with the patch, but not awful either.

I suspect this issue might actually explain quite a few odd performance
behaviours we've seen at the larger end in the past. I think it has gotten a
bit worse with the conversion of lwlock.c to proclists (I see lots of
expensive multiplications to deal with sizeof(PGPROC)), but otherwise likely
exists at least as far back as ab5194e6f61, in 9.5.

I guess there's an argument for considering this a bug that we should
backpatch a fix for? But given the vintage, probably not? The only thing that
gives me pause is that this is quite hard to pinpoint as happening.

I've attached my quick-and-dirty patch. Obviously it'd need a few defines etc,
but I wanted to get this out to discuss before spending further time.

Greetings,

Andres Freund

Attachment Content-Type Size
fix-lwlock-queue.diff text/x-diff 2.7 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2022-10-27 17:13:46 Re: Reducing planning time on tables with many indexes
Previous Message Simon Riggs 2022-10-27 16:35:38 Re: Code checks for App Devs, using new options for transaction behavior