BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: BufferAlloc: don't take two simultaneous locks
Date: 2021-10-01 22:25:57
Message-ID: 1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Good day.

I found some opportunity in Buffer Manager code in BufferAlloc
function:
- When valid buffer is evicted, BufferAlloc acquires two partition
lwlocks: for partition for evicted block is in and partition for new
block placement.

It doesn't matter if there is small number of concurrent replacements.
But if there are a lot of concurrent backends replacing buffers,
complex dependency net quickly arose.

It could be easily seen with select-only pgbench with scale 100 and
shared buffers 128MB: scale 100 produces 1.5GB tables, and it certainly
doesn't fit shared buffers. This way performance starts to degrade at
~100 connections. Even with shared buffers 1GB it slowly degrades after
150 connections.

But strictly speaking, there is no need to hold both lock
simultaneously. Buffer is pinned so other processes could not select it
for eviction. If tag is cleared and buffer removed from old partition
then other processes will not find it. Therefore it is safe to release
old partition lock before acquiring new partition lock.

If other process concurrently inserts same new buffer, then old buffer
is placed to bufmanager's freelist.

Additional optimisation: in case of old buffer is reused, there is no
need to put its BufferLookupEnt into dynahash's freelist. That reduces
lock contention a bit more. To acomplish this FreeListData.nentries is
changed to pg_atomic_u32/pg_atomic_u64 and atomic increment/decrement
is used.

Remark: there were bug in the `hash_update_hash_key`: nentries were not
kept in sync if freelist partitions differ. This bug were never
triggered because single use of `hash_update_hash_key` doesn't move
entry between partitions.

There is some tests results.

- pgbench with scale 100 were tested with --select-only (since we want
to test buffer manager alone). It produces 1.5GB table.
- two shared_buffers values were tested: 128MB and 1GB.
- second best result were taken among five runs

Test were made in three system configurations:
- notebook with i7-1165G7 (limited to 2.8GHz to not overheat)
- Xeon X5675 6 core 2 socket NUMA system (12 cores/24 threads).
- same Xeon X5675 but restricted to single socket
(with numactl -m 0 -N 0)

Results for i7-1165G7:

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 29667 | 29079 | 29425 | 29411
2 | 55577 | 55553 | 57974 | 57223
3 | 87393 | 87924 | 87246 | 89210
5 | 136222 | 136879 | 133775 | 133949
7 | 179865 | 176734 | 178297 | 175559
17 | 215953 | 214708 | 222908 | 223651
27 | 211162 | 213014 | 220506 | 219752
53 | 211620 | 218702 | 220906 | 225218
83 | 213488 | 221799 | 219075 | 228096
107 | 212018 | 222110 | 222502 | 227825
139 | 207068 | 220812 | 218191 | 226712
163 | 203716 | 220793 | 213498 | 226493
191 | 199248 | 217486 | 210994 | 221026
211 | 195887 | 217356 | 209601 | 219397
239 | 193133 | 215695 | 209023 | 218773
271 | 190686 | 213668 | 207181 | 219137
307 | 188066 | 214120 | 205392 | 218782
353 | 185449 | 213570 | 202120 | 217786
397 | 182173 | 212168 | 201285 | 216489

Results for 1 socket X5675

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16864 | 16584 | 17419 | 17630
2 | 32764 | 32735 | 34593 | 34000
3 | 47258 | 46022 | 49570 | 47432
5 | 64487 | 64929 | 68369 | 68885
7 | 81932 | 82034 | 87543 | 87538
17 | 114502 | 114218 | 127347 | 127448
27 | 116030 | 115758 | 130003 | 128890
53 | 116814 | 117197 | 131142 | 131080
83 | 114438 | 116704 | 130198 | 130985
107 | 113255 | 116910 | 129932 | 131468
139 | 111577 | 116929 | 129012 | 131782
163 | 110477 | 116818 | 128628 | 131697
191 | 109237 | 116672 | 127833 | 131586
211 | 108248 | 116396 | 127474 | 131650
239 | 107443 | 116237 | 126731 | 131760
271 | 106434 | 115813 | 126009 | 131526
307 | 105077 | 115542 | 125279 | 131421
353 | 104516 | 115277 | 124491 | 131276
397 | 103016 | 114842 | 123624 | 131019

Results for 2 socket x5675

conns | master | patched | master 1G | patched 1G
--------+------------+------------+------------+------------
1 | 16323 | 16280 | 16959 | 17598
2 | 30510 | 31431 | 33763 | 31690
3 | 45051 | 45834 | 48896 | 47991
5 | 71800 | 73208 | 78077 | 74714
7 | 89792 | 89980 | 95986 | 96662
17 | 178319 | 177979 | 195566 | 196143
27 | 210475 | 205209 | 226966 | 235249
53 | 222857 | 220256 | 252673 | 251041
83 | 219652 | 219938 | 250309 | 250464
107 | 218468 | 219849 | 251312 | 251425
139 | 210486 | 217003 | 250029 | 250695
163 | 204068 | 218424 | 248234 | 252940
191 | 200014 | 218224 | 246622 | 253331
211 | 197608 | 218033 | 245331 | 253055
239 | 195036 | 218398 | 243306 | 253394
271 | 192780 | 217747 | 241406 | 253148
307 | 189490 | 217607 | 239246 | 253373
353 | 186104 | 216697 | 236952 | 253034
397 | 183507 | 216324 | 234764 | 252872

As can be seen, patched version degrades much slower than master.
(Or even doesn't degrade with 1G shared buffer on older processor).

PS.

There is a room for further improvements:
- buffer manager's freelist could be partitioned
- dynahash's freelist could be sized/aligned to CPU cache line
- in fact, there is no need in dynahash at all. It is better to make
custom hash-table using BufferDesc as entries. BufferDesc has spare
space for link to next and hashvalue.

regards,
Yura Sokolov
y(dot)sokolov(at)postgrespro(dot)ru
funny(dot)falcon(at)gmail(dot)com

Attachment Content-Type Size
v0-0001-bufmgr-do-not-acquire-two-partition-lo.patch text/x-patch 24.7 KB
image/gif 13.4 KB
image/gif 13.3 KB
image/gif 12.6 KB
res.zip application/zip 16.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-10-01 22:27:52 Adding CI to our tree
Previous Message Bossart, Nathan 2021-10-01 21:35:52 Re: Enabling deduplication with system catalog indexes