Re: BufferAlloc: don't take two simultaneous locks

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2021-10-01 22:46:26
Message-ID: CALNJ-vQF2tPW_rRkV-pBw0BNgBkKvUunt1vYj-5WHLkQV4aOSA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
wrote:

> Good day.
>
> I found an optimization opportunity in the Buffer Manager code, in the
> BufferAlloc function:
> - When a valid buffer is evicted, BufferAlloc acquires two partition
> lwlocks: one for the partition the evicted block is in, and one for
> the partition where the new block is placed.
>
> This doesn't matter when there are few concurrent replacements.
> But when a lot of concurrent backends replace buffers, a complex
> dependency net quickly arises.
>
> It is easily seen with select-only pgbench at scale 100 and 128MB
> shared buffers: scale 100 produces 1.5GB of tables, which certainly
> doesn't fit in shared buffers. Performance starts to degrade at
> ~100 connections. Even with 1GB shared buffers it slowly degrades
> beyond 150 connections.
>
> But strictly speaking, there is no need to hold both locks
> simultaneously. The buffer is pinned, so other processes cannot select
> it for eviction. Once its tag is cleared and the buffer is removed from
> the old partition, other processes will not find it. It is therefore
> safe to release the old partition lock before acquiring the new one.
>
> If another process concurrently inserts the same new buffer, the old
> buffer is placed on the buffer manager's freelist.
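>
> The locking change can be sketched as follows. This is simplified
> pseudocode, not the actual patch hunk: the function names follow
> bufmgr.c/buf_table.c, but variable names and the exact control flow
> around the race are assumptions.
>
> ```
> /* master: both partition locks are held at the same time */
> LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
> LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
> /* ... move the buffer between hash partitions ... */
>
> /* patched: the pin protects the buffer, so take locks one at a time */
> LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
> BufTableDelete(&oldTag, oldHash);   /* tag cleared: nobody can find it */
> LWLockRelease(oldPartitionLock);
> LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
> if (BufTableInsert(&newTag, newHash, buf_id) >= 0)
> {
>     /* lost the race: someone inserted the same block concurrently,
>      * so give the old buffer back to the buffer manager's freelist */
>     StrategyFreeBuffer(buf);
> }
> ```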
>
> An additional optimisation: if the old buffer is reused, there is no
> need to put its BufferLookupEnt onto dynahash's freelist, which reduces
> lock contention a bit more. To accomplish this, FreeListData.nentries
> is changed to pg_atomic_uint32/pg_atomic_uint64 and atomic
> increment/decrement is used.
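>
> In dynahash terms, the counter update becomes lock-free. A sketch only
> (the field and index names are assumptions; the real patch edits
> dynahash.c and uses PostgreSQL's pg_atomic API):
>
> ```
> /* before: nentries is a plain counter guarded by the freelist mutex */
> SpinLockAcquire(&hctl->freeList[idx].mutex);
> hctl->freeList[idx].nentries++;
> SpinLockRelease(&hctl->freeList[idx].mutex);
>
> /* after: atomic increment; no need to enter the freelist mutex
>  * just to account for a reused entry */
> pg_atomic_fetch_add_u32(&hctl->freeList[idx].nentries, 1);
> ```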
>
> Remark: there was a bug in `hash_update_hash_key`: nentries were not
> kept in sync if the freelist partitions differed. The bug was never
> triggered because the single caller of `hash_update_hash_key` doesn't
> move entries between partitions.
>
> Here are some test results.
>
> - pgbench at scale 100 was run with --select-only (since we want to
> test the buffer manager alone). It produces 1.5GB of tables.
> - two shared_buffers values were tested: 128MB and 1GB
> - the second-best result among five runs was taken
>
> Tests were made on three system configurations:
> - a notebook with an i7-1165G7 (limited to 2.8GHz to avoid overheating)
> - a Xeon X5675 6-core, 2-socket NUMA system (12 cores/24 threads)
> - the same Xeon X5675 restricted to a single socket
> (with numactl -m 0 -N 0)
>
> Results for i7-1165G7:
>
> conns | master | patched | master 1G | patched 1G
> --------+------------+------------+------------+------------
> 1 | 29667 | 29079 | 29425 | 29411
> 2 | 55577 | 55553 | 57974 | 57223
> 3 | 87393 | 87924 | 87246 | 89210
> 5 | 136222 | 136879 | 133775 | 133949
> 7 | 179865 | 176734 | 178297 | 175559
> 17 | 215953 | 214708 | 222908 | 223651
> 27 | 211162 | 213014 | 220506 | 219752
> 53 | 211620 | 218702 | 220906 | 225218
> 83 | 213488 | 221799 | 219075 | 228096
> 107 | 212018 | 222110 | 222502 | 227825
> 139 | 207068 | 220812 | 218191 | 226712
> 163 | 203716 | 220793 | 213498 | 226493
> 191 | 199248 | 217486 | 210994 | 221026
> 211 | 195887 | 217356 | 209601 | 219397
> 239 | 193133 | 215695 | 209023 | 218773
> 271 | 190686 | 213668 | 207181 | 219137
> 307 | 188066 | 214120 | 205392 | 218782
> 353 | 185449 | 213570 | 202120 | 217786
> 397 | 182173 | 212168 | 201285 | 216489
>
> Results for 1-socket X5675
>
> conns | master | patched | master 1G | patched 1G
> --------+------------+------------+------------+------------
> 1 | 16864 | 16584 | 17419 | 17630
> 2 | 32764 | 32735 | 34593 | 34000
> 3 | 47258 | 46022 | 49570 | 47432
> 5 | 64487 | 64929 | 68369 | 68885
> 7 | 81932 | 82034 | 87543 | 87538
> 17 | 114502 | 114218 | 127347 | 127448
> 27 | 116030 | 115758 | 130003 | 128890
> 53 | 116814 | 117197 | 131142 | 131080
> 83 | 114438 | 116704 | 130198 | 130985
> 107 | 113255 | 116910 | 129932 | 131468
> 139 | 111577 | 116929 | 129012 | 131782
> 163 | 110477 | 116818 | 128628 | 131697
> 191 | 109237 | 116672 | 127833 | 131586
> 211 | 108248 | 116396 | 127474 | 131650
> 239 | 107443 | 116237 | 126731 | 131760
> 271 | 106434 | 115813 | 126009 | 131526
> 307 | 105077 | 115542 | 125279 | 131421
> 353 | 104516 | 115277 | 124491 | 131276
> 397 | 103016 | 114842 | 123624 | 131019
>
> Results for 2-socket X5675
>
> conns | master | patched | master 1G | patched 1G
> --------+------------+------------+------------+------------
> 1 | 16323 | 16280 | 16959 | 17598
> 2 | 30510 | 31431 | 33763 | 31690
> 3 | 45051 | 45834 | 48896 | 47991
> 5 | 71800 | 73208 | 78077 | 74714
> 7 | 89792 | 89980 | 95986 | 96662
> 17 | 178319 | 177979 | 195566 | 196143
> 27 | 210475 | 205209 | 226966 | 235249
> 53 | 222857 | 220256 | 252673 | 251041
> 83 | 219652 | 219938 | 250309 | 250464
> 107 | 218468 | 219849 | 251312 | 251425
> 139 | 210486 | 217003 | 250029 | 250695
> 163 | 204068 | 218424 | 248234 | 252940
> 191 | 200014 | 218224 | 246622 | 253331
> 211 | 197608 | 218033 | 245331 | 253055
> 239 | 195036 | 218398 | 243306 | 253394
> 271 | 192780 | 217747 | 241406 | 253148
> 307 | 189490 | 217607 | 239246 | 253373
> 353 | 186104 | 216697 | 236952 | 253034
> 397 | 183507 | 216324 | 234764 | 252872
>
> As can be seen, the patched version degrades much more slowly than
> master (or doesn't degrade at all with 1GB shared buffers on the older
> processor).
>
> PS.
>
> There is room for further improvements:
> - the buffer manager's freelist could be partitioned
> - dynahash's freelist could be sized/aligned to the CPU cache line
> - in fact, there is no need for dynahash at all. It would be better to
> build a custom hash table using BufferDesc as entries; BufferDesc has
> spare space for a next link and a hash value.
>
> regards,
> Yura Sokolov
> y(dot)sokolov(at)postgrespro(dot)ru
> funny(dot)falcon(at)gmail(dot)com

Hi,
The improvement is impressive.

For BufTableFreeDeleted(), since it has only one call site, maybe its
caller can invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+ Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in
free_list_increment_nentries() which would maintain
hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers
