Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Zhihong Yu <zyu(at)yugabyte(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2021-10-04 04:18:56
Message-ID: ae8b7f99f8ee9583bd96e20e271b5969056ae44a.camel@postgrespro.ru
Lists: pgsql-hackers

On Fri, 01/10/2021 at 15:46 -0700, Zhihong Yu wrote:
>
>
> On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
> wrote:
> > Good day.
> >
> > I found some opportunity in Buffer Manager code in BufferAlloc
> > function:
> > - When valid buffer is evicted, BufferAlloc acquires two partition
> > lwlocks: for partition for evicted block is in and partition for new
> > block placement.
> >
> > This doesn't matter when there is a small number of concurrent
> > replacements. But when many backends replace buffers concurrently,
> > a complex net of lock dependencies quickly arises.
> >
> > This is easily seen with select-only pgbench at scale 100 and
> > shared_buffers = 128MB: scale 100 produces a 1.5GB table, which
> > certainly doesn't fit in shared buffers. Performance starts to
> > degrade at ~100 connections. Even with shared_buffers = 1GB it
> > slowly degrades after 150 connections.
> >
> > But strictly speaking, there is no need to hold both locks
> > simultaneously. The buffer is pinned, so other processes cannot
> > select it for eviction. Once its tag is cleared and the buffer is
> > removed from the old partition, other processes will not find it.
> > Therefore it is safe to release the old partition lock before
> > acquiring the new partition lock.
> >
> > If another process concurrently inserts the same new buffer, the
> > old buffer is placed on the buffer manager's freelist.
> >
> > Additional optimisation: when the old buffer is reused, there is
> > no need to put its BufferLookupEnt onto dynahash's freelist, which
> > reduces lock contention a bit more. To accomplish this,
> > FreeListData.nentries is changed to pg_atomic_uint32/pg_atomic_uint64
> > and atomic increment/decrement is used.
> >
> > Remark: there was a bug in `hash_update_hash_key`: nentries was
> > not kept in sync when the old and new freelist partitions differ.
> > The bug was never triggered because the single caller of
> > `hash_update_hash_key` doesn't move entries between partitions.
> >
> > Here are some test results.
> >
> > - pgbench at scale 100 was run with --select-only (since we want
> >   to test the buffer manager alone); it produces a 1.5GB table
> > - two shared_buffers values were tested: 128MB and 1GB
> > - the second-best result of five runs was taken
> >
> > Tests were run on three system configurations:
> > - a notebook with i7-1165G7 (limited to 2.8GHz to avoid overheating)
> > - a two-socket Xeon X5675 NUMA system (6 cores per socket,
> >   12 cores/24 threads total)
> > - the same Xeon X5675 restricted to a single socket
> >   (with numactl -m 0 -N 0)
> >
> > Results for i7-1165G7:
> >
> > conns | master | patched | master 1G | patched 1G
> > --------+------------+------------+------------+------------
> > 1 | 29667 | 29079 | 29425 | 29411
> > 2 | 55577 | 55553 | 57974 | 57223
> > 3 | 87393 | 87924 | 87246 | 89210
> > 5 | 136222 | 136879 | 133775 | 133949
> > 7 | 179865 | 176734 | 178297 | 175559
> > 17 | 215953 | 214708 | 222908 | 223651
> > 27 | 211162 | 213014 | 220506 | 219752
> > 53 | 211620 | 218702 | 220906 | 225218
> > 83 | 213488 | 221799 | 219075 | 228096
> > 107 | 212018 | 222110 | 222502 | 227825
> > 139 | 207068 | 220812 | 218191 | 226712
> > 163 | 203716 | 220793 | 213498 | 226493
> > 191 | 199248 | 217486 | 210994 | 221026
> > 211 | 195887 | 217356 | 209601 | 219397
> > 239 | 193133 | 215695 | 209023 | 218773
> > 271 | 190686 | 213668 | 207181 | 219137
> > 307 | 188066 | 214120 | 205392 | 218782
> > 353 | 185449 | 213570 | 202120 | 217786
> > 397 | 182173 | 212168 | 201285 | 216489
> >
> > Results for 1-socket X5675:
> >
> > conns | master | patched | master 1G | patched 1G
> > --------+------------+------------+------------+------------
> > 1 | 16864 | 16584 | 17419 | 17630
> > 2 | 32764 | 32735 | 34593 | 34000
> > 3 | 47258 | 46022 | 49570 | 47432
> > 5 | 64487 | 64929 | 68369 | 68885
> > 7 | 81932 | 82034 | 87543 | 87538
> > 17 | 114502 | 114218 | 127347 | 127448
> > 27 | 116030 | 115758 | 130003 | 128890
> > 53 | 116814 | 117197 | 131142 | 131080
> > 83 | 114438 | 116704 | 130198 | 130985
> > 107 | 113255 | 116910 | 129932 | 131468
> > 139 | 111577 | 116929 | 129012 | 131782
> > 163 | 110477 | 116818 | 128628 | 131697
> > 191 | 109237 | 116672 | 127833 | 131586
> > 211 | 108248 | 116396 | 127474 | 131650
> > 239 | 107443 | 116237 | 126731 | 131760
> > 271 | 106434 | 115813 | 126009 | 131526
> > 307 | 105077 | 115542 | 125279 | 131421
> > 353 | 104516 | 115277 | 124491 | 131276
> > 397 | 103016 | 114842 | 123624 | 131019
> >
> > Results for 2-socket X5675:
> >
> > conns | master | patched | master 1G | patched 1G
> > --------+------------+------------+------------+------------
> > 1 | 16323 | 16280 | 16959 | 17598
> > 2 | 30510 | 31431 | 33763 | 31690
> > 3 | 45051 | 45834 | 48896 | 47991
> > 5 | 71800 | 73208 | 78077 | 74714
> > 7 | 89792 | 89980 | 95986 | 96662
> > 17 | 178319 | 177979 | 195566 | 196143
> > 27 | 210475 | 205209 | 226966 | 235249
> > 53 | 222857 | 220256 | 252673 | 251041
> > 83 | 219652 | 219938 | 250309 | 250464
> > 107 | 218468 | 219849 | 251312 | 251425
> > 139 | 210486 | 217003 | 250029 | 250695
> > 163 | 204068 | 218424 | 248234 | 252940
> > 191 | 200014 | 218224 | 246622 | 253331
> > 211 | 197608 | 218033 | 245331 | 253055
> > 239 | 195036 | 218398 | 243306 | 253394
> > 271 | 192780 | 217747 | 241406 | 253148
> > 307 | 189490 | 217607 | 239246 | 253373
> > 353 | 186104 | 216697 | 236952 | 253034
> > 397 | 183507 | 216324 | 234764 | 252872
> >
> > As can be seen, the patched version degrades much more slowly than
> > master (or doesn't degrade at all with 1GB shared buffers on the
> > older processor).
> >
> > PS.
> >
> > There is room for further improvement:
> > - the buffer manager's freelist could be partitioned
> > - dynahash's freelists could be sized/aligned to the CPU cache line
> > - in fact, there is no need for dynahash at all: it would be better
> >   to build a custom hash table using BufferDesc as entries, since
> >   BufferDesc has spare space for a next-link and a hash value.
> >
> > regards,
> > Yura Sokolov
> > y(dot)sokolov(at)postgrespro(dot)ru
> > funny(dot)falcon(at)gmail(dot)com
>
> Hi,
> Improvement is impressive.

Thank you!

> For BufTableFreeDeleted(), since it only has one call, maybe its
> caller can invoke hash_return_to_freelist() directly.

That would be a dirty break of abstraction: everywhere else we talk to
BufTable, and here we would suddenly talk to hash ... ugh.

> For free_list_decrement_nentries():
>
> + Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);
>
> Is the assertion necessary? There is a similar assertion in
> free_list_increment_nentries() which would maintain
> hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

The assertion in free_list_decrement_nentries is absolutely necessary:
it is the direct translation of Assert(nentries >= 0) from signed to
unsigned types. (Since there are no signed atomics in PostgreSQL, I had
to convert the signed `long nentries` to an unsigned `pg_atomic_uXX
nentries`.)

The assertion in free_list_increment_nentries is not necessary, but it
doesn't hurt either: it is just an Assert, which doesn't compile into
production code.

regards

Yura Sokolov
y(dot)sokolov(at)postgrespro(dot)ru
funny(dot)falcon(at)gmail(dot)com
