Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, andres(at)anarazel(dot)de, simon(dot)riggs(at)enterprisedb(dot)com
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-07 11:14:59
Message-ID: 045bf7d086e725ba30d7e41ed09651b3f836b0df.camel@postgrespro.ru
Lists: pgsql-hackers

On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> Hi, Yura.
>
> At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> > Ok, I got access to stronger server, did the benchmark, found weird
> > things, and so here is new version :-)
>
> Thanks for the new version and benchmarking.
>
> > First I found that if the table size is strictly limited to NBuffers and FIXED,
> > then under high concurrency get_hash_entry may not find a free entry
> > even though it must be there. It seems that while a process scans the free
> > lists, other concurrent processes "move entries around", i.e. one concurrent
> > process fetched it from one free list, another process put a new entry into
> > another freelist, and the unfortunate process missed it since it tests the
> > freelists only once.
>
> StrategyGetBuffer believes that entries don't move across freelists
> and it was true before this patch.

StrategyGetBuffer knows nothing about dynahash's freelists.
It knows about the buffer manager's freelist, which is not partitioned.

>
> > Second, I confirm there is a problem with freelist spreading.
> > If I keep the entry's freelist_idx, then one freelist is crowded.
> > If I use the new entry's freelist_idx, then one freelist is constantly
> > emptied.
>
> Perhaps it is what I saw before. I'm not sure about the details of
> how that happens, though.
>
> > Third, I found that increased concurrency could harm. When a popular block
> > is evicted for some reason, a thundering herd effect occurs:
> > many backends want to read the same block, they evict many other
> > buffers, but only one is inserted. The others go to the freelist. The evicted
> > buffers by themselves reduce the cache hit ratio and provoke more
> > work. The old version resists this effect by not removing the old buffer
> > before the new entry is successfully inserted.
>
> Nice finding.
>
> > To fix these issues I made the following:
> >
> > # Concurrency
> >
> > First, I limit concurrency by introducing another lwlock tranche -
> > BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
> > 128).
> > If a backend doesn't find the buffer in the buffer table and wants to
> > introduce it, it first calls
> > LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > If the lock was acquired, it proceeds to the eviction and replacement process.
> > Otherwise, it waits for the lock to be released and repeats the search.
> >
> > This greatly improves performance for > 400 clients in pgbench.
>
> So the performance difference between the existing code and v11 is that the
> latter has a collision cross-section eight times smaller than the
> former?

No. Acquiring EvictPartitionLock:
1. doesn't block readers, since readers don't acquire EvictPartitionLock;
2. doesn't form a "tree of lock dependencies", since EvictPartitionLock is
independent of PartitionLock.

Problem with the existing code:
1. Process A locks P1 and P2.
2. Process B (p3-old, p1-new) locks P3 and wants to lock P1.
3. Process C (p4-new, p1-old) locks P4 and wants to lock P1.
4. Process D (p5-new, p4-old) locks P5 and wants to lock P4.
At this moment P1, P2, P3, P4 and P5 are all locked, and everyone is
waiting on Process A.
And readers can't read from those five partitions.

With the new code:
1. Process A locks E1 (the evict partition) and locks P2,
then releases P2 and locks P1.
2. Process B tries to lock E1, waits, and retries the search.
3. Process C locks E4, locks P1, then releases P1 and locks P4.
4. Process D locks E5, locks P4, then releases P4 and locks P5.
So there is no network of locks.
Process A doesn't block Process D at any moment:
- either A blocks C, but C doesn't block D at that moment,
- or A doesn't block C.
And readers never see five simultaneously held locks that all
depend on a single Process A.
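
To make the flow concrete, here is a minimal sketch (not the patch text, and
the real patch differs in details): LWLockAcquire/LWLockRelease,
LWLockAcquireOrWait, BufMappingPartitionLock and BufTableLookup are existing
PostgreSQL APIs, while BufEvictPartitionLock(), FindVictimAndInsert() and
PinAndReturn() are hypothetical placeholders.

    static Buffer
    BufferAllocSketch(BufferTag newTag, uint32 newHash)
    {
        LWLock *newPartitionLock = BufMappingPartitionLock(newHash);
        LWLock *newEvictLock = BufEvictPartitionLock(newHash);  /* hypothetical */

        for (;;)
        {
            int     buf_id;

            /* Readers only ever take the mapping partition lock. */
            LWLockAcquire(newPartitionLock, LW_SHARED);
            buf_id = BufTableLookup(&newTag, newHash);
            LWLockRelease(newPartitionLock);

            if (buf_id >= 0)
                return PinAndReturn(buf_id);        /* hypothetical helper */

            /*
             * Only one backend per evict partition proceeds to evict; the
             * others sleep inside LWLockAcquireOrWait() and, once the winner
             * releases the lock, repeat the lookup instead of evicting more
             * buffers (thundering-herd mitigation).
             */
            if (LWLockAcquireOrWait(newEvictLock, LW_EXCLUSIVE))
            {
                buf_id = FindVictimAndInsert(&newTag, newHash); /* hypothetical */
                LWLockRelease(newEvictLock);
                return PinAndReturn(buf_id);
            }
            /* The lock was released by its holder: retry the search. */
        }
    }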

> + * Prevent "thundering herd" problem and limit concurrency.
>
> this is something like pressing the accelerator and brake pedals at the
> same time. If it improves performance, just increasing the number of
> buffer partitions seems to work?

To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
does improve the average case.
But it is better to cure the problem than to anesthetize it.
Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
weight of the lock network, but doesn't eliminate it.

> It's also not great that follower backends run a busy loop on the
> lock until the top-runner backend inserts the new buffer into the
> buftable and then releases the newPartitionLock.
>
> > I tried another variant as well:
> > - first insert an entry with a dummy buffer index into the buffer table.
> > - if such an entry was already there, then wait for it to be filled.
> > - otherwise find a victim buffer and replace the dummy index with the new one.
> > The wait was done with a shared lock on EvictPartitionLock as well.
> > This variant performed about the same.
>
> This one looks better to me. Since a partition can be shared by two or
> more new-buffers, condition variable seems to work better here...
>
> > Logically I like that variant more, but there is one gotcha:
> > FlushBuffer could fail with elog(ERROR). Therefore there is
> > a need to reliably remove the entry with the dummy index.
>
> Perhaps UnlockBuffers can do that.

Thanks for the suggestion. I'll try to investigate and retry that
variant of the patch.

> > And after all, I still need to hold EvictPartitionLock to notice
> > waiters.
> > I've tried to use ConditionVariable, but its performance was much
> > worse.
>
> How many CVs did you use?

I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
It doesn't matter.
It looks like the use of WaitLatch (which uses epoll) and/or a triple
SpinLockAcquire per good case (with two list traversals) is much worse
than PGSemaphoreLock (which uses a futex) and a single wait-list operation.

Another possibility is that while ConditionVariable eliminates the
thundering herd effect, it doesn't limit concurrency enough... but
that's just theory.

In reality, I'd like to try making BufferLookupEnt->id atomic
and adding an LWLock to BufferLookupEnt. I'll test it, but I doubt it could
be merged, since there is no way to initialize dynahash's entries
reliably.

> > # Dynahash capacity and freelists.
> >
> > I restored the buffer table initialization:
> > - removed the FIXES_SIZE restriction introduced in the previous version
>
> Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE..

v9 contains HASH_FIXED_SIZE - line 815 of patch, PATCH 3/4 "fixed BufTable".

> > - returned `NBuffers + NUM_BUFFER_PARTITIONS`.
> > I really think there should be more spare items, since almost always
> > entry_alloc is called at least once (on 128MB shared_buffers). But
> > let's keep it as is for now.
>
> Maybe s/entry_alloc/element_alloc/ ? :p

:p yes

> I see it with shared_buffers=128kB (not MB) and pgbench -i on master.
>
> The required number of elements is already allocated to freelists at
> hash creation. So the reason for the call is imbalanced use among
> freelists. Even in that case other freelists hold elements. So we
> don't need to expand the element size.
>
> > `get_hash_entry` was changed to probe NUM_FREELISTS/4 (==8) freelists
> > before falling back to `entry_alloc`, and probing was changed from
> > linear to quadratic. This greatly reduces the number of calls to
> > `entry_alloc`, so more shared memory is left intact. And I didn't notice
> > a large performance hit from it. Probably there is some, but I think it is
> > an adequate trade-off.
>
> I don't think that causes significant performance hit, but I don't
> understand how it improves freelist hit ratio other than by accident.
> Could you have some reasoning for it?

Since free_reused_entry returns the entry to a random freelist, the
probability of finding a free entry in the probed freelists is quite high.
In tests, I see stabilisation.

> By the way the change of get_hash_entry looks something wrong.
>
> If I understand it correctly, it visits num_freelists/4 freelists at
> once, then tries element_alloc. If element_alloc() fails (that must
> happen), it only tries freeList[freelist_idx] and gives up, even
> though there must be an element in other 3/4 freelists.

No. If element_alloc fails, all NUM_FREELISTS are tried again:
- the loop condition is `ntries || !allocFailed`; once an allocation has
failed, `allocFailed` is set, so only `ntries` keeps the loop going;
- and `ntries = num_freelists;` is set regardless of `allocFailed`.
Therefore, all `NUM_FREELISTS` freelists are retried for a partitioned table.
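
For clarity, a rough sketch of the probing order being described, under the
assumption that it follows dynahash.c's internal layout (FreeListData,
NUM_FREELISTS); the step sequence and the helper itself are a reconstruction,
not the patch code, and nentries accounting is elided.

    static HASHELEMENT *
    get_free_element_sketch(HASHHDR *hctl, int freelist_idx)
    {
        int     ntries = NUM_FREELISTS / 4; /* probe 8 of the 32 freelists */
        int     step = 0;
        int     idx = freelist_idx;

        while (ntries-- > 0)
        {
            FreeListData *fl = &hctl->freeList[idx];
            HASHELEMENT *elem;

            SpinLockAcquire(&fl->mutex);
            elem = fl->freeList;
            if (elem != NULL)
            {
                fl->freeList = elem->link;  /* unlink the free element */
                SpinLockRelease(&fl->mutex);
                return elem;
            }
            SpinLockRelease(&fl->mutex);

            /* quadratic probing: offsets 1, 3, 6, 10, ... from freelist_idx */
            step++;
            idx = (idx + step) % NUM_FREELISTS;
        }

        return NULL;    /* caller falls back to element_alloc() and retries */
    }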

>
> > `free_reused_entry` now returns the entry to a random position. It flattens
> > the spread of free entries. Although it is not enough without the other changes
> > (thundering herd mitigation and probing more lists in get_hash_entry).
>
> If "thudering herd" means "many backends rush trying to read-in the
> same page at once", isn't it avoided by the change in BufferAlloc?

"thundering herd" reduces speed of entries migration a lot. But
`simple_select` benchmark is too biased: looks like btree root is
evicted from time to time. So entries are slowly migrated to of from
freelist of its partition.
Without "thundering herd" fix this migration is very fast.

> I feel the random returning method might work. I want to get rid of
> the randomness here but I don't come up with a better way.
>
> Anyway the code path is used only by buftable so it doesn't harm
> generally.
>
> > # Benchmarks
>
> # Thanks for benchmarking!!
>
> > Benchmarked on two socket Xeon(R) Gold 5220 CPU @2.20GHz
> > 18 cores per socket + hyper-threading - up to 72 virtual cores total.
> > turbo-boost disabled
> > Linux 5.10.103-1 Debian.
> >
> > pgbench scale 100 simple_select + simple select with 3 keys (sql file
> > attached).
> >
> > shared buffers 128MB & 1GB
> > huge_pages=on
> >
> > 1 socket
> > conns | master | patch-v11 | master 1G | patch-v11 1G
> > --------+------------+------------+------------+------------
> > 1 | 27882 | 27738 | 32735 | 32439
> > 2 | 54082 | 54336 | 64387 | 63846
> > 3 | 80724 | 81079 | 96387 | 94439
> > 5 | 134404 | 133429 | 160085 | 157399
> > 7 | 185977 | 184502 | 219916 | 217142
>
> v11+128MB degrades above here..

+ 1GB?

>
> > 17 | 335345 | 338214 | 393112 | 388796
> > 27 | 393686 | 394948 | 447945 | 444915
> > 53 | 572234 | 577092 | 678884 | 676493
> > 83 | 558875 | 561689 | 669212 | 655697
> > 107 | 553054 | 551896 | 654550 | 646010
> > 139 | 541263 | 538354 | 641937 | 633840
> > 163 | 532932 | 531829 | 635127 | 627600
> > 191 | 524647 | 524442 | 626228 | 617347
> > 211 | 521624 | 522197 | 629740 | 613143
>
> v11+1GB degrades above here..
>
> > 239 | 509448 | 554894 | 652353 | 652972
> > 271 | 468190 | 557467 | 647403 | 661348
> > 307 | 454139 | 558694 | 642229 | 657649
> > 353 | 446853 | 554301 | 635991 | 654571
> > 397 | 441909 | 549822 | 625194 | 647973
> >
> > 1 socket 3 keys
> >
> > conns | master | patch-v11 | master 1G | patch-v11 1G
> > --------+------------+------------+------------+------------
> > 1 | 16677 | 16477 | 22219 | 22030
> > 2 | 32056 | 31874 | 43298 | 43153
> > 3 | 48091 | 47766 | 64877 | 64600
> > 5 | 78999 | 78609 | 105433 | 106101
> > 7 | 108122 | 107529 | 148713 | 145343
>
> v11+128MB degrades above here..
>
> > 17 | 205656 | 209010 | 272676 | 271449
> > 27 | 252015 | 254000 | 323983 | 323499
>
> v11+1GB degrades above here..
>
> > 53 | 317928 | 334493 | 446740 | 449641
> > 83 | 299234 | 327738 | 437035 | 443113
> > 107 | 290089 | 322025 | 430535 | 431530
> > 139 | 277294 | 314384 | 422076 | 423606
> > 163 | 269029 | 310114 | 416229 | 417412
> > 191 | 257315 | 306530 | 408487 | 416170
> > 211 | 249743 | 304278 | 404766 | 416393
> > 239 | 243333 | 310974 | 397139 | 428167
> > 271 | 236356 | 309215 | 389972 | 427498
> > 307 | 229094 | 307519 | 382444 | 425891
> > 353 | 224385 | 305366 | 375020 | 423284
> > 397 | 218549 | 302577 | 364373 | 420846
> >
> > 2 sockets
> >
> > conns | master | patch-v11 | master 1G | patch-v11 1G
> > --------+------------+------------+------------+------------
> > 1 | 27287 | 27631 | 32943 | 32493
> > 2 | 52397 | 54011 | 64572 | 63596
> > 3 | 76157 | 80473 | 93363 | 93528
> > 5 | 127075 | 134310 | 153176 | 149984
> > 7 | 177100 | 176939 | 216356 | 211599
> > 17 | 379047 | 383179 | 464249 | 470351
> > 27 | 545219 | 546706 | 664779 | 662488
> > 53 | 728142 | 728123 | 857454 | 869407
> > 83 | 918276 | 957722 | 1215252 | 1203443
>
> v11+1GB degrades above here..
>
> > 107 | 884112 | 971797 | 1206930 | 1234606
> > 139 | 822564 | 970920 | 1167518 | 1233230
> > 163 | 788287 | 968248 | 1130021 | 1229250
> > 191 | 772406 | 959344 | 1097842 | 1218541
> > 211 | 756085 | 955563 | 1077747 | 1209489
> > 239 | 732926 | 948855 | 1050096 | 1200878
> > 271 | 692999 | 941722 | 1017489 | 1194012
> > 307 | 668241 | 920478 | 994420 | 1179507
> > 353 | 642478 | 908645 | 968648 | 1174265
> > 397 | 617673 | 893568 | 950736 | 1173411
> >
> > 2 sockets 3 keys
> >
> > conns | master | patch-v11 | master 1G | patch-v11 1G
> > --------+------------+------------+------------+------------
> > 1 | 16722 | 16393 | 20340 | 21813
> > 2 | 32057 | 32009 | 39993 | 42959
> > 3 | 46202 | 47678 | 59216 | 64374
> > 5 | 78882 | 72002 | 98054 | 103731
> > 7 | 103398 | 99538 | 135098 | 135828
>
> v11+128MB degrades above here..
>
> > 17 | 205863 | 217781 | 293958 | 299690
> > 27 | 283526 | 290539 | 414968 | 411219
> > 53 | 336717 | 356130 | 460596 | 474563
> > 83 | 307310 | 342125 | 419941 | 469989
> > 107 | 294059 | 333494 | 405706 | 469593
> > 139 | 278453 | 328031 | 390984 | 470553
> > 163 | 270833 | 326457 | 384747 | 470977
> > 191 | 259591 | 322590 | 376582 | 470335
> > 211 | 263584 | 321263 | 375969 | 469443
> > 239 | 257135 | 316959 | 370108 | 470904
> > 271 | 251107 | 315393 | 365794 | 469517
> > 307 | 246605 | 311585 | 360742 | 467566
> > 353 | 236899 | 308581 | 353464 | 466936
> > 397 | 249036 | 305042 | 344673 | 466842
> >
> > I skipped v10 since I used it internally for variant
> > "insert entry with dummy index then search victim".
>
> Up to about 15%(?) of gain is great.

Up to 35% in "2 socket 3 key 1GB" case.
Up to 44% in "2 socket 1 key 128MB" case.
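(Presumably these come from the 397-connection rows above:
466842 / 344673 ≈ 1.354 for "2 sockets 3 keys 1GB" and
893568 / 617673 ≈ 1.447 for "2 sockets 1 key 128MB",
i.e. roughly the 35% and 44% gains over master.)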

> I'm not sure it is okay that it seems to slow by about 1%..

Well, in fact some of the degradation is not reproducible.
Surprisingly, the results change a bit from time to time.
I just didn't rerun the whole `master` branch benchmark again
after the v11 benchmark, since each whole test run costs me 1.5 hours.

But I confirm the regression on the "1 socket 1 key 1GB" test case
between 83 and 211 connections. It was reproducible on a
more powerful Xeon 8354H, although it was less visible there.

Other fluctuations close to 1% are not reliable.
For example, sometimes I see degradation or improvement with
2GB shared buffers (and even by more than 1%). But 2GB is enough
for the whole test dataset (scale 100 pgbench is 1.5GB on disk),
so the modified code is not involved in the benchmark at all.
How could that be explained?
That is why I don't post the 2GB benchmark results. (Yeah, I'm
cheating a bit.)

> Ah, I see.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>
