Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-06-28 11:13:06
Message-ID: 0b35f32057974441f30d5f94aef87e319498c260.camel@postgrespro.ru
Lists: pgsql-hackers

Good day, hackers.

This is a continuation of the BufferAlloc saga.

This time I've tried to implement the following approach:
- if there's no buffer, insert a placeholder
- then find a victim
- if another backend wants to insert the same buffer, it waits on a
ConditionVariable.

The patch makes a separate ConditionVariable per backend, and the
placeholder contains the backend id. So waiters don't suffer from
collisions on a partition; they wait for exactly the buffer they need.
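
The placeholder protocol described above can be sketched roughly as
follows. This is only a single-threaded illustration, not code from the
patch: the names (MapEntry, lookup_or_reserve, publish), the fixed-size
table and the integer tags are all hypothetical, and partition locking
and the actual ConditionVariable wait are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8
#define NO_BACKEND (-1)

typedef enum
{
    LOOKUP_FOUND,       /* entry maps to a valid buffer */
    LOOKUP_RESERVED,    /* caller inserted a placeholder, must find a victim */
    LOOKUP_MUST_WAIT,   /* another backend owns the placeholder */
    LOOKUP_TABLE_FULL   /* sketch-only error case */
} LookupResult;

typedef struct
{
    int used;    /* slot occupied? */
    int tag;     /* hypothetical page identifier */
    int buf_id;  /* valid only when owner == NO_BACKEND */
    int owner;   /* backend id of the inserter while this is a placeholder */
} MapEntry;

static MapEntry table[TABLE_SIZE];

MapEntry *find_entry(int tag)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].used && table[i].tag == tag)
            return &table[i];
    return NULL;
}

/*
 * Look up tag; on a miss, insert a placeholder owned by my_backend.
 * *out receives the buffer id (FOUND) or the id of the backend whose
 * ConditionVariable the caller would wait on (MUST_WAIT).
 */
LookupResult lookup_or_reserve(int tag, int my_backend, int *out)
{
    MapEntry *e = find_entry(tag);

    if (e != NULL)
    {
        if (e->owner == NO_BACKEND)
        {
            *out = e->buf_id;
            return LOOKUP_FOUND;
        }
        *out = e->owner;
        return LOOKUP_MUST_WAIT;
    }
    for (size_t i = 0; i < TABLE_SIZE; i++)
    {
        if (!table[i].used)
        {
            table[i] = (MapEntry) {.used = 1, .tag = tag,
                                   .buf_id = -1, .owner = my_backend};
            *out = -1;
            return LOOKUP_RESERVED;
        }
    }
    return LOOKUP_TABLE_FULL;
}

/* The owner found a victim buffer: turn the placeholder into a real entry. */
void publish(int tag, int my_backend, int buf_id)
{
    MapEntry *e = find_entry(tag);

    assert(e != NULL && e->owner == my_backend);
    e->buf_id = buf_id;
    e->owner = NO_BACKEND;
    /* this is the point where waiters on the owner's CV would be woken */
}
```

Because the placeholder carries the owner's backend id, a waiter knows
which per-backend CV to sleep on instead of waiting on something shared
by the whole partition.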

This patch doesn't contain any dynahash changes, since the order of
operations doesn't change: "insert then delete". So there is no way to
"reserve" an entry.

But it does contain changes to ConditionVariable:

- adds ConditionVariableSleepOnce, which doesn't reinsert the process
back into the CV's proclist.
Unlike ConditionVariableSleep, this method cannot be used in a loop,
and ConditionVariablePrepareSleep must be called before it.

- adds ConditionVariableBroadcastFast, an improvement over the regular
ConditionVariableBroadcast that wakes processes in batches.
So CVBroadcastFast doesn't acquire/release the CV's spinlock mutex for
every proclist entry, but rather once per batch of entries.

I believe it could safely replace ConditionVariableBroadcast, though
I haven't yet tried replacing it and checking.
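
To illustrate the batching idea, here is a minimal single-threaded
sketch that only counts how often the lock would be taken. It is not
the patch's code: WaitList, wake(), BATCH_SIZE and the counter standing
in for the spinlock are hypothetical, and the batch size used by the
patch is not stated in this message.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 256
#define BATCH_SIZE  16      /* hypothetical value */

typedef struct
{
    int    waiters[MAX_WAITERS];  /* ids of waiting processes, FIFO */
    size_t head;
    size_t tail;
    size_t lock_acquisitions;     /* stand-in for spinlock acquire/release */
} WaitList;

void waitlist_add(WaitList *wl, int proc)
{
    assert(wl->tail < MAX_WAITERS);
    wl->waiters[wl->tail++] = proc;
}

/* Hypothetical wakeup; a real implementation would set the process latch. */
void wake(int proc)
{
    (void) proc;
}

/* One lock round trip per woken waiter, plus one to see the empty list. */
size_t broadcast_one_by_one(WaitList *wl)
{
    size_t woken = 0;

    for (;;)
    {
        wl->lock_acquisitions++;            /* lock */
        if (wl->head == wl->tail)
            break;                          /* unlock, list is empty */
        int proc = wl->waiters[wl->head++]; /* pop one entry, unlock */
        wake(proc);
        woken++;
    }
    return woken;
}

/* One lock round trip per batch of up to BATCH_SIZE waiters. */
size_t broadcast_batched(WaitList *wl)
{
    size_t woken = 0;

    for (;;)
    {
        int    batch[BATCH_SIZE];
        size_t n = 0;

        wl->lock_acquisitions++;            /* lock */
        while (n < BATCH_SIZE && wl->head != wl->tail)
            batch[n++] = wl->waiters[wl->head++];
        /* unlock, then wake the whole batch outside the lock */
        for (size_t i = 0; i < n; i++)
            wake(batch[i]);
        woken += n;
        if (n < BATCH_SIZE)
            break;                          /* list was drained */
    }
    return woken;
}
```

With 40 waiters queued, the one-by-one variant takes the lock 41 times
in this sketch, while the batched variant takes it 3 times.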

Tests:
- tests were done on a 2-socket Xeon 5220 2.20GHz with turbo boost
disabled (i.e. max frequency is 2.20GHz)
- runs on 1 socket or 2 sockets using numactl
- pgbench scale 100 - 1.5GB of data
- shared_buffers : 128MB, 1GB (and 2GB)
- variations of simple_select with 1 key per query, 3 keys per query
and 10 keys per query.

1 socket 1 key

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 25670 | 24926 | 29491 | 28858
2 | 50157 | 48894 | 58356 | 57180
3 | 75036 | 72904 | 87152 | 84869
5 | 124479 | 120720 | 143550 | 140799
7 | 168586 | 164277 | 199360 | 195578
17 | 319943 | 314010 | 364963 | 358550
27 | 423617 | 420528 | 491493 | 485139
53 | 491357 | 490994 | 574477 | 571753
83 | 487029 | 486750 | 571057 | 566335
107 | 478429 | 479862 | 565471 | 560115
139 | 467953 | 469981 | 556035 | 551056
163 | 459467 | 463272 | 548976 | 543660
191 | 448420 | 456105 | 540881 | 534556
211 | 440229 | 458712 | 545195 | 535333
239 | 431754 | 471373 | 547111 | 552591
271 | 421767 | 473479 | 544014 | 557910
307 | 408234 | 474285 | 539653 | 556629
353 | 389360 | 472491 | 534719 | 554696
397 | 377063 | 471513 | 527887 | 554383

1 socket 3 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 15277 | 14917 | 20109 | 19564
2 | 29587 | 28892 | 39430 | 36986
3 | 44204 | 43198 | 58993 | 57196
5 | 71471 | 68703 | 96923 | 92497
7 | 98823 | 97823 | 133173 | 130134
17 | 201351 | 198865 | 258139 | 254702
27 | 254959 | 255503 | 338117 | 339044
53 | 277048 | 291923 | 384300 | 390812
83 | 251486 | 287247 | 376170 | 385302
107 | 232037 | 281922 | 365585 | 380532
139 | 210478 | 276544 | 352430 | 373815
163 | 193875 | 271842 | 341636 | 368034
191 | 179544 | 267033 | 334408 | 362985
211 | 172837 | 269329 | 330287 | 366478
239 | 162647 | 272046 | 322646 | 371807
271 | 153626 | 271423 | 314017 | 371062
307 | 144122 | 270540 | 305358 | 370462
353 | 129544 | 268239 | 292867 | 368162
397 | 123430 | 267112 | 284394 | 366845

1 socket 10 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 6824 | 6735 | 10475 | 10220
2 | 13037 | 12628 | 20382 | 19849
3 | 19416 | 19043 | 30369 | 29554
5 | 31756 | 30657 | 49402 | 48614
7 | 42794 | 42179 | 67526 | 65071
17 | 91443 | 89772 | 139630 | 139929
27 | 107751 | 110689 | 165996 | 169955
53 | 97128 | 120621 | 157670 | 184382
83 | 82344 | 117814 | 142380 | 183863
107 | 70764 | 115841 | 134266 | 182426
139 | 57561 | 112528 | 125090 | 180121
163 | 50490 | 110443 | 119932 | 178453
191 | 45143 | 108583 | 114690 | 175899
211 | 42375 | 107604 | 111444 | 174109
239 | 39861 | 106702 | 106253 | 172410
271 | 37398 | 105819 | 102260 | 170792
307 | 35279 | 105355 | 97164 | 168313
353 | 33427 | 103537 | 91629 | 166232
397 | 31778 | 101793 | 87230 | 164381

2 sockets 1 key

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 24839 | 24386 | 29246 | 28361
2 | 46655 | 45265 | 55942 | 54327
3 | 69278 | 68332 | 83984 | 81608
5 | 115263 | 112746 | 139012 | 135426
7 | 159881 | 155119 | 193846 | 188399
17 | 373808 | 365085 | 456463 | 441603
27 | 503663 | 495443 | 600335 | 584741
53 | 708849 | 744274 | 900923 | 908488
83 | 593053 | 862003 | 985953 | 1038033
107 | 431806 | 875704 | 957115 | 1075172
139 | 328380 | 879890 | 881652 | 1069872
163 | 288339 | 874792 | 824619 | 1064047
191 | 255666 | 870532 | 790583 | 1061124
211 | 241230 | 865975 | 764898 | 1058473
239 | 227344 | 857825 | 732353 | 1049745
271 | 216095 | 848240 | 703729 | 1043182
307 | 206978 | 833980 | 674711 | 1031533
353 | 198426 | 803830 | 633783 | 1018479
397 | 191617 | 744466 | 599170 | 1006134

2 sockets 3 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 14688 | 14088 | 18912 | 18905
2 | 26759 | 25925 | 36817 | 35924
3 | 40002 | 38658 | 54765 | 53266
5 | 63479 | 63041 | 90521 | 87496
7 | 88561 | 87101 | 123425 | 121877
17 | 199411 | 196932 | 289555 | 282146
27 | 270121 | 275950 | 386884 | 383019
53 | 202918 | 374848 | 395967 | 501648
83 | 149599 | 363623 | 335815 | 478628
107 | 126501 | 348125 | 311617 | 472473
139 | 106091 | 331350 | 279843 | 466408
163 | 95497 | 321978 | 260884 | 461688
191 | 87427 | 312815 | 241189 | 458252
211 | 82783 | 307261 | 231435 | 454327
239 | 78930 | 299661 | 219655 | 451826
271 | 74081 | 294233 | 211555 | 448412
307 | 71352 | 288133 | 202838 | 446143
353 | 67872 | 279948 | 193354 | 441929
397 | 66178 | 275784 | 185556 | 438330

2 sockets 10 keys

conns | master 128M | v12 128M | master 1G | v12 1G
--------+--------------+--------------+--------------+--------------
1 | 6200 | 6108 | 10163 | 9563
2 | 11196 | 10871 | 18373 | 17827
3 | 16479 | 16129 | 26807 | 26584
5 | 26750 | 26241 | 44291 | 43409
7 | 36501 | 35433 | 60508 | 59379
17 | 77320 | 77451 | 130413 | 128452
27 | 91833 | 105643 | 147259 | 156833
53 | 57138 | 115793 | 119306 | 150647
83 | 44435 | 108850 | 105454 | 148006
107 | 38031 | 105199 | 95108 | 146162
139 | 31697 | 101096 | 84011 | 143281
163 | 28826 | 98255 | 78411 | 141375
191 | 26223 | 96224 | 74256 | 139646
211 | 24933 | 94815 | 71542 | 137834
239 | 23626 | 92849 | 69289 | 137235
271 | 22664 | 90938 | 66431 | 136080
307 | 21691 | 89358 | 64661 | 133166
353 | 20712 | 88239 | 61619 | 133339
397 | 20374 | 86708 | 58937 | 130684

Well, as you can see, there is some regression at low connection counts.
I don't understand where it comes from.

Moreover, it appears even with 2GB shared buffers, when all data fits
into the buffer cache and the new code isn't exercised at all.
(Apart from this incomprehensible regression, there's no difference in
performance with 2GB shared buffers.)

For example, 2GB shared buffers, 1 socket, 3 keys:
conns | master 2G | v12 2G
--------+--------------+--------------
1 | 23491 | 22621
2 | 46436 | 44851
3 | 69265 | 66844
5 | 112432 | 108801
7 | 158859 | 150247
17 | 297600 | 291605
27 | 390041 | 384590
53 | 448384 | 447588
83 | 445582 | 442048
107 | 440544 | 438200
139 | 433893 | 430818
163 | 427436 | 424182
191 | 420854 | 417045
211 | 417228 | 413456
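
To put numbers on the differences, the relative change between a master
column and the matching v12 column can be computed as below. The helper
name pct_change is just for illustration; the inputs are taken from the
tables above.

```c
#include <assert.h>

/* Relative change of the patched figure versus master, in percent. */
double pct_change(double master, double patched)
{
    assert(master != 0.0);
    return (patched - master) / master * 100.0;
}
```

For instance, pct_change(25670, 24926) (1 socket, 1 key, 128M, 1
connection) is about -2.9, pct_change(23491, 22621) (2GB, 1 socket,
3 keys, 1 connection) is about -3.7, and pct_change(20374, 86708)
(2 sockets, 10 keys, 128M, 397 connections) is about +326.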

Perhaps something changes in memory layout due to the array of CVs, or
the compiler lays out/optimizes functions differently. I can't find the
reason ;-( I would appreciate help with this.

regards

---

Yura Sokolov

Attachment Content-Type Size
v12-bufmgr-lock-improvements.patch text/x-patch 28.5 KB
image/gif 11.8 KB
image/gif 12.2 KB
image/gif 12.3 KB
image/gif 12.7 KB
image/gif 12.3 KB
image/gif 12.6 KB
