Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-02-21 08:06:49
Message-ID: 1a97e80353d6855e9217cd6e2052257190a98f2d.camel@postgrespro.ru
Lists: pgsql-hackers

Good day, Kyotaro Horiguchi and hackers.

On Thu, 17 Feb 2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> > Hello, all.
> >
> > I thought about patch simplification, and tested version
> > without BufTable and dynahash api change at all.
> >
> > It performs surprisingly well. It is just a bit worse
> > than v1, since there is more contention around dynahash's
> > freelist, but most of the improvement remains.
> >
> > I'll finish benchmarking and will attach graphs with
> > next message. Patch is attached here.
>
> Thanks for the new patch. The patch as a whole looks fine to me. But
> some comments need to be revised.

Thank you for the review and remarks.

>
> (existing comments)
> > * To change the association of a valid buffer, we'll need to have
> > * exclusive lock on both the old and new mapping partitions.
> ...
> > * Somebody could have pinned or re-dirtied the buffer while we were
> > * doing the I/O and making the new hashtable entry. If so, we can't
> > * recycle this buffer; we must undo everything we've done and start
> > * over with a new victim buffer.
>
> We no longer take a lock on the new partition, and at this point there
> is no new hash entry (unless someone else has already created one).

fixed

> + * Clear out the buffer's tag and flags. We must do this to ensure that
> + * linear scans of the buffer array don't think the buffer is valid. We
>
> The reason we can clear out the tag is that it is safe to use the
> victim buffer at this point. This comment needs to mention that reason.

I tried to describe it.
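
For reference, the resulting invalidation step looks roughly like this
(a sketch in the spirit of InvalidateBuffer; the exact hunk is in the
attached patch):

    /* We hold the only pin, so nobody else can be using the buffer. */
    buf_state = LockBufHdr(buf);

    /*
     * Clear the buffer's tag and flags so that linear scans of the
     * buffer array (e.g. DropRelFileNodeBuffers) don't think the
     * buffer is valid.
     */
    CLEAR_BUFFERTAG(buf->tag);
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);

    UnlockBufHdr(buf, buf_state);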

> + *
> + * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
> + * IO_IN_PROGRESS (flags that were not cleared in previous code).
> + */
> + Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
>
> It seems to be a test for potential bugs in other functions. As the
> comment says, we are sure that no other process is pinning the buffer,
> and the existing code doesn't seem to care about that condition. Is it
> really needed?

OK, I agree this check is excessive.
These two flags were not cleared in the previous code, and I don't
understand why. Probably it is just a historical accident.

>
> + /*
> + * Try to make a hashtable entry for the buffer under its new tag. This
> + * could fail because while we were writing someone else allocated another
>
> The most significant point of this patch is how the victim buffer is
> protected from being stolen until it is set up with its new tag. I
> think we need an explanation of that protection here.

I don't quite get what you mean :(. I would appreciate your suggestion
for this comment.

>
>
> + * buffer for the same block we want to read in. Note that we have not yet
> + * removed the hashtable entry for the old tag.
>
> Since we have already removed the hash table entry for the old tag at
> this point, the comment is now wrong.

Thanks, changed.

> + * the first place. First, give up the buffer we were planning to use
> + * and put it to free lists.
> ..
> + StrategyFreeBuffer(buf);
>
> This is one downside of this patch. But it seems to me that the odds
> are low that many buffers are freed in a short time by this logic. By
> the way, it would be better if the sentence starting with "First" had
> a separate comment section.

Split the comment.
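
To illustrate the path in question, the collision handling is roughly
the following (a sketch, not the exact patch text; variable names as in
BufferAlloc, and BufTableInsert returns the buf_id of an existing entry,
or -1 on successful insertion):

    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    if (buf_id >= 0)
    {
        /*
         * Somebody else already allocated a buffer for this block
         * while we were evicting ours. Give up our victim buffer:
         * return it to the freelist and use the existing buffer.
         */
        StrategyFreeBuffer(buf);
        buf = GetBufferDescriptor(buf_id);
        /* ... pin it and return, as in the ordinary found case ... */
    }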

> (existing comment)
> | * Okay, it's finally safe to rename the buffer.
>
> We don't "rename" the buffer here. And the safety is already
> established at the end of the oldPartitionLock section. So it would
> be just something like "Now allocate the victim buffer for the new
> tag"?

Changed to "Now it is safe to use victim buffer for new tag."

There is also a tiny code change in the block-reuse finalization:
instead of a LockBufHdr+UnlockBufHdr pair I use a single
atomic_fetch_or protected by WaitBufHdrUnlocked. I've tried to explain
its safety in the comments; please check it.
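
In code, the finalization is roughly the following (a sketch of the
idea only; the full safety argument is in the patch comments):

    /*
     * Publish the new state with a single atomic OR instead of a
     * LockBufHdr/UnlockBufHdr pair. If somebody happens to hold the
     * header spinlock, wait for it to be released first: UnlockBufHdr
     * writes the whole state word back, so OR-ing into a locked
     * header could lose our bits.
     */
    buf_state = pg_atomic_read_u32(&buf->state);
    if (buf_state & BM_LOCKED)
        buf_state = WaitBufHdrUnlocked(buf);

    pg_atomic_fetch_or_u32(&buf->state,
                           BM_TAG_VALID | BUF_USAGECOUNT_ONE);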

Benchmarks:
- base point is commit 6ce16088bfed97f9
- notebook with i7-1165G7 and server with Xeon 8354H (1 and 2 sockets)
- pgbench select-only, scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB
- huge pages enabled
- second-best result of five runs
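
For reference, the runs were driven with pgbench roughly like this
(flags reconstructed from the description above; the duration shown is
illustrative):

    pgbench -i -s 100 bench
    pgbench -S -c $CONNS -j $CONNS -T 30 bench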

Notebook:
conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 29508 | 29481 | 31774 | 32305
2 | 57139 | 56694 | 63393 | 62968
3 | 89759 | 90861 | 101873 | 102399
5 | 133491 | 134573 | 145451 | 145009
7 | 156739 | 155832 | 164633 | 164562
17 | 216863 | 216379 | 251923 | 251017
27 | 209532 | 209802 | 244202 | 243709
53 | 212615 | 213552 | 248107 | 250317
83 | 214446 | 218230 | 252414 | 252337
107 | 211276 | 217109 | 252762 | 250328
139 | 208070 | 214265 | 248350 | 249684
163 | 206764 | 214594 | 247369 | 250323
191 | 205478 | 213511 | 244597 | 246877
211 | 200976 | 212976 | 244035 | 245032
239 | 196588 | 211519 | 243897 | 245055
271 | 195813 | 209631 | 237457 | 242771
307 | 192724 | 208074 | 237658 | 241759
353 | 187847 | 207189 | 234548 | 239008
397 | 186942 | 205317 | 230465 | 238782

I don't know why the numbers changed compared to my first message ))
Still, there is no slowdown, and there is a measurable gain with 128MB
shared_buffers.

Xeon 1 socket

conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 41975 | 41799 | 52898 | 52715
2 | 77693 | 77531 | 97571 | 98547
3 | 114713 | 114533 | 142709 | 143579
5 | 188898 | 187241 | 239322 | 236682
7 | 261516 | 260249 | 329119 | 328562
17 | 521821 | 518981 | 672390 | 662987
27 | 555487 | 557019 | 674630 | 675703
53 | 868213 | 897097 | 1190734 | 1202575
83 | 868232 | 881705 | 1164997 | 1157764
107 | 850477 | 855169 | 1140597 | 1128027
139 | 816311 | 826756 | 1101471 | 1096197
163 | 794788 | 805946 | 1078445 | 1071535
191 | 765934 | 783209 | 1059497 | 1039936
211 | 738656 | 786171 | 1083356 | 1049025
239 | 713124 | 837040 | 1104629 | 1125969
271 | 692138 | 847741 | 1094432 | 1131968
307 | 682919 | 847939 | 1086306 | 1124649
353 | 679449 | 844596 | 1071482 | 1125980
397 | 676217 | 833009 | 1058937 | 1113496

There is a small slowdown at some connection counts (17, 107-191).
It is reproducible. Probably it is due to one more atomic write, or
perhaps to other scheduling effects (processes block less on the
buffer manager but compete more for other resources). I could not
reliably determine the cause, because the difference is too small and
`perf record` itself hurts performance more than that at this point.

This is the reason I changed the finalization to an atomic_or instead
of a LockBufHdr+UnlockBufHdr pair. The change helped a bit, but did
not remove the slowdown completely.

Xeon 2 socket

conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 44317 | 43747 | 53920 | 53759
2 | 81193 | 79976 | 99138 | 99213
3 | 120755 | 114481 | 148102 | 146494
5 | 190007 | 187384 | 232078 | 229627
7 | 258602 | 256657 | 325545 | 322417
17 | 551814 | 549041 | 692312 | 688204
27 | 787353 | 787916 | 1023509 | 1020995
53 | 973880 | 996019 | 1228274 | 1246128
83 | 1108442 | 1258589 | 1596292 | 1662586
107 | 1072188 | 1317229 | 1542401 | 1684603
139 | 1000446 | 1272759 | 1490757 | 1672507
163 | 967378 | 1224513 | 1461468 | 1660012
191 | 926010 | 1178067 | 1435317 | 1645886
211 | 909919 | 1148862 | 1417437 | 1629487
239 | 895944 | 1108579 | 1393530 | 1616824
271 | 880545 | 1078280 | 1374878 | 1608412
307 | 865560 | 1056988 | 1355164 | 1601066
353 | 857591 | 1033980 | 1330069 | 1586769
397 | 840374 | 1016690 | 1312257 | 1573376

regards,
Yura Sokolov
Postgres Professional
y(dot)sokolov(at)postgrespro(dot)ru
funny(dot)falcon(at)gmail(dot)com

Attachment Content-Type Size
v3-0001-PGPRO-5616-bufmgr-do-not-acquire-two-partition-lo.patch text/x-patch 9.5 KB
image/gif 13.8 KB
image/gif 14.1 KB
image/gif 12.2 KB
res.zip application/zip 46.1 KB
