Re: BufferAlloc: don't take two simultaneous locks

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-21 20:24:24
Message-ID: CA+TgmobhUv-zotvjpZK2u-nz1H2W9PdR3BSpfjpWK=ugj4LRyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
> $1 = 16512
>
> $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> ...
> $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
> $1 = 20439
>
> $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> ...
> $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
>
> $1 = 20541
>
> It stabilizes at 20541

Hmm. So is the existing comment incorrect? Remember, I was complaining
about this change:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
*
* Since we can't tolerate running out of lookup table entries, we must be
* sure to specify an adequate table size here. The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old. In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
*/
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

Pre-patch, the comment claims that the maximum number of buffer
entries that can be simultaneously used is limited to NBuffers +
NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
size. The idea is that we normally need more than 1 entry per buffer,
but sometimes we might have 2 entries for the same buffer if we're in
the process of changing the buffer tag, because we make the new entry
before removing the old one. To change the buffer tag, we need the
buffer mapping lock for the old partition and the new one, but if both
are the same, we need only one buffer mapping lock. That means that in
the worst case, you could have a number of processes equal to
NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
between values that both fall into the same partition, and thus each
using 2 entries. Then you could have every other buffer in use and
thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
entries. Now I think you're saying we go far beyond that number, and
what I wonder is how that's possible. If the system doesn't work the
way the comment says it does, maybe we ought to start by talking about
what to do about that.

I am a bit confused by your description of having done "p
SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
HTAB and HTAB's hctl member is of type HASHHDR, which has no field
called "allocated". I thought maybe my analysis here was somehow
mistaken, so I tried the debugger, which took the same view of it that
I did:

(lldb) p SharedBufHash->hctl->allocated.value
error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
SharedBufHash->hctl->allocated.value
~~~~~~~~~~~~~~~~~~~ ^

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2022-04-21 20:25:13 Re: Replace open mode with PG_BINARY_R/W/A macros
Previous Message Tom Lane 2022-04-21 20:24:04 Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)