Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-21 22:58:07
Message-ID: 3222ea2f83d59fdb1c765cc8fc22271163e0a3dc.camel@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

В Чт, 21/04/2022 в 16:24 -0400, Robert Haas пишет:
> On Thu, Apr 21, 2022 at 5:04 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> > $ pid=`ps x | awk '/checkpointer/ && !/awk/ { print $1 }'`
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 16512
> >
> > $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> > ...
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 20439
> >
> > $ install/bin/pgbench -c 600 -j 800 -T 10 -P 1 -S -M prepared postgres
> > ...
> > $ gdb -p $pid -batch -ex 'p SharedBufHash->hctl->allocated.value'
> >
> > $1 = 20541
> >
> > It stabilizes at 20541
>
> Hmm. So is the existing comment incorrect?

It is correct and incorrect at the same time. Logically it is correct.
And it is correct in practice if HASH_FIXED_SIZE is set for SharedBufHash
(which is not currently). But setting HASH_FIXED_SIZE hurts performance
with low number of spare items.

> Remember, I was complaining
> about this change:
>
> --- a/src/backend/storage/buffer/freelist.c
> +++ b/src/backend/storage/buffer/freelist.c
> @@ -481,10 +481,10 @@ StrategyInitialize(bool init)
> *
> * Since we can't tolerate running out of lookup table entries, we must be
> * sure to specify an adequate table size here. The maximum steady-state
> - * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> - * a new entry before deleting the old. In principle this could be
> - * happening in each partition concurrently, so we could need as many as
> - * NBuffers + NUM_BUFFER_PARTITIONS entries.
> + * usage is of course NBuffers entries. But due to concurrent
> + * access to numerous free lists in dynahash we can miss free entry that
> + * moved between free lists. So it is better to have some spare free entries
> + * to reduce probability of entry allocations after server start.
> */
> InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
>
> Pre-patch, the comment claims that the maximum number of buffer
> entries that can be simultaneously used is limited to NBuffers +
> NUM_BUFFER_PARTITIONS, and that's why we make the hash table that
> size. The idea is that we normally need more than 1 entry per buffer,
> but sometimes we might have 2 entries for the same buffer if we're in
> the process of changing the buffer tag, because we make the new entry
> before removing the old one. To change the buffer tag, we need the
> buffer mapping lock for the old partition and the new one, but if both
> are the same, we need only one buffer mapping lock. That means that in
> the worst case, you could have a number of processes equal to
> NUM_BUFFER_PARTITIONS each in the process of changing the buffer tag
> between values that both fall into the same partition, and thus each
> using 2 entries. Then you could have every other buffer in use and
> thus using 1 entry, for a total of NBuffers + NUM_BUFFER_PARTITIONS
> entries. Now I think you're saying we go far beyond that number, and
> what I wonder is how that's possible. If the system doesn't work the
> way the comment says it does, maybe we ought to start by talking about
> what to do about that.

At the master state:
- SharedBufHash is not declared as HASH_FIXED_SIZE
- get_hash_entry falls back to element_alloc too fast (just if it doesn't
found free entry in current freelist partition).
- get_hash_entry has races.
- if there are small number of spare items (and NUM_BUFFER_PARTITIONS is
small number) and HASH_FIXED_SIZE is set, it becomes contended and
therefore slow.

HASH_REUSE solves (for shared buffers) most of this issues. Free list
became rare fallback, so HASH_FIXED_SIZE for SharedBufHash doesn't lead
to performance hit. And with fair number of spare items, get_hash_entry
will find free entry despite its races.

> I am a bit confused by your description of having done "p
> SharedBufHash->hctl->allocated.value" because SharedBufHash is of type
> HTAB and HTAB's hctl member is of type HASHHDR, which has no field
> called "allocated".

Previous letter contains links to small patches that I used for
experiments. Link that adds "allocated" is https://pastebin.com/c5z0d5mz

> I thought maybe my analysis here was somehow
> mistaken, so I tried the debugger, which took the same view of it that
> I did:
>
> (lldb) p SharedBufHash->hctl->allocated.value
> error: <user expression 0>:1:22: no member named 'allocated' in 'HASHHDR'
> SharedBufHash->hctl->allocated.value
> ~~~~~~~~~~~~~~~~~~~ ^

-----

regards

Yura Sokolov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-04-21 23:28:01 Re: More problems with VacuumPageHit style global variables
Previous Message Tom Lane 2022-04-21 21:59:33 Re: Assert failure in CTE inlining with view and correlated subquery