Re: BufferAlloc: don't take two simultaneous locks

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-04-14 13:46:03
Message-ID: CA+TgmoYTpVO=PoHCay77OAmRf3Kx734-6pKMrEAGw6DRjJ2oMQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 6, 2022 at 9:17 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> I skipped v10 since I used it internally for variant
> "insert entry with dummy index then search victim".

Hi,

I think there's a big problem with this patch:

--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
*
* Since we can't tolerate running out of lookup table entries, we must be
* sure to specify an adequate table size here. The maximum steady-state
- * usage is of course NBuffers entries, but BufferAlloc() tries to insert
- * a new entry before deleting the old. In principle this could be
- * happening in each partition concurrently, so we could need as many as
- * NBuffers + NUM_BUFFER_PARTITIONS entries.
+ * usage is of course NBuffers entries. But due to concurrent
+ * access to numerous free lists in dynahash we can miss free entry that
+ * moved between free lists. So it is better to have some spare free entries
+ * to reduce probability of entry allocations after server start.
*/
InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

With the existing system, there is a hard cap on the number of hash
table entries that we can ever need: one per buffer, plus one per
partition to cover the "extra" entries that are needed while changing
buffer tags. With the patch, the number of concurrent buffer tag
changes is no longer limited by NUM_BUFFER_PARTITIONS, because you
release the lock on the old buffer partition before acquiring the lock
on the new partition, and therefore there can be any number of
backends trying to change buffer tags at the same time. But that
means, as the comment implies, that there's no longer a hard cap on
how many hash table entries we might need. I don't think we can just
accept the risk that the hash table might try to allocate after
startup. If it tries, it might fail, because all of the extra shared
memory that we allocate at startup may already have been consumed, and
then somebody's query may randomly error out. That's not OK. It's true
that very few users are likely to be affected, because most people
won't consume the extra shared memory, and of those who do, most won't
hammer the system hard enough to cause an error.

However, I don't see us deciding that it's OK to ship something that
could randomly break just because it won't do so very often.

--
Robert Haas
EDB: http://www.enterprisedb.com
