Re: Buffer Manager and Contention

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: simon(dot)riggs(at)enterprisedb(dot)com
Cc: y(dot)sokolov(at)postgrespro(dot)ru, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buffer Manager and Contention
Date: 2022-02-25 03:29:16
Message-ID: 20220225.122916.2047611000642246768.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> (I added Yura, as the author of a related patch)
>
> At Thu, 24 Feb 2022 12:58:23 +0000, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com> wrote in
> > Thinking about poor performance in the case where the data fits in
> > RAM, but the working set is too big for shared_buffers, I notice a
> > couple of things that seem bad in BufMgr, but don't understand why
> > they are like that.
> >
> > 1. If we need to allocate a buffer to a new block we do this in one
> > step, while holding both partition locks for the old and the new tag.
> > Surely it would cause less contention to make the old block/tag
> > invalid (after flushing), drop the old partition lock and then switch
> > to the new one? i.e. just hold one mapping partition lock at a time.
> > Is there a specific reason we do it this way?
>
> I'm not sure, but I guess the developer wanted to make the operation
> atomic.
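
For reference, a rough sketch of the current pattern in BufferAlloc()
(pinning, error paths, and the equal-partition case elided; not the
exact code):

    /* acquire both mapping partition locks, in address order */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);

    /* insert the new tag first ... */
    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

    /* ... the victim buffer's header is re-tagged here ... */

    /* ... then delete the old tag, still under both locks */
    BufTableDelete(&oldTag, oldHash);

    LWLockRelease(oldPartitionLock);
    LWLockRelease(newPartitionLock);
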
>
> Yura Sokolov is proposing a patch to separate the two partition locks.
>
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
>
> It seems viable to me, and the benchmarking in that thread showed a
> good result. I'd appreciate your input on that thread.
>
> > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> > we issue BufTableDelete(). That means we need extra entries in the
> > buffer mapping hash table to allow us to hold both the old and the new
> > at the same time, for a short period. The way dynahash.c works, we try
>
> Yes.
>
> > to allocate an entry from the freelist and if that doesn't work, we
> > begin searching ALL the freelists for free entries to steal. So if we
> > get enough people trying to do virtual I/O at the same time, then we
> > will hit a "freelist storm" where everybody is chasing the last few
> > entries. It would make more sense if we could do BufTableDelete()
>
> To reduce that overhead, Yura proposed a surgical modification to
> dynahash, but I didn't like it, and the latest patch doesn't contain
> that part.
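
For context, here is what dynahash does when a partitioned table's own
freelist is empty, condensed from get_hash_entry() in dynahash.c
(entry counting and retry details elided):

    /* try our own freelist first */
    newElement = hctl->freeList[freelist_idx].freeList;

    if (newElement == NULL &&
        !element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
    {
        /* no free entries and no room to allocate more: go steal */
        for (borrow_from_idx = freelist_idx;;)
        {
            borrow_from_idx = (borrow_from_idx + 1) % NUM_FREELISTS;
            if (borrow_from_idx == freelist_idx)
                break;          /* examined all freelists, truly empty */

            SpinLockAcquire(&hctl->freeList[borrow_from_idx].mutex);
            newElement = hctl->freeList[borrow_from_idx].freeList;
            if (newElement != NULL)
            {
                hctl->freeList[borrow_from_idx].freeList = newElement->link;
                SpinLockRelease(&hctl->freeList[borrow_from_idx].mutex);
                break;
            }
            SpinLockRelease(&hctl->freeList[borrow_from_idx].mutex);
        }
    }

Under concurrency, every backend that misses its own freelist ends up
in that stealing loop, which is the "freelist storm" Simon describes.
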
>
> > first, then hold onto the buffer mapping entry rather than add it to
> > the freelist, so we can use it again when we do BufTableInsert() very
> > shortly afterwards. That way we wouldn't need to search the freelist
> > at all. What is the benefit or reason of doing the Delete after the
> > Insert?
>
> Hmm, something like hash_swap_key() or hash_reinsert_entry()? That
> sounds reasonable. (Yura's proposal was to take an entry out of the
> hash, then attach it again with a new key.)
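
Note that dynahash already has something close:

    bool hash_update_hash_key(HTAB *hashp, void *existingEntry,
                              const void *newKeyPtr);

which re-keys an entry in place without it ever visiting a freelist. A
hypothetical wrapper in buf_table.c could look like this (sketch only;
BufTableReplace is a made-up name):

    void
    BufTableReplace(BufferTag *oldTagPtr, BufferTag *newTagPtr)
    {
        BufferLookupEnt *ent;

        ent = (BufferLookupEnt *) hash_search(SharedBufHash, oldTagPtr,
                                              HASH_FIND, NULL);
        if (ent == NULL)
            elog(ERROR, "shared buffer hash table corrupted");
        if (!hash_update_hash_key(SharedBufHash, ent, newTagPtr))
            elog(ERROR, "new buffer tag already present");
    }
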
>
> > Put that another way, it looks like BufTable functions are used in a
> > way that mismatches against the way dynahash is designed.
> >
> > Thoughts?
>
> On the first point, I think Yura's patch works. On the second point,
> Yura has already shown that doing so gives a certain amount of gain.

On second thought, even if we had a new dynahash API that atomically
replaces a hash key, we would still need to hold both partition locks
while doing so, which contradicts the objective of holding only one at
a time.
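
To be concrete, the mapping partition is derived from the tag's hash
(quoting buf_internals.h):

    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)
    #define BufMappingPartitionLock(hashcode) \
        (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
            BufTableHashPartition(hashcode)].lock)

so the old and new tags generally live in different partitions, and any
atomic re-keying has to lock both. hash_update_hash_key()'s own comment
says the same: for a partitioned table, the caller must hold locks on
both relevant partitions when the new key belongs to a different one.
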

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
