Re: ReadRecentBuffer() doesn't scale well

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: ReadRecentBuffer() doesn't scale well
Date: 2023-06-27 04:09:31
Message-ID: 20230627040931.km2ek55e2ri4a3rv@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Unfortunately it scaled way worse at first. This is not an inherent issue, but
> > due to an implementation choice in ReadRecentBuffer(). Whereas the normal
> > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > LockBufHdr(), checks if the buffer ID is the same and then uses
> > PinBuffer_Locked().
> >
> > The problem with that is that PinBuffer() takes care to not hold the buffer
> > header spinlock, it uses compare_exchange to atomically acquire the pin, while
> > guaranteing nobody holds the lock. When holding the buffer header spinlock,
> > there obviously is the risk of being scheduled out (or even just not have
> > exclusive access to the cacheline).
>
> Yeah. Aside from inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.

> > The fairly obvious solution to this is to just use PinBuffer() and just unpin
> > the buffer if its identity was changed concurrently. There could be an
> > unlocked pre-check as well. However, there's the following comment in
> > ReadRecentBuffer():
> > * It's now safe to pin the buffer. We can't pin first and ask
> > * questions later, because it might confuse code paths like
> > * InvalidateBuffer() if we pinned a random non-matching buffer.
> > */
> >
> > But I'm not sure I buy that - there's plenty other things that can briefly
> > acquire a buffer pin (e.g. checkpointer, reclaiming the buffer for other
> > contents, etc).
>
> I may well have been too cautious with that. The worst thing I can
> think of right now is that InvalidateBuffer() would busy loop (as it
> already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.

> > Another difference between using PinBuffer() and PinBuffer_locked() is that
> > the latter does not adjust a buffer's usagecount.
> >
> > Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
> > codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
> > increasing usagecount anymore?
>
> Yeah, that is not great. The simplification you suggest would fix
> that too, though I guess it would also bump the usage count of buffers
> that don't have the tag we expected; that's obviously rare and erring
> on a better side though.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.

> > FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
> > the root page's buffer id in RelationData, seems a noticeable win. About 7% in
> > a concurrent, read-only pgbench that utilizes batches of 10. And it should be
> > easy to get much bigger wins, e.g. with a index nested loop with a relatively
> > small index on the inner side.
>
> Wooo, that's better than I was hoping. Thanks for trying it out! I
> think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.

> smgr_targblock could be another easy-to-cache candidate, ie a place where
> there is a single interesting hot page that we're already keeping track of
> with no requirement for new backend-local mapping machinery.

I wonder if we should simple add a generic field for such a Buffer to
RelationData, that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...

> it's a nice result, and worth considering even though it's also a solid clue
> that we could do much better than this with a (yet to be designed)
> longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2023-06-27 04:12:28 Re: Improving btree performance through specializing by key shape, take 2
Previous Message Jaime Casanova 2023-06-27 04:05:43 Assert !bms_overlap(joinrel->relids, required_outer)