Re: BufferAlloc: don't take two simultaneous locks

From: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-02-25 04:35:49
Message-ID: CANbhV-EdmOjAzXVosn6RX35g2K6xJ1OThRtQBpsRfEHYxC9rNg@mail.gmail.com
Lists: pgsql-hackers

On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>
> Good day, Kyotaro Horiguchi and hackers.
>
> > On Thu, 17 Feb 2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> > > Hello, all.
> > >
> > > I thought about patch simplification, and tested version
> > > without BufTable and dynahash api change at all.
> > >
> > > It performs surprisingly well. It is just a bit worse
> > > than v1 since there is more contention around dynahash's
> > > freelist, but most of the improvement remains.
> > >
> > > I'll finish benchmarking and will attach graphs with
> > > next message. Patch is attached here.
> >
> > Thanks for the new patch. The patch as a whole looks fine to me. But
> > some comments need to be revised.
>
> Thank you for review and remarks.

v3 gets the buffer partition locking right, well done, great results!

In v3, the comment at line 1279 still implies we take both locks
together, which is no longer the case.

Concurrent dynahash actions are still possible in the window between
the two calls. You now have the BufTableDelete before the
BufTableInsert, which opens up the possibility I discussed here:
http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
(Apologies for raising a similar topic, I hadn't noticed this thread
before; thanks to Horiguchi-san for pointing this out).

v1 had a horrible API (sorry!) where you returned the entry and then
explicitly re-used it. I think we *should* make changes to dynahash,
but not with the API you proposed.

Proposal for new BufTable API:

* BufTableReuse() - similar to BufTableDelete(), but does NOT put the
entry back on the freelist; instead we remember it in a private
single-item cache inside dynahash.
* BufTableAssign() - similar to BufTableInsert(), but can only be
executed directly after BufTableReuse(), failing with ERROR otherwise.
Takes the entry from the single-item cache and re-assigns it to the
new tag.
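The call sequence in BufferAlloc would then hold only one partition
lock at a time. A minimal sketch of that ordering contract (all names
and state here are placeholders standing in for the real bufmgr locks
and BufTable calls, not actual PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder state standing in for the old and new buffer-mapping
 * partition locks and the proposed private entry cache. */
static bool old_lock_held, new_lock_held, entry_cached;

/* Stand-ins for LWLockAcquire/LWLockRelease on a partition lock. */
static void lock_acquire(bool *lock) { *lock = true; }
static void lock_release(bool *lock) { *lock = false; }

/* BufTableReuse: delete the old tag's entry but keep it cached privately. */
static void BufTableReuse(void)
{
    assert(old_lock_held);          /* caller holds the old partition lock */
    entry_cached = true;
}

/* BufTableAssign: re-key the cached entry; only valid after BufTableReuse. */
static void BufTableAssign(void)
{
    assert(new_lock_held);          /* caller holds the new partition lock */
    assert(entry_cached);           /* ERROR in the real API otherwise */
    entry_cached = false;
}

static void replace_victim(void)
{
    lock_acquire(&old_lock_held);   /* only the old partition lock */
    BufTableReuse();                /* entry off the table, not the freelist */
    lock_release(&old_lock_held);

    lock_acquire(&new_lock_held);   /* then only the new partition lock */
    BufTableAssign();               /* cached entry re-keyed to the new tag */
    lock_release(&new_lock_held);
}
```

The point of the sketch is that the two locks are never held
simultaneously, and the entry never visits the freelist in between.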

In dynahash we have two new modes that match the above:

* HASH_REUSE - used by BufTableReuse(); similar to HASH_REMOVE, but
places the entry in the single-item cache instead of on the freelist.
* HASH_ASSIGN - used by BufTableAssign(); similar to HASH_ENTER, but
uses the entry from the single-item cache rather than asking the
freelist. This call can fail if someone else has already inserted the
tag, in which case it puts the single-item cache entry back onto the
freelist.
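To make the two modes concrete, here is a toy model of the reuse
cache inside a chained hash table. Only the names HASH_REUSE and
HASH_ASSIGN come from the proposal; the table, entry layout, and
counter are illustrative, not dynahash internals:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Entry {
    int tag;                /* stands in for a BufferTag */
    int payload;            /* stands in for a buffer id */
    struct Entry *next;
} Entry;

#define NBUCKETS 8

static Entry *buckets[NBUCKETS];
static Entry *freelist;
static Entry *reuse_cache;  /* the private single-item cache */
static int freelist_ops;    /* counts freelist touches */

static Entry **lookup(int tag)
{
    Entry **p = &buckets[tag % NBUCKETS];
    while (*p && (*p)->tag != tag)
        p = &(*p)->next;
    return p;
}

/* HASH_REUSE: unlink the entry but park it in the cache, not the freelist. */
static void hash_reuse(int tag)
{
    Entry **p = lookup(tag);
    assert(*p != NULL && reuse_cache == NULL);
    reuse_cache = *p;
    *p = (*p)->next;
}

/* HASH_ASSIGN: insert under a new tag using the cached entry.  If the
 * tag already exists, the cached entry falls back to the freelist. */
static Entry *hash_assign(int new_tag, int *found)
{
    Entry **p = lookup(new_tag);
    assert(reuse_cache != NULL);
    if (*p)                         /* someone else inserted the tag */
    {
        reuse_cache->next = freelist;
        freelist = reuse_cache;
        freelist_ops++;
        reuse_cache = NULL;
        *found = 1;
        return *p;
    }
    Entry *e = reuse_cache;
    reuse_cache = NULL;
    e->tag = new_tag;
    e->next = buckets[new_tag % NBUCKETS];
    buckets[new_tag % NBUCKETS] = e;
    *found = 0;
    return e;
}
```

In the common path (old tag out, new tag in, no collision) the
freelist counter never moves; only the losing side of a concurrent
insert pays a freelist operation.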

Notice that the single-item cache is not in shared memory, so on abort
we should give the entry back; we probably need an extra API call for
that as well, to avoid leaking an entry.

Doing it this way allows us to:
* avoid touching the freelists altogether in the common path - we know
we are about to reassign the entry, so we simply remember it: no
contention from other backends, no borrowing etc.
* avoid exposing dynahash's private details outside the module
* use the same technique elsewhere that we have partitioned hash tables

This approach is cleaner than v1, but should also perform better
because there will be a 1:1 relationship between a buffer and its
dynahash entry, most of the time.

With these changes, I think we will be able to *reduce* the number of
freelists for partitioned dynahash from 32 to maybe 8, as originally
speculated by Robert in 2016:
https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
since the freelists will be much less contended with the above approach.

It would be useful to see performance with a higher number of connections, >400.

--
Simon Riggs http://www.EnterpriseDB.com/
