Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, michail(dot)nikolaev(at)gmail(dot)com, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-02-25 09:24:40
Message-ID: bcc57fce3a7623fb42f6baaf53b1d3b1df532542.camel@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, Simon.

В Пт, 25/02/2022 в 04:35 +0000, Simon Riggs пишет:
> On Mon, 21 Feb 2022 at 08:06, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> > Good day, Kyotaro Horiguchi and hackers.
> >
> > В Чт, 17/02/2022 в 14:16 +0900, Kyotaro Horiguchi пишет:
> > > At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote in
> > > > Hello, all.
> > > >
> > > > I thought about patch simplification, and tested version
> > > > without BufTable and dynahash api change at all.
> > > >
> > > > It performs suprisingly well. It is just a bit worse
> > > > than v1 since there is more contention around dynahash's
> > > > freelist, but most of improvement remains.
> > > >
> > > > I'll finish benchmarking and will attach graphs with
> > > > next message. Patch is attached here.
> > >
> > > Thanks for the new patch. The patch as a whole looks fine to me. But
> > > some comments needs to be revised.
> >
> > Thank you for review and remarks.
>
> v3 gets the buffer partition locking right, well done, great results!
>
> In v3, the comment at line 1279 still implies we take both locks
> together, which is not now the case.
>
> Dynahash actions are still possible. You now have the BufTableDelete
> before the BufTableInsert, which opens up the possibility I discussed
> here:
> http://postgr.es/m/CANbhV-F0H-8oB_A+m=55hP0e0QRL=RdDDQuSXMTFt6JPrdX+pQ@mail.gmail.com
> (Apologies for raising a similar topic, I hadn't noticed this thread
> before; thanks to Horiguchi-san for pointing this out).
>
> v1 had a horrible API (sorry!) where you returned the entry and then
> explicitly re-used it. I think we *should* make changes to dynahash,
> but not with the API you proposed.
>
> Proposal for new BufTable API
> BufTableReuse() - similar to BufTableDelete() but does NOT put entry
> back on freelist, we remember it in a private single item cache in
> dynahash
> BufTableAssign() - similar to BufTableInsert() but can only be
> executed directly after BufTableReuse(), fails with ERROR otherwise.
> Takes the entry from single item cache and re-assigns it to new tag
>
> In dynahash we have two new modes that match the above
> HASH_REUSE - used by BufTableReuse(), similar to HASH_REMOVE, but
> places entry on the single item cache, avoiding freelist
> HASH_ASSIGN - used by BufTableAssign(), similar to HASH_ENTER, but
> uses the entry from the single item cache, rather than asking freelist
> This last call can fail if someone else already inserted the tag, in
> which case it adds the single item cache entry back onto freelist
>
> Notice that single item cache is not in shared memory, so on abort we
> should give it back, so we probably need an extra API call for that
> also to avoid leaking an entry.

Why there is need for this? Which way backend could be forced to abort
between BufTableReuse and BufTableAssign in this code path? I don't
see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
something.

>
> Doing it this way allows us to
> * avoid touching freelists altogether in the common path - we know we
> are about to reassign the entry, so we do remember it - no contention
> from other backends, no borrowing etc..
> * avoid sharing the private details outside of the dynahash module
> * allows us to use the same technique elsewhere that we have
> partitioned hash tables
>
> This approach is cleaner than v1, but should also perform better
> because there will be a 1:1 relationship between a buffer and its
> dynahash entry, most of the time.

Thank you for suggestion. Yes, it is much clearer than my initial proposal.

Should I incorporate it to v4 patch? Perhaps, it could be a separate
commit in new version.

>
> With these changes, I think we will be able to *reduce* the number of
> freelists for partitioned dynahash from 32 to maybe 8, as originally
> speculated by Robert in 2016:
> https://www.postgresql.org/message-id/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
> since the freelists will be much less contended with the above approach
>
> It would be useful to see performance with a higher number of connections, >400.
>
> --
> Simon Riggs http://www.EnterpriseDB.com/

------

regards,
Yura Sokolov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2022-02-25 09:34:58 Re: Minimal logical decoding on standbys
Previous Message Kyotaro Horiguchi 2022-02-25 09:14:33 Re: BufferAlloc: don't take two simultaneous locks