Re: BufferAlloc: don't take two simultaneous locks

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, michail(dot)nikolaev(at)gmail(dot)com, x4mmm(at)yandex-team(dot)ru, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BufferAlloc: don't take two simultaneous locks
Date: 2022-02-25 09:51:22
Message-ID: c1715ad78880399140f7a1ff9b11b80fd9b5856f.camel@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, Andres

В Пт, 25/02/2022 в 00:04 -0800, Andres Freund пишет:
> Hi,
>
> On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> > From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> > From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
> > Date: Mon, 21 Feb 2022 08:49:03 +0300
> > Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> >
> > Acquiring two partition locks leads to complex dependency chain that hurts
> > at high concurrency level.
> >
> > There is no need to hold both lock simultaneously. Buffer is pinned so
> > other processes could not select it for eviction. If tag is cleared and
> > buffer removed from old partition other processes will not find it.
> > Therefore it is safe to release old partition lock before acquiring
> > new partition lock.
>
> Yes, the current design is pretty nonsensical. It leads to really absurd stuff
> like holding the relation extension lock while we write out old buffer
> contents etc.
>
>
>
> > + * We have pinned buffer and we are single pinner at the moment so there
> > + * is no other pinners.
>
> Seems redundant.
>
>
> > We hold buffer header lock and exclusive partition
> > + * lock if tag is valid. Given these statements it is safe to clear tag
> > + * since no other process can inspect it to the moment.
> > + */
>
> Could we share code with InvalidateBuffer here? It's not quite the same code,
> but nearly the same.
>
>
> > + * The usage_count starts out at 1 so that the buffer can survive one
> > + * clock-sweep pass.
> > + *
> > + * We use direct atomic OR instead of Lock+Unlock since no other backend
> > + * could be interested in the buffer. But StrategyGetBuffer,
> > + * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> > + * compare tag, and UnlockBufHdr does raw write to state. So we have to
> > + * spin if we found buffer locked.
>
> So basically the first half of of the paragraph is wrong, because no, we
> can't?

Logically, there are no backends that could be interesting in the buffer.
Physically they do LockBufHdr/UnlockBufHdr just to check they are not interesting.

> > + * Note that we write tag unlocked. It is also safe since there is always
> > + * check for BM_VALID when tag is compared.
>
>
> > */
> > buf->tag = newTag;
> > - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> > - BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> > - BUF_USAGECOUNT_MASK);
> > if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> > - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> > else
> > - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> > -
> > - UnlockBufHdr(buf, buf_state);
> > + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> >
> > - if (oldPartitionLock != NULL)
> > + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> > + while (unlikely(buf_state & BM_LOCKED))
>
> I don't think it's safe to atomic in arbitrary bits. If somebody else has
> locked the buffer header in this moment, it'll lead to completely bogus
> results, because unlocking overwrites concurrently written contents (which
> there shouldn't be any, but here there are)...

That is why there is safety loop in the case buf->state were locked just
after first optimistic atomic_fetch_or. 99.999% times this loop will not
have a job. But in case other backend did lock buf->state, loop waits
until it releases lock and retry atomic_fetch_or.

> And or'ing contents in also doesn't make sense because we it doesn't work to
> actually unset any contents?

Sorry, I didn't understand sentence :((

> Why don't you just use LockBufHdr/UnlockBufHdr?

This pair makes two atomic writes to memory. Two writes are heavier than
one write in this version (if optimistic case succeed).

But I thought to use Lock+UnlockBuhHdr instead of safety loop:

buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
if (unlikely(buf_state & BM_LOCKED))
{
buf_state = LockBufHdr(&buf->state);
UnlockBufHdr(&buf->state, buf_state | new_bits);
}

I agree this way code is cleaner. Will do in next version.

-----

regards,
Yura Sokolov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2022-02-25 10:16:30 Re: logical replication empty transactions
Previous Message Simon Riggs 2022-02-25 09:38:36 Re: BufferAlloc: don't take two simultaneous locks