Re: [Patch] Optimize dropping of relation buffers using dlist

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-12-21 13:25:11
Message-ID: CAA4eK1+aeBAKURe9Y8yZPFJkLp63_t8myuagJJZu+teSc9zk1g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 19, 2020 at 12:37 PM tsunakawa(dot)takay(at)fujitsu(dot)com
<tsunakawa(dot)takay(at)fujitsu(dot)com> wrote:
>
> From: Andres Freund <andres(at)anarazel(dot)de>
>
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > +                              int nforks, BlockNumber *nForkBlocks,
> > +                              BlockNumber *firstDelBlock)
> > ...
> > +        /* Check that it is in the buffer pool. If not, do nothing. */
> > +        LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +        buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +        bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +        buf_state = LockBufHdr(bufHdr);
> > +
> > +        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > +            bufHdr->tag.forkNum == forkNum[i] &&
> > +            bufHdr->tag.blockNum >= firstDelBlock[i])
> > +            InvalidateBuffer(bufHdr);    /* releases spinlock */
> > +        else
> > +            UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock, and
> > have done a lookup in the mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum? And why are we doing so holding the buffer header
> > lock, which is essentially a spinlock, so should only ever be held for very short
> > portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc, but there
> > the situation is different: We haven't done a buffer mapping lookup, and we
> > don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after the hash table has been looked up. As an aside, InvalidateBuffer() requires the caller to hold the buffer header spinlock and doesn't hold the buffer partition lock.
>

This answers the second part of the question, but what about the first
part (we hold a buffer partition lock and have done a lookup in the
mapping table; why are we then rechecking the
relfilenode/fork/blocknum)?

I think we don't need such a check; rather, we can have an Assert
corresponding to that if-condition in the patch. I understand it is safe
to compare the relfilenode/fork/blocknum, but the recheck might confuse
readers of the code.
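
For concreteness, a minimal sketch of that suggestion against the
snippet quoted above (illustrative only; it reuses the patch's variable
names and keeps the lock ordering you described, with the partition lock
released right after the lookup):

    /* Check that it is in the buffer pool. If not, do nothing. */
    LWLockAcquire(bufPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&bufTag, bufHash);
    LWLockRelease(bufPartitionLock);

    if (buf_id < 0)
        continue;               /* block not in the buffer pool */

    bufHdr = GetBufferDescriptor(buf_id);

    buf_state = LockBufHdr(bufHdr);

    /* Assert the former if-condition instead of rechecking it. */
    Assert(RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
           bufHdr->tag.forkNum == forkNum[i] &&
           bufHdr->tag.blockNum >= firstDelBlock[i]);

    InvalidateBuffer(bufHdr);   /* releases spinlock */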

I have started doing minor edits to the patch, especially planning to
write down a theory of why this optimization is safe, and here is what I
can come up with: "To remove all the pages of the specified relation
forks from the buffer pool, we need to scan the entire buffer pool, but
we can optimize that by looking up the buffers in the BufMapping table,
provided we know the exact size of each fork of the relation. The exact
size is required to ensure that we don't leave behind any buffer for the
relation being dropped, as otherwise the background writer or
checkpointer can hit a PANIC error while flushing buffers corresponding
to files that don't exist.

To know the exact size, we rely on the size we cache for each fork
during recovery, which limits the optimization to recovery and to
standbys, but we can easily extend it once we have a shared cache for
relation sizes.

In recovery, we cache the value returned by the first lseek(SEEK_END),
and future writes keep the cached value up-to-date. See smgrextend. It
is possible that the value of the first lseek is smaller than the actual
number of existing blocks in the file due to buggy Linux kernels that
might not have accounted for a recent write. But that should be fine
because there must not be any buffers beyond that file size.

XXX We would make the extra lseek call on the unoptimized paths too, but
that is okay because we do it just for the first fork, and we anyway
have to scan the entire buffer pool, the cost of which is so high that
the extra lseek call won't make any visible difference. We could use the
InRecovery flag to avoid the additional cost, but that doesn't seem
worth it."
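
To make the size-caching concrete, here is a minimal sketch of the
pattern the comment describes (the field name smgr_cached_nblocks is
assumed here for illustration; the actual patch may name things
differently):

    BlockNumber
    smgrnblocks(SMgrRelation reln, ForkNumber forknum)
    {
        BlockNumber result;

        /* In recovery, trust the cached size if we have one. */
        if (InRecovery &&
            reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];

        /* First access: ask the storage manager (lseek(SEEK_END) in md.c). */
        result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

        /* Remember it so later calls during recovery can skip the lseek. */
        if (InRecovery)
            reln->smgr_cached_nblocks[forknum] = result;

        return result;
    }

smgrextend() would then bump smgr_cached_nblocks[forknum] to blocknum + 1
whenever it writes the block just past the cached size, which is what
keeps the cached value up-to-date.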

Thoughts?

--
With Regards,
Amit Kapila.
