Re: [Patch] Optimize dropping of relation buffers using dlist

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-11-18 18:13:42
Message-ID: 20201118181342.h65kc3eqluzwfrgn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-11-18 17:34:31 +0530, Amit Kapila wrote:
> Yeah, that won't be a bad idea especially because the patch being
> discussed in the thread you referred is still in an exploratory phase.
> I haven't tested or done a detailed review but I feel there shouldn't
> be many problems if we agree on the approach.
>
> Thomas/others, do you have objections to proceeding here? It shouldn't
> be a big problem to change the code in this area even if we get the
> shared relation size stuff in.

I'm doubtful the patches as is are a good idea / address the correctness
concerns to a sufficient degree.

One important part of that is that the patch includes pretty much zero
explanations about why it is safe what it is doing. Something having
being discussed deep in this thread won't help us in a few months, not
to speak of a few years.

The commit message says:
> While recovery, we can get a reliable cached value of nblocks for
> supplied relation's fork, and it's safe because there are no other
> processes but the startup process that changes the relation size
> during recovery.

and the code only applies the optimized scan only when cached:
+ /*
+ * Look up the buffers in the hashtable and drop them if the block size
+ * is already cached and the total blocks to be invalidated is below the
+ * full scan threshold. Otherwise, give up the optimization.
+ */
+ if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

This seems quite narrow to me. There's plenty cases where there's no
cached relation size in the startup process, restricting the
availability of this optimization as written. Where do we even use
DropRelFileNodeBuffers() in recovery? The most common path is
DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
which 3/4 doesn't address and 4/4 doesn't mention.

4/4 seems to address DropRelationFiles(), but only talks about TRUNCATE?

I'm also worried about the cases where this could cause buffers left in
the buffer pool, without a crosscheck like Thomas' patch would allow to
add. Obviously other processes can dirty buffers in hot_standby, so any
leftover buffer could have bad consequences.

I also don't get why 4/4 would be a good idea on its own. It uses
BUF_DROP_FULL_SCAN_THRESHOLD to guard FindAndDropRelFileNodeBuffers() on
a per relation basis. But since DropRelFileNodesAllBuffers() can be used
for many relations at once, this could end up doing
BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot of times, once for each
of nnodes relations?

Also, how is 4/4 safe - this is outside of recovery too?

Smaller comment:

+static void
+FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
+ BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
...
+ /* Check that it is in the buffer pool. If not, do nothing. */
+ LWLockAcquire(bufPartitionLock, LW_SHARED);
+ buf_id = BufTableLookup(&bufTag, bufHash);
...
+ bufHdr = GetBufferDescriptor(buf_id);
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+ bufHdr->tag.forkNum == forkNum[i] &&
+ bufHdr->tag.blockNum >= firstDelBlock[i])
+ InvalidateBuffer(bufHdr); /* releases spinlock */
+ else
+ UnlockBufHdr(bufHdr, buf_state);
a

I'm a bit confused about the check here. We hold a buffer partition
lock, and have done a lookup in the mapping table. Why are we then
rechecking the relfilenode/fork/blocknum? And why are we doing so
holding the buffer header lock, which is essentially a spinlock, so
should only ever be held for very short portions?

This looks like it's copying logic from DropRelFileNodeBuffers() etc,
but there the situation is different: We haven't done a buffer mapping
lookup, and we don't hold a partition lock!

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2020-11-18 18:26:46 Re: cutting down the TODO list thread
Previous Message Andrew Dunstan 2020-11-18 18:01:09 Re: Allow matching whole DN from a client certificate