RE: [Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-10-05 01:29:07
Message-ID: OSBPR01MB234158410D48D1FE369F1C4AEF0C0@OSBPR01MB2341.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Friday, October 2, 2020 11:45 AM, Horiguchi-san wrote:

> Thaks for the new version.

Thank you for your thoughtful reviews!
I've attached an updated patch addressing the comments below.

1.
> The following description is found in the comment for FlushRelationBuffers.
>
> > * XXX currently it sequentially searches the buffer pool, should
> be
> > * changed to more clever ways of searching. This routine is
> not
> > * used in any performance-critical code paths, so it's not worth
> > * adding additional overhead to normal paths to make it go
> faster;
> > * but see also DropRelFileNodeBuffers.
>
> This looks like to me "We won't do that kind of optimization for
> FlushRelationBuffers, but DropRelFileNodeBuffers would need it". If so,
> don't we need to revise the comment together?

Yes, but instead of combining them, I just removed the part of the FlushRelationBuffers comment
that refers to DropRelFileNodeBuffers. I think it meant the same thing: use a more clever way of
searching. That reference is no longer applicable to DropRelFileNodeBuffers because of the optimization.
- * adding additional overhead to normal paths to make it go faster;
- * but see also DropRelFileNodeBuffers.
+ * adding additional overhead to normal paths to make it go faster.

2.
> - * XXX currently it sequentially searches the buffer pool, should be
> - * changed to more clever ways of searching. However, this routine
> - * is used only in code paths that aren't very performance-critical,
> - * and we shouldn't slow down the hot paths to make it faster ...

I revised this comment and removed most of it from DropRelFileNodeBuffers, since the whole point
of the optimization is to make that path faster for the performance cases we've discussed in this
thread.

3.
> This should no longer be a XXX comment.
Alright. I've fixed it.

4.
> It seems to me somewhat
> describing too-detailed at this function's level. How about something like the
> follwoing? (excpet its syntax, or phrasing:p)
> ====
> If we are expected to drop buffers less enough, we locate individual buffers
> using BufTableLookup. Otherwise we scan through all buffers. Since we
> mustn't leave a buffer behind, we take the latter way unless the sizes of all the
> involved forks are known to be accurte. See smgrcachednblocks() for details.
> ====

Sure. I paraphrased it as below.

If the expected maximum number of buffers to be dropped is small
enough, individual buffers are located by BufTableLookup(). Otherwise,
the buffer pool is sequentially scanned. Since buffers must not be
left behind, the latter way is executed unless the sizes of all the
involved forks are known to be accurate. See smgrcachednblocks() for
more details.
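
To illustrate, here is a rough structural sketch of that branch. BUF_DROP_FULL_SCAN_THRESHOLD,
cached, and nBlocksToInvalidate are only illustrative names, not necessarily the patch's exact
code; nBlocksToInvalidate is computed as sketched under point 6 below.

if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
{
    /* Fast path: look up each to-be-dropped block directly in the
     * buffer mapping hash table, fork by fork (see point 8 below). */
    for (i = 0; i < nforks; i++)
    {
        /* per-fork BufTableLookup() loop */
    }
}
else
{
    /* Slow path: the existing sequential scan of all shared buffers,
     * which cannot miss a buffer even if the cached sizes are stale. */
    for (i = 0; i < NBuffers; i++)
    {
        /* check the buffer's tag and invalidate it if it matches */
    }
}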

5.
> (I'm still mildly opposed to the function name, which seems exposing detail
> too much.)
I can't think of a better name; smgrcachednblocks seems straightforward to me.
I understand it may be confused with the relation field smgr_cached_nblocks,
but isn't that exactly what the function returns?

6.
> + * Get the total number of cached blocks and to-be-invalidated
> blocks
> + * of the relation. If a fork's nblocks is not valid, break the loop.
>
> The number of file blocks is not usually equal to the number of existing
> buffers for the file. We might need to explain that limitation here.

I revised that comment as below.

Get the total number of cached blocks and to-be-invalidated blocks
of the relation. The cached value returned by smgrcachednblocks
could be smaller than the actual number of existing buffers of the
file. This is caused by buggy Linux kernels that might not have
accounted for a recent write. If a fork's nblocks is invalid, exit the loop.
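
Roughly, that loop could look like the following (the variable names, including smgr_reln,
are placeholders for illustration rather than the patch's exact code):

BlockNumber nBlocksToInvalidate = 0;
bool        cached = true;

for (i = 0; i < nforks; i++)
{
    BlockNumber nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);

    if (nForkBlocks == InvalidBlockNumber)
    {
        cached = false;     /* size unknown, fall back to the full scan */
        break;
    }
    nBlocksToInvalidate += (nForkBlocks - firstDelBlock[i]);
}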

7.
> + for (j = 0; j < nforks; j++)
>
> Though I understand that j is considered to be in a connection with fork
> number, I'm a bit uncomfortable that j is used for the outmost loop..

I agree. I changed it to use i for the outermost loop, for consistency.

8.
> + for (curBlock = firstDelBlock[j]; curBlock <
> nTotalBlocks;
> +curBlock++)
>
> Mmm. We should compare curBlock with the number of blocks of the fork,
> not the total of all forks.

Oops, yes. That should be nForkBlocks, so we have to call smgrcachednblocks() again
inside the per-fork optimization loop.
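
A sketch of the corrected loop (identifiers are illustrative, not the final patch code):

for (i = 0; i < nforks; i++)
{
    BlockNumber nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);

    for (curBlock = firstDelBlock[i]; curBlock < nForkBlocks; curBlock++)
    {
        /* look up and invalidate the buffer for
         * (rnode.node, forkNum[i], curBlock); see point 10 below */
    }
}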

9.
> + uint32 newHash; /*hash value for newTag */
> + BufferTag newTag; /* identity of requested block */
> + LWLock *newPartitionLock; /* buffer partition lock for it */
>
> It seems to be copied from somewhere, but the buffer is not new at all.

Thanks for catching that. Yeah. Fixed.

10.
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> + bufHdr->tag.forkNum == forkNum[j] &&
> + bufHdr->tag.blockNum == curBlock)
> + InvalidateBuffer(bufHdr); /* releases spinlock */
>
> I think it cannot happen that the block is used for a different block of the
> same relation-fork, but it could be safer to check
> bufHdr->tag.blockNum >= firstDelBlock[j] instead.

Understood and that's fine with me. Updated.
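
For reference, here is a sketch of the per-block lookup-and-invalidate step with that check.
The variable names (bufTag, bufHash, partLock, etc.) are my placeholders following point 9;
the routines themselves are the existing buffer-manager APIs.

BufferTag   bufTag;         /* identity of the block to drop */
uint32      bufHash;        /* hash of bufTag */
LWLock     *partLock;       /* buffer mapping partition lock */
BufferDesc *bufHdr;
uint32      buf_state;
int         buf_id;

INIT_BUFFERTAG(bufTag, rnode.node, forkNum[i], curBlock);
bufHash = BufTableHashCode(&bufTag);
partLock = BufMappingPartitionLock(bufHash);

/* the mapping partition lock is enough to look the block up */
LWLockAcquire(partLock, LW_SHARED);
buf_id = BufTableLookup(&bufTag, bufHash);
LWLockRelease(partLock);

if (buf_id < 0)
    continue;               /* block is not in shared buffers */

bufHdr = GetBufferDescriptor(buf_id);

/* recheck under the buffer header spinlock, since the buffer may have
 * been evicted and reused after we released the partition lock */
buf_state = LockBufHdr(bufHdr);
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
    bufHdr->tag.forkNum == forkNum[i] &&
    bufHdr->tag.blockNum >= firstDelBlock[i])
    InvalidateBuffer(bufHdr);   /* releases spinlock */
else
    UnlockBufHdr(bufHdr, buf_state);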

11.
> + * smgrcachednblocks() -- Calculate the number of blocks that are
> cached in
> + * the supplied relation.
> + *
> + * It is equivalent to calling smgrnblocks, but only used in recovery
> +for now
> + * when DropRelFileNodeBuffers() is called. This ensures that only
> +cached value
> + * is used which is always valid in recovery, since there is no shared
> + * invalidation mechanism that is implemented yet for changes in file size.
> + *
> + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
> +available
> + * and when not in recovery.
>
> Isn't it too concrete? We need to mention the buggy-kernel issue here rahter
> than that of callers.
>
> And if the comment is correct, we should Assert(InRecovery) at the beggining
> of this function.

I did not add the Assert because it causes the recovery TAP tests to fail.
However, I updated the function description as below.

It is equivalent to calling smgrnblocks, but for now it is only used in recovery.
The returned file size could be inaccurate because the lseek of buggy Linux
kernels might not account for a recent file extension or write. However, this
function ensures that only the cached value is used, which is always valid in
recovery, since no shared invalidation mechanism for changes in file size has
been implemented yet.

This returns InvalidBlockNumber when smgr_cached_nblocks is not available
or when not in recovery.
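
For clarity, here is roughly what the function body could look like under that description.
The exact guards are an assumption on my part (a sketch, not the final patch code), and it
relies on the existing smgr_cached_nblocks field.

BlockNumber
smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
{
    /*
     * Only the cached value is usable here, and only during recovery,
     * because there is no shared invalidation mechanism for changes in
     * file size; otherwise report "unknown".
     */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    return InvalidBlockNumber;
}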

Thanks a lot for the reviews.
If there are any more comments, feedback, or points I might have missed, please feel free to reply.

Best regards,
Kirk Jamison

Attachment Content-Type Size
v21-Optimize-DropRelFileNodeBuffers-during-recovery.patch application/octet-stream 9.1 KB
v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch application/octet-stream 1.1 KB
