RE: [Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "amit(dot)kapila16(at)gmail(dot)com" <amit(dot)kapila16(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "thomas(dot)munro(at)gmail(dot)com" <thomas(dot)munro(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-12-03 03:49:27
Message-ID: OSBPR01MB23414F6F083D4D1546948A68EFF20@OSBPR01MB2341.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> Hello, Kirk. Thank you for the new version.

Apologies for the delay. Attached are the updated versions, which simplify the patches.
The changes reflect most of your comments/suggestions.

Summary of changes in the latest versions.
1. Updated the function description of DropRelFileNodeBuffers in 0003.
2. Updated the commit logs of 0003 and 0004.
3. FindAndDropRelFileNodeBuffers is now called for each relation fork,
instead of for all involved forks.
4. Removed the unnecessary palloc() and subscripts like forks[][],
firstDelBlock[], nforks, as advised by Horiguchi-san. The memory
allocation for block[][] was also simplified.
So 0004 became simpler and more readable.

> At Thu, 26 Nov 2020 03:04:10 +0000, "k(dot)jamison(at)fujitsu(dot)com"
> <k(dot)jamison(at)fujitsu(dot)com> wrote in
> > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > > From: Andres Freund <andres(at)anarazel(dot)de>
> > > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > > which 3/4 doesn't address and 4/4 doesn't mention.
> > > >
> > > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > > TRUNCATE?
> > >
> > > Yes. DropRelationFiles() is used in the following two paths:
> > >
> > > [Replay of TRUNCATE during recovery]
> > > xact_redo_commit/abort() -> DropRelationFiles() ->
> > > smgrdounlinkall() ->
> > > DropRelFileNodesAllBuffers()
> > >
> > > [COMMIT/ROLLBACK PREPARED]
> > > FinishPreparedTransaction() -> DropRelationFiles() ->
> > > smgrdounlinkall()
> > > -> DropRelFileNodesAllBuffers()
> >
> > Yes. The concern is that it was not clear in the function descriptions
> > and commit logs what the optimizations to DropRelFileNodeBuffers() and
> > DropRelFileNodesAllBuffers() are for. So I revised only the function
> > description of DropRelFileNodeBuffers() and the commit logs of the
> > 0003-0004 patches. Please check if the brief descriptions would suffice.
>
> I read the commit message of 3/4. (Though this text is not included
> literally in the final commit.)
>
> > While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum or
> > autovacuum are replayed, the buffers are dropped when the sizes of all
> > involved forks of a relation are already "cached". We can get
>
> This sentence seems to be missing "dropped by (or using) what".
>
> > a reliable size of nblocks for supplied relation's fork at that time,
> > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > that cached nblocks will not be invalidated by file extension during
> > recovery. Otherwise, or if not in recovery, proceed to sequential
> > search of the whole buffer pool.
>
> This sentence seems confusing. It reads as if "we can rely on it
> because we're relying on it". And "the cached value won't be invalidated"
> doesn't explain the reason precisely. The reason, I think, is that the
> cached value is guaranteed to be the maximum page we have in shared
> buffers, at least while in recovery, and that guarantee is held by not
> asking fseek again once we have cached the value.

Fixed the commit log of 0003.

> > > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > > DropRelFileNodesAllBuffers() can be used for many relations at
> > > > once, this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > > > lookups a lot of times, once for each of nnodes relations?
> > >
> > > So, the threshold value should be compared with the total number of
> > > blocks of all target relations, not each relation. You seem to be
> > > right, got it.
> >
> > Fixed this in the 0004 patch. Now we compare the total number of
> > buffers-to-be-invalidated for ALL relations to the
> > BUF_DROP_FULL_SCAN_THRESHOLD.
>
> I didn't see the previous version, but the series of additional
> palloc/pfree calls in this version looks uneasy.

Fixed this too.

> int i,
> + j,
> + *nforks,
> n = 0;
>
> I don't think we should define variables of different types in one
> declaration. (I'm not sure about defining multiple variables at once.)

Fixed this too.
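For illustration, a minimal sketch of the style change (not the exact patch
hunk), with each type getting its own declaration:

	int			i;
	int			j;
	int			n = 0;
	int		   *nforks;		/* the pointer is declared separately */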

> @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
>
>  			DropRelFileNodeAllLocalBuffers(rnodes[i].node);
>  		}
>  		else
> +		{
> +			rels[n] = smgr_reln[i];
>  			nodes[n++] = rnodes[i].node;
> +		}
>  	}
>
> We don't need to remember both rels and nodes here, since nodes[n] is
> rels[n]->smgr_rnode.node here. Or we don't even need to store rels, since
> we can scan smgr_reln again later.
>
> nodes is needed in the full-scan path, but it is enough to collect it
> after finding that we do a full scan.

I followed your advice and removed rnodes[] and rels[].
nodes[] is now allocated later, in the full-scan path.
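Roughly, the full-scan path then takes this shape (a hypothetical sketch,
not the exact patch code; the names follow the hunk quoted above):

	/* Collect the RelFileNodes only once we know a full scan is needed */
	nodes = palloc(sizeof(RelFileNode) * n);
	for (i = 0; i < n; i++)
		nodes[i] = smgr_reln[i]->smgr_rnode.node;
	/* ... then scan the whole buffer pool sequentially using nodes[] ... */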

> +	nforks = palloc(sizeof(int) * n);
> +	forks = palloc(sizeof(ForkNumber *) * n);
> +	blocks = palloc(sizeof(BlockNumber *) * n);
> +	firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
> +	for (i = 0; i < n; i++)
> +	{
> +		forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
> +		blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
> +	}
>
> We can allocate the whole array at once like this.
>
> 	BlockNumber (*blocks)[MAX_FORKNUM+1] =
> 		(BlockNumber (*)[MAX_FORKNUM+1])
> 			palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

Thank you for the suggestion to reduce the lines needed for the 2D dynamic
memory allocation. I followed this approach in 0004, but it's the first time
I have seen it written this way. I am very glad it works, though is it okay
to write it this way? I cannot find similar code that declares and allocates
2D arrays like this in the PostgreSQL source code.
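For anyone reading along, here is a minimal standalone sketch of the
pointer-to-array idiom, using malloc/free in place of palloc/pfree and a
stand-in MAX_FORKNUM so it compiles outside the backend:

	#include <stdio.h>
	#include <stdlib.h>

	#define MAX_FORKNUM 3
	typedef unsigned int BlockNumber;

	int
	main(void)
	{
		int			n = 4;		/* number of relations */

		/* One allocation; blocks[i][j] then indexes like a 2D array. */
		BlockNumber (*blocks)[MAX_FORKNUM + 1] =
			(BlockNumber (*)[MAX_FORKNUM + 1])
				malloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

		for (int i = 0; i < n; i++)
			for (int j = 0; j <= MAX_FORKNUM; j++)
				blocks[i][j] = (BlockNumber) (i * 100 + j);

		printf("blocks[2][1] = %u\n", blocks[2][1]);	/* prints 201 */

		free(blocks);			/* one free releases the whole array */
		return 0;
	}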

> + nBlocksToInvalidate += blocks[i][numForks];
> +
> + forks[i][numForks++] = j;
>
> We can signal the absence of a fork to the later code by setting the
> corresponding blocks entry to InvalidBlockNumber. Thus forks[], nforks and
> numForks can be removed.

Followed it in 0004.
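The per-relation collection loop then looks roughly like this (a sketch
only, assuming the "cached" out-parameter that 0002 adds to smgrnblocks();
not compilable standalone):

	for (j = 0; j <= MAX_FORKNUM; j++)
	{
		if (!smgrexists(smgr_reln[i], j))
		{
			/* sentinel: fork is absent, skip it in the drop loop later */
			blocks[i][j] = InvalidBlockNumber;
			continue;
		}
		blocks[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
		nBlocksToInvalidate += blocks[i][j];
	}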

> +	/* Zero the array of blocks because these will all be dropped anyway */
> +	MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
>
> We don't need to prepare nforks, forks and firstDelBlocks for all relations
> before looping over relations. In other words, we can fill in the arrays
> for one relation at each iteration of the loop.

Followed your advice, although I now drop the buffers per fork, which
removes forks[][], nforks, and firstDelBlocks[].
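So the drop step becomes a per-fork loop along these lines (a hypothetical
sketch; the FindAndDropRelFileNodeBuffers() signature here is an assumption):

	for (j = 0; j <= MAX_FORKNUM; j++)
	{
		if (blocks[i][j] == InvalidBlockNumber)
			continue;			/* fork absent; no buffers to look up */

		/* the relation files are being dropped, so firstDelBlock is 0 */
		FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node, j,
									  blocks[i][j], 0);
	}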

> +	 * We enter the optimization iff we are in recovery and the number of blocks to
>
> This comment sticks out past 80 columns. (I'm not sure whether that
> convention is still valid..)

Fixed.

> +	if (InRecovery && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
>
> We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do
> that.

As for the DropRelFileNodesAllBuffers() use case, I used InRecovery
so that the optimization still works.
Horiguchi-san also wrote in another mail:
> A bit different from the point, but if some tuples have been inserted to the
> truncated table, XLogReadBufferExtended() is called for the table and the
> length is cached.
I was wrong in my previous claim that the "cached" value always returns false.
When I checked the recovery test log from the recovery TAP tests, there was
only one case where "cached" became true (script below) and the optimization
path was entered. However, in all other cases, including the TRUNCATE test
case in my patch, the "cached" flag returned "false".

"cached" flag became true:
# in different subtransaction patterns
$node->safe_psql(
	'postgres', "
	BEGIN;
	CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
	INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
	TRUNCATE spc_commit;
	SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
	COPY spc_commit FROM '$copy_file' DELIMITER ',';
	COMMIT;");
$node->stop('immediate');
$node->start;

So I used InRecovery for the optimization case of DropRelFileNodesAllBuffers().
I retained the "cached" parameter of smgrnblocks(), as it is useful in
DropRelFileNodeBuffers().

> > > I agree that we can do a better job by expanding comments to clearly
> > > state why it is safe.
> >
> > Yes, basically what Amit-san also mentioned above. The first patch
> > prevents that.
> > And in the description of DropRelFileNodeBuffers in the 0003 patch,
> > please check if that would suffice.
>
> +	 * While in recovery, if the expected maximum number of buffers to be
> +	 * dropped is small enough and the sizes of all involved forks are
> +	 * already cached, individual buffer is located by BufTableLookup().
> +	 * It is safe because cached blocks will not be invalidated by file
> +	 * extension during recovery. See smgrnblocks() and smgrextend() for
> +	 * more details. Otherwise, if the conditions for optimization are not
> +	 * met, the buffer pool is sequentially scanned so that no buffers are
> +	 * left behind.
>
> I'm not confident about it, but the current text seems somewhat obscure.
> How about something like this?
>
> We mustn't leave a buffer for the relations to be dropped. We invalidate
> buffer blocks by locating them using BufTableLookup() when we are sure we
> know up to what page of every fork we possibly have a buffer for. We can
> know that by the "cached" flag returned by smgrnblocks(). It currently gets
> true only while in recovery. See smgrnblocks() and smgrextend(). Otherwise
> we scan the whole buffer pool to find buffers for the relation, which is
> slower when only a small part of the buffers is to be dropped.

Followed your advice and modified it a bit.

I have changed the status to "Needs Review".
Feedback is always welcome.

Regards,
Kirk Jamison

Attachment Content-Type Size
v33-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch application/octet-stream 1.1 KB
v33-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch application/octet-stream 9.5 KB
v33-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch application/octet-stream 9.3 KB
v33-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch application/octet-stream 6.0 KB
