Re: [Patch] Optimize dropping of relation buffers using dlist

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: k(dot)jamison(at)fujitsu(dot)com
Cc: tsunakawa(dot)takay(at)fujitsu(dot)com, andres(at)anarazel(dot)de, amit(dot)kapila16(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, thomas(dot)munro(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-12-04 05:04:27
Message-ID: 20201204.140427.2199799224811439733.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the new version.

This contains only replies. I'll send some further comments in another
mail later.

At Thu, 3 Dec 2020 03:49:27 +0000, "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com> wrote in
> On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote:
> > Hello, Kirk. Thank you for the new version.
>
> Apologies for the delay, but attached are the updated versions to simplify the patches.
> The changes reflected most of your comments/suggestions.
>
> Summary of changes in the latest versions.
> 1. Updated the function description of DropRelFileNodeBuffers in 0003.
> 2. Updated the commit logs of 0003 and 0004.
> 3, FindAndDropRelFileNodeBuffers is now called for each relation fork,
> instead of for all involved forks.
> 4. Removed the unnecessary palloc() and subscripts like forks[][],
> firstDelBlock[], nforks, as advised by Horiguchi-san. The memory
> allocation for block[][] was also simplified.
> So 0004 became simpler and more readable.
...
> > > a reliable size of nblocks for supplied relation's fork at that time,
> > > and it's safe because DropRelFileNodeBuffers() relies on the behavior
> > > that cached nblocks will not be invalidated by file extension during
> > > recovery. Otherwise, or if not in recovery, proceed to sequential
> > > search of the whole buffer pool.
> >
> > This sentence seems involving confusion. It reads as if "we can rely on it
> > because we're relying on it". And "the cached value won't be invalidated"
> > doesn't explain the reason precisely. The reason I think is that the cached
> > value is guaranteed to be the maximum page we have in shared buffer at least
> > while recovery, and that guarantee is holded by not asking fseek once we
> > cached the value.
>
> Fixed the commit log of 0003.

Thanks!

...
> > + nforks = palloc(sizeof(int) * n);
> > + forks = palloc(sizeof(ForkNumber *) * n);
> > + blocks = palloc(sizeof(BlockNumber *) * n);
> > + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM
> > + 1));
> > + for (i = 0; i < n; i++)
> > + {
> > + forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM +
> > 1));
> > + blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM
> > + 1));
> > + }
> >
> > We can allocate the whole array at once like this.
> >
> > BlockNumber (*blocks)[MAX_FORKNUM+1] =
> > (BlockNumber (*)[MAX_FORKNUM+1])
> > palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1))
>
> Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc.
> I followed this way in 0004, but it's my first time to see it written this way.
> I am very glad it works, though is it okay to write it this way since I cannot find
> a similar code of declaring and allocating 2D arrays like this in Postgres source code?

Actually it would be somewhat novel for a certain portion of people,
but it is fundamentally the same with function pointers. Hard to make
it from scratch, but I suppose not so hard to read:)

int (*func_char_to_int)(char x) = some_func;

FWIW isn.c has the following part:

> static bool
> check_table(const char *(*TABLE)[2], const unsigned TABLE_index[10][2])

> > + nBlocksToInvalidate += blocks[i][numForks];
> > +
> > + forks[i][numForks++] = j;
> >
> > We can signal to the later code the absense of a fork by setting
> > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be
> > removed.
>
> Followed it in 0004.

Looks fine to me, thanks.

> > + /* Zero the array of blocks because these will all be dropped anyway
> > */
> > + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n *
> > (MAX_FORKNUM +
> > +1));
> >
> > We don't need to prepare nforks, forks and firstDelBlocks for all relations
> > before looping over relations. In other words, we can fill in the arrays for a
> > relation at every iteration of relations.
>
> Followed your advice. Although I now drop the buffers per fork, which now
> removes forks[][], nforks, firstDelBlocks[].

That's fine for me.

> > + * We enter the optimization iff we are in recovery and the number of
> > +blocks to
> >
> > This comment ticks out of 80 columns. (I'm not sure whether that convention
> > is still valid..)
>
> Fixed.
>
> > + if (InRecovery && nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD)
> >
> > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do
> > that.
>
>
> As for DropRelFileNodesAllBuffers use case, I used InRecovery
> so that the optimization still works.
> Horiguchi-san also wrote in another mail:
> > A bit different from the point, but if some tuples have been inserted to the
> > truncated table, XLogReadBufferExtended() is called for the table and the
> > length is cached.
> I was wrong in my previous claim that the "cached" value always return false.
> When I checked the recovery test log from recovery tap test, there was only
> one example when "cached" became true (script below) and entered the
> optimization path. However, in all other cases including the TRUNCATE test case
> in my patch, the "cached" flag returns "false".

Yeah, I agree that smgrnblocks returns false in the targetted cases,
so we should want some amendment. We need to disucssion on this point.

> "cached" flag became true:
> # in different subtransaction patterns
> $node->safe_psql(
> 'postgres', "
> BEGIN;
> CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
> INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
> TRUNCATE spc_commit;
> SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
> COPY spc_commit FROM '$copy_file' DELIMITER ',';
> COMMIT;");
> $node->stop('immediate');
> $node->start;
>
> So I used the InRecovery for the optimization case of DropRelFileNodesAllBuffers.
> I retained the smgrnblocks' "cached" parameter as it is useful in
> DropRelFileNodeBuffers.

I think that's ok as this version of the patch.

> > > > I agree that we can do a better job by expanding comments to clearly
> > > > state why it is safe.
> > >
> > > Yes, basically what Amit-san also mentioned above. The first patch
> > prevents that.
> > > And in the description of DropRelFileNodeBuffers in the 0003 patch,
> > > please check If that would suffice.
> >
> > + * While in recovery, if the expected maximum number of
> > buffers to be
> > + * dropped is small enough and the sizes of all involved forks
> > are
> > + * already cached, individual buffer is located by
> > BufTableLookup().
> > + * It is safe because cached blocks will not be invalidated by file
> > + * extension during recovery. See smgrnblocks() and
> > smgrextend() for
> > + * more details. Otherwise, if the conditions for optimization are
> > not
> > + * met, the buffer pool is sequentially scanned so that no
> > buffers are
> > + * left behind.
> >
> > I'm not confident on it, but it seems somewhat obscure. How about
> > something like this?
> >
> > We mustn't leave a buffer for the relations to be dropped. We invalidate
> > buffer blocks by locating using BufTableLookup() when we assure that we
> > know up to what page of every fork we possiblly have a buffer for. We can
> > know that by the "cached" flag returned by smgrblocks. It currently gets true
> > only while recovery. See
> > smgrnblocks() and smgrextend(). Otherwise we scan the whole buffer pool to
> > find buffers for the relation, which is slower when a small part of buffers are
> > to be dropped.
>
> Followed your advice and modified it a bit.
>
> I have changed the status to "Needs Review".
> Feedbacks are always welcome.
>
> Regards,
> Kirk Jamison

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-12-04 05:05:15 Re: Single transaction in the tablesync worker?
Previous Message Amit Kapila 2020-12-04 04:59:42 Re: Single transaction in the tablesync worker?