RE: [Patch] Optimize dropping of relation buffers using dlist

From: "k(dot)jamison(at)fujitsu(dot)com" <k(dot)jamison(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "tomas(dot)vondra(at)2ndquadrant(dot)com" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-12-22 06:25:25
Message-ID: OSBPR01MB234113FD8DDAEBCC2317DADBEFDF0@OSBPR01MB2341.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote:
> I have started doing minor edits to the patch, especially planning to write a
> theory of why this optimization is safe, and here is what I came up with:
> "To
> remove all the pages of the specified relation forks from the buffer pool, we
> need to scan the entire buffer pool but we can optimize it by finding the
> buffers from BufMapping table provided we know the exact size of each fork
> of the relation. The exact size is required to ensure that we don't leave any
> buffer for the relation being dropped as otherwise the background writer or
> checkpointer can lead to a PANIC error while flushing buffers corresponding
> to files that don't exist.
>
> To know the exact size, we rely on the size we cached for each fork during
> recovery, which limits the optimization to recovery and standbys, but we
> can easily extend it once we have a shared cache for relation sizes.
>
> In recovery, we cache the value returned by the first lseek(SEEK_END), and
> subsequent writes keep the cached value up-to-date. See smgrextend. It is
> possible that the value of the first lseek is smaller than the actual number of
> existing blocks in the file due to buggy Linux kernels that might not have
> accounted for a recent write, but that should be fine because there must
> not be any buffers after that file size.
>
> XXX We would make the extra lseek call on the unoptimized paths, but that is
> okay because we do it only for the first fork, and we have to scan the entire
> buffer pool anyway, the cost of which is so high that the extra lseek call won't
> make any visible difference. We could use the InRecovery flag to avoid the
> additional cost, but that doesn't seem worth it."
>
> Thoughts?

+1
Thank you very much for expanding the comments to carefully explain why the
optimization is safe. I was struggling to explain it completely myself, and your
description also covers the possibility of extending the optimization in the
future once we have a shared cache for relation sizes, so I like this addition.
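
For anyone catching up on the thread, here is a minimal sketch of the recovery-time
size caching this relies on, modeled on my reading of the smgrnblocks() caching
already in the tree (src/backend/storage/smgr/smgr.c); the exact shape there may
differ slightly:

/*
 * Sketch only -- recovery-time caching of the fork size.  The cache is
 * filled by the first lseek(SEEK_END) and kept up to date by
 * smgrextend()/smgrtruncate().
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    /*
     * Only trust the cached value during recovery, because we lack a shared
     * invalidation mechanism for file-size changes.
     */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    reln->smgr_cached_nblocks[forknum] = result;

    return result;
}

With that cache, the drop path can treat the returned size as an upper bound on
the blocks of the fork that can still be sitting in shared buffers.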

(Also, based on the recent email exchanges, it seems we have concluded to retain
the locking mechanism of the existing patch. Both the traditional path and the
optimized path do the rechecking, so there seems to be no problem; I'm definitely
fine with it.)
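
Just to spell out the recheck being referred to, here is a condensed sketch of the
optimized per-block path as I picture it. The function name is illustrative, it
assumes the code lives in bufmgr.c so it can call InvalidateBuffer(), and the exact
conditions may not match the final patch:

static void
DropForkSpecificBuffers(RelFileNode rnode, ForkNumber forkNum,
                        BlockNumber firstDelBlock, BlockNumber nForkBlocks)
{
    BlockNumber curBlock;

    for (curBlock = firstDelBlock; curBlock < nForkBlocks; curBlock++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        INIT_BUFFERTAG(tag, rnode, forkNum, curBlock);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* See if the block is in the buffer pool at all. */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;

        bufHdr = GetBufferDescriptor(buf_id);

        /*
         * Recheck under the buffer header spinlock: the buffer could have
         * been evicted and reused for another page after we released the
         * mapping partition lock.  This is the same recheck the full-scan
         * path does.
         */
        buf_state = LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum >= firstDelBlock)
            InvalidateBuffer(bufHdr);   /* releases the spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}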

Regards,
Kirk Jamison
