From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, k(dot)jamison(at)fujitsu(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [Patch] Optimize dropping of relation buffers using dlist |
Date: | 2020-10-22 09:54:59 |
Message-ID: | CAA4eK1K8++A=zws=58MyH84gf3DZg8Rs=-a4QLZCHM7f9b3MsQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Oct 22, 2020 at 2:20 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Thu, 22 Oct 2020 07:31:55 +0000, "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com> wrote in
> > From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> > > On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi
> > > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > > Mmm. Not exact. The requirement here is that we must be certain that
> > > > the we don't have a buffuer for blocks after the file size known to
> > > > the process. While recoverying, If the first lseek() returned smaller
> > > > size than actual, we cannot have a buffer for the blocks after the
> > > > size. After we trncated or extended the file, we are certain that we
> > > > don't have a buffer for unknown blocks.
> > >
> > > Thanks, I understand now. Something feels fragile about it, perhaps
> > > because it's not really acting as a "cache" anymore despite its name,
> > > but I see the logic now. It becomes the authoritative source of
> > > information, even if the kernel decides to make our file smaller
> > > asynchronously.
I understand your hesitation but I guess if we can't rely on this
cache in recovery then probably we have a problem without this patch
itself because the current relation extension (in ReadBuffer_common)
relies on the smgrnblocks. So, if the cache lies with us it will
overwrite some existing block.
> > Thank you Horiguchi-san, you are a savior! I was worried like the end of the world has come.
> >
> >
> > > I think a synchronised file size cache wouldn't be enough to use this
> > > trick outside the recovery process, because the initial value would
> > > come from a call to lseek(), but unlike recovery, that wouldn't happen
> > > *before* we start putting pages in the buffer pool.
This is true because the other sessions might have pulled the page to
buffer pool but I think if we have invalidations for
extension/truncation of a relation then probably before relying on
this value we can process the invalidations to update this cache
value.
> > > Also, if we one
> > > day have a size-limited relcache, even recovery could get into
> > > trouble, if it evicts the RelationData that holds the authoritative
> > > nblocks value.
> >
> > That's too bad, because we hoped we may be able to various operations during normal operation (TRUNCATE, DROP TABLE/INDEX, DROP DATABASE, etc.) An honest man can't believe the system call, that's a hell.
> >
> > I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?
> >
> > Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?
>
> Lossy smgrrelcache or relacache is not a hard obstacle. As the same
> with the case of !accurate, we just give up the optimized dropping if
> the relcache doesn't give the authoritative size.
>
I think detecting lossy cache is the key thing, probably it might not
be as straight forward as it is in recovery path.
> By the way, heap scan finds the size of target relation using
> smgrnblocks(). I'm not sure why we don't miss recently-extended pages
> on a heap-scan? It seems to be possible that concurrent checkpoint
> fsyncs relation files inbetween the extension and scanning and the
> scanning gets smaller size than it really is.
>
Yeah, I think that would be a problem but not as serious as in the
case we are trying to deal here.
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2020-10-22 10:21:11 | Move catalog toast table and index declarations |
Previous Message | John Naylor | 2020-10-22 09:50:52 | Re: speed up unicode decomposition and recomposition |