Re: [Patch] Optimize dropping of relation buffers using dlist

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, k(dot)jamison(at)fujitsu(dot)com, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-11-06 05:40:04
Message-ID: CA+hUKGKHu6C2iShHavhNcX=Vy=0h_jmuoi8=Eo9CJtEA9AydhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > Here's a devil's advocate position I thought about: It's OK to leave
> > stray buffers (clean or dirty) in the buffer pool if files are
> > truncated underneath us by gremlins, as long as your system eventually
> > crashes before completing a checkpoint. The OID can't be recycled
> > until after a successful checkpoint, so the stray blocks can't be
> > confused with the blocks of another relation, and weird errors are
> > expected on a system that is in serious trouble. It's actually much
> > worse that we can give incorrect answers to queries when files are
> > truncated by gremlins (in the window of time before we presumably
> > crash because of EIO), because we're violating basic ACID principles
> > in user-visible ways. In this thread, discussion has focused on
> > availability (ie avoiding failures when trying to write back stray
> > buffers to a file that has been unlinked), but really a system that
> > can't see arbitrary committed transactions *shouldn't be available*.
> > This argument applies whether you think SEEK_END can only give weird
> > answers in the specific scenario I demonstrated with NFS, or whether
> > you think it's arbitrarily b0rked and reports random numbers: we
> > fundamentally can't tolerate that, so why are we trying to?
>
> It is not very clear to me how this argument applies to the patch
> in-discussion where we are relying on the cached value of blocks
> during recovery. I understand your point that we might skip scanning
> the pages and thus might not show some recently added data but that
> point is not linked with what we are trying to do with this patch.

It's an argument for giving up the hard-to-name cache trick completely
and going back to using unmodified smgrnblocks(), both in recovery and
online. If the only mechanism for unexpected file shrinkage is
writeback failure, then your system will be panicking soon enough
anyway -- so is it really that bad if there are potentially some other
weird errors logged some time before that? Maybe those errors will
even take the system down sooner, and maybe that's appropriate? If
there are other mechanisms for random file shrinkage that don't imply
a panic in your near future, then we have bigger problems that can't
be solved by any number of bandaids, at least not without
understanding the details of this hypothetical unknown failure mode.

The main argument I can think of against the idea of using plain old
smgrnblocks() is that the current error messages on smgrwrite()
failure for stray blocks would be indistinguishible from cases where
an external actor unlinked the file. I don't mind getting an error
that prevents checkpointing -- your system is in big trouble! -- but
it'd be nice to be able to detect that *we* unlinked the file,
implying the filesystem and bufferpool are out of sync, and spit out a
special diagnostic message. I suppose if it's the checkpointer doing
the writing, it could check if the relfilenode is on the
queued-up-for-delete-after-the-checkpoint list, and if so, it could
produce a different error message just for this edge case.
Unfortunately that's not a general solution, because any backend might
try to write a buffer out and they aren't synchronised with
checkpoints.

I'm not sure what the best approach is. It'd certainly be nice to be
able to drop small tables quickly online too, as a benefit of this
approach.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2020-11-06 06:18:48 Re: libpq compression
Previous Message Pavel Stehule 2020-11-06 05:36:18 Re: default result formats setting