Re: [Patch] Optimize dropping of relation buffers using dlist

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, k(dot)jamison(at)fujitsu(dot)com, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [Patch] Optimize dropping of relation buffers using dlist
Date: 2020-11-10 04:29:47
Message-ID: CA+hUKG+SLCKMNByN6t-n1FWGvTJR2phMontJmmve94MHRSYMVg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think one of the problems is returning fewer rows and that too
> without any warning or error, so maybe that is a bigger problem but we
> seem to be okay with it as that is already a known thing though I
> think that is not documented anywhere.

I'm not OK with it, and I'm not sure it's widely known or understood,
though I think we've made some progress in this thread. Perhaps, as a
separate project, we need to solve several related problems with a
shmem table of relation sizes from not-yet-synced files so that
smgrnblocks() is fast and always sees all preceding smgrextend()
calls. If we're going to need something like that anyway, and if we
can come up with a simple way to detect and report this type of
failure in the meantime, maybe this fast DROP project should just go
ahead and use the existing smgrnblocks() function without the weird
caching band-aid that only works in recovery?

> > The main argument I can think of against the idea of using plain old
> > smgrnblocks() is that the current error messages on smgrwrite()
> > failure for stray blocks would be indistinguishible from cases where
> > an external actor unlinked the file. I don't mind getting an error
> > that prevents checkpointing -- your system is in big trouble! -- but
> > it'd be nice to be able to detect that *we* unlinked the file,
> > implying the filesystem and bufferpool are out of sync, and spit out a
> > special diagnostic message. I suppose if it's the checkpointer doing
> > the writing, it could check if the relfilenode is on the
> > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > produce a different error message just for this edge case.
> > Unfortunately that's not a general solution, because any backend might
> > try to write a buffer out and they aren't synchronised with
> > checkpoints.
>
> Yeah, but I am not sure if we can consider manual (external actor)
> tinkering with the files the same as something that happened due to
> the database server relying on the wrong information.

Here's a rough idea I thought of to detect this case; I'm not sure if
it has holes. When unlinking a relation, currently we truncate
segment 0 and unlink all the rest of the segments, and tell the
checkpointer to unlink segment 0 after the next checkpoint. What if
we also renamed segment 0 to "$X.dropped" (to be unlinked by the
checkpointer), and taught GetNewRelFileNode() to also skip anything
for which "$X.dropped" exists? Then mdwrite() could use
_mdfd_getseg(EXTENSION_RETURN_NULL), and if it gets NULL (= no file),
then it checks if "$X.dropped" exists, and if so it knows that it is
trying to write a stray block from a dropped relation in the buffer
pool. Then we panic, or warn but drop the write. The point of the
renaming is that (1) mdwrite() for segment 0 will detect the missing
file (not just higher segments), and (2) every backend can see that a
relation has been recently dropped, while also interlocking with the
checkpointer through buffer locks.

> One vague idea could be to develop pg_test_seek which can detect such
> problems but not sure if we can rely on such a tool to always give us
> the right answer. Were you able to consistently reproduce the lseek
> problem on the system where you have tried?

Yeah, I can reproduce that reliably, but it requires quite a bit of
set-up as root, so it might be tricky to package up in easy-to-run
form. It might be quite nice to prepare an easy-to-use "gallery of
weird buffered I/O effects" project, including some of the
local-filesystem-with-fault-injection stuff that Craig Ringer and
others were testing with a couple of years ago, but maybe not in the
pg repo.
