Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2020-11-13 05:11:35
Message-ID: CA+hUKGJL4X1em70rxN1d_EC3rxiVhVd1woHviydW=Hr2PeGBpg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska <ah(at)cybertec(dot)at> wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > Thanks. We decided to redesign a couple of aspects of the undo
> > storage and record layers that this patch was intended to demonstrate,
> > and work on that is underway. More on that soon.
>
> As my boss expressed in his recent blog post, we'd like to contribute to the
> zheap development, and a couple of developers from other companies are
> interested in this as well. Amit Kapila suggested that the "cleanup of
> orphaned files" feature is a good starting point for getting the code into PG
> core, so I've spent some time on it and tried to rebase the patch set.

Hi Antonin,

I saw that -- great news! -- and have been meaning to write for a
while. I think I am nearly ready to talk about it again. I agree
100% that it's worth trying to do something much simpler than a new
access manager, and this was the simplest useful feature solving a
real-world-problem-that-people-actually-have we could come up with
(based on an idea from Robert). I think it needs a convincing
explanation for why there is no scenario where the relfilenode is
recycled for a new unlucky table before the rollback is executed,
which might depend on details that you might be working on/changing
(scenarios where you execute undo twice because you forgot you already
did it).
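The recycling hazard can be modeled in miniature: if an undo record identifies the file only by relfilenode number, and that number is reused for a new table before (or between repeated runs of) rollback, executing the undo action would unlink the wrong file. Here is a toy sketch of one conceivable defense, tagging the record with the creating transaction's XID and checking it before unlinking. All structures and names here are hypothetical simplifications, not actual PostgreSQL code or the patch's design:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; not real PostgreSQL structures. */
typedef struct UndoRecord
{
    uint32_t relfilenode;   /* file to unlink on rollback */
    uint32_t creator_xid;   /* xid that created the file */
} UndoRecord;

typedef struct FileEntry
{
    uint32_t relfilenode;
    uint32_t creator_xid;   /* owner recorded at creation time */
    bool     exists;
} FileEntry;

/*
 * Execute the "unlink orphaned file" undo action, but only if the file
 * is still the one this undo record created.  If the relfilenode was
 * recycled by a later transaction, creator_xid differs and we must not
 * touch the new table's file.  A missing file means the undo already
 * ran, so re-execution is a harmless no-op.
 */
static bool
undo_unlink(const UndoRecord *rec, FileEntry *files, int nfiles)
{
    for (int i = 0; i < nfiles; i++)
    {
        FileEntry *f = &files[i];

        if (f->exists && f->relfilenode == rec->relfilenode)
        {
            if (f->creator_xid != rec->creator_xid)
                return false;   /* recycled: skip, don't unlink */
            f->exists = false;  /* safe to remove */
            return true;
        }
    }
    return false;               /* already gone: undo ran before */
}
```

The second branch (file present but owned by a different XID) is exactly the "unlucky table" case; the final return covers executing undo twice after forgetting it was already done.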

> In fact, what I did is not mere rebasing onto the current master branch -
> besides various bug fixes, I've also made some design changes.
>
> Incorporated the new Undo Record Set (URS) infrastructure
> ---------------------------------------------------------
>
> This is also pointed out in [0].
>
> I started from [1] and tried to implement some missing parts (e.g. proper
> closing of the URSs after a crash), introduced an UNDO_DEBUG preprocessor
> macro which makes the undo log segments very small, and fixed some bugs that
> the small segments exposed.

Cool! Getting up to speed on all these made up concepts like URS, and
getting all these pieces assembled and rebased and up and running is
already quite something, let alone adding missing parts and debugging.
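The UNDO_DEBUG trick mentioned above can be sketched as a compile-time switch: tiny segments make segment-boundary code paths (allocation, discard, recovery of a partially written segment) fire constantly under normal test workloads. The constant names and sizes below are hypothetical illustrations, not taken from the patch:

```c
/*
 * Hypothetical sketch: shrink undo segments in a debug build so that
 * segment-boundary handling is exercised on nearly every insertion.
 */
#ifdef UNDO_DEBUG
#define UNDO_SEGMENT_SIZE   (8UL * 1024)        /* tiny: 8 kB */
#else
#define UNDO_SEGMENT_SIZE   (1024UL * 1024)     /* normal: 1 MB */
#endif

/* Which segment a given undo-log offset falls into. */
static inline unsigned long
undo_segment_no(unsigned long offset)
{
    return offset / UNDO_SEGMENT_SIZE;
}
```

Because only the constant changes, the exact same boundary logic is compiled in both builds; the debug build just crosses boundaries orders of magnitude more often.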

> The most significant change I've made was the removal of the undo requests
> from the checkpoint. I could not find any particular bug or race condition
> related to including the requests in the checkpoint, but I concluded that
> it's easier to reason about consistency and checkpoint timing if we scan the
> undo log on restart (after recovery has finished) and create the requests
> from scratch.

Interesting. I guess that would be closer to textbook three-phase ARIES.
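The rebuild-at-restart idea quoted above can be sketched abstractly: instead of persisting a request queue through the checkpoint, derive it after recovery by scanning the undo log for record sets whose owning transaction never committed. Everything below is a toy model with hypothetical names (the commit check stands in for consulting clog), not the patch's actual structures:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct UndoRecordSet
{
    uint32_t xid;       /* owning transaction */
    bool     closed;    /* undo already applied or discarded */
} UndoRecordSet;

/* Toy commit check; the real system would consult the commit log. */
static bool
xid_committed(uint32_t xid, const uint32_t *committed, int n)
{
    for (int i = 0; i < n; i++)
        if (committed[i] == xid)
            return true;
    return false;
}

/*
 * After recovery finishes, rebuild the undo requests from scratch:
 * every still-open record set whose transaction did not commit needs
 * its undo actions executed.  No request state survives restart, so
 * there is nothing to keep consistent with the checkpoint.
 */
static int
build_undo_requests(const UndoRecordSet *sets, int nsets,
                    const uint32_t *committed, int ncommitted,
                    uint32_t *requests)     /* out: xids needing undo */
{
    int nreq = 0;

    for (int i = 0; i < nsets; i++)
        if (!sets[i].closed &&
            !xid_committed(sets[i].xid, committed, ncommitted))
            requests[nreq++] = sets[i].xid;
    return nreq;
}
```

The appeal is that the undo log itself is the single source of truth: a crash at any point during checkpointing cannot leave a stale or missing request behind, because requests are never stored.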

> [2] shows where I ended up before I started to rebase this patchset.
>
> No background undo
> ------------------
>
> Reduced complexity of the patch seems to be the priority at the moment. Amit
> suggested that cleanup of an orphaned relation file is simple enough to be
> done in the foreground, and I agree.
>
> "undo worker" is still there, but it only processes undo requests after server
> restart because relation data can only be changed in a transaction - it seems
> cleaner to launch a background worker for this than to hack the startup
> process.

I suppose the simplest useful system would be one that does the work at
startup before allowing connections, and also in regular backends, and
panics if a backend ever exits while it has pending undo (panic =
"goto crash recovery"). Then you don't have to deal with undo workers
running at the same time as regular sessions which might run into
trouble reacquiring locks (for an AM I mean), or due to OIDs being
recycled with multiple checkpoints, or undo work that gets deferred
until the next restart of the server.
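The invariant described above can be condensed into a small decision sketch: undo is attempted in the foreground at abort, and a backend that cannot finish its pending undo forces a crash restart, where the pre-connection startup pass picks it up. The functions and flags here are hypothetical stand-ins for illustration only:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the invariant: a backend never exits while it
 * still owes undo work.  Either the undo completes in the foreground,
 * or the server PANICs ("goto crash recovery"), and the startup pass
 * that runs before connections are allowed executes the undo instead.
 */

/* Stand-in for executing the transaction's undo actions. */
static bool
try_execute_undo(bool simulate_failure)
{
    return !simulate_failure;
}

/* Returns 0 if the abort path completes normally, 1 if we would PANIC. */
static int
backend_abort_path(bool has_pending_undo, bool undo_fails)
{
    if (!has_pending_undo)
        return 0;               /* nothing owed */
    if (try_execute_undo(undo_fails))
        return 0;               /* foreground undo succeeded */

    /*
     * In a real server this would be elog(PANIC, ...): crash recovery
     * rebuilds the pending-undo set from the log, so no work is lost
     * and no background worker races a live session for locks.
     */
    return 1;
}
```

The attraction of this shape is exactly what the paragraph above says: there is never a moment when undo workers and regular sessions coexist, so lock reacquisition and OID-recycling-across-checkpoints problems don't arise.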

> Since the concept of undo requests is closely related to the undo worker, I
> removed undorequest.c too. The new (much simpler) undo worker gets the
> information on incomplete / aborted transactions from the undo log as
> mentioned above.
>
> SMGR enhancement
> ----------------
>
> I used the 0001 patch from [3] rather than [4], although it's more invasive
> because I noticed somewhere in the discussion that there should be no reserved
> database OID for the undo log. (InvalidOid cannot be used because it's already
> in use for shared catalogs.)

I gave up thinking about the colour of the BufferTag shed and went
back to magic database 9, mainly because there seemed to be more
pressing matters. I don't even think it's that crazy to store this
type of system-wide data in pseudo databases, and I know of other
systems that do similar sorts of things without blinking...
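The "magic database" approach is really just an addressing convention: undo pages share the regular buffer pool by carrying a reserved pseudo-database OID in their buffer tags. A simplified sketch follows; the constant 9 comes from the message above, but the struct is a toy, not PostgreSQL's real BufferTag:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved pseudo-database OID for undo pages (per the discussion;
 * InvalidOid (0) is unavailable because shared catalogs use it). */
#define UNDO_DB_OID 9u

/* Toy buffer tag: just enough fields to tell undo pages apart from
 * ordinary relation pages.  Not the real BufferTag layout. */
typedef struct ToyBufferTag
{
    uint32_t dbOid;
    uint32_t relNumber;
    uint32_t blockNum;
} ToyBufferTag;

static inline bool
tag_is_undo(const ToyBufferTag *tag)
{
    return tag->dbOid == UNDO_DB_OID;
}
```

Because no real database can ever have that OID, the tag space partitions cleanly and no extra field is needed to distinguish undo buffers from table buffers.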

> Following are a few areas which are not implemented yet because more
> discussion is needed there:

Hmm. I'm thinking about these questions.
