Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-20 15:43:41
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2019-08-20 09:08:29 -0400, Robert Haas wrote:
> On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The primary one in the context here is that if we do *not* have to lock
> > the buffers all ahead of time, we can simplify the interface. We
> > certainly can't lock the buffers over IO (due to buffer reclaim) as
> > we're doing right now, so we'd need another phase, called by the "user"
> > during undo insertion. But if we do not need to lock the buffers before
> > the insertion over all starts, the inserting location doesn't have to
> > care.
> >
> > Secondarily, all the reasoning for needing to lock all buffers ahead of
> > time was imo fairly unconvincing. Following the "recipe" for WAL
> > insertions is a good idea when writing a new run-of-the-mill WAL
> > inserting location - but when writing a new fundamental facility, that
> > already needs to modify how WAL works, then I find that much less
> > convincing.
> So, you seem to be talking about something here which is different
> than what I thought we were talking about. One question is whether we
> need to lock all of the buffers "ahead of time," and I think the
> answer to that question is probably "no." Since nobody else can be
> writing to those buffers, and probably also nobody can be reading them
> except maybe for some debugging tool, it should be fine if we enter
> the critical section and then lock them at the point when we write the
> bytes. I mean, there shouldn't be any contention, and I don't see any
> other problems.

Right. As long as we are as restrictive about the number of buffers per
undo record, and number of records per WAL insertions, I don't see any
need to go further than that.

> > Well, in the version of code that I was reviewing here, I don't there is
> > such a limit (there is a limit for buffers per undo record, but no limit
> > on the number of records inserted together). I think Dilip added a limit
> > since. And we have the issue of a lot of IO happening while holding
> > content locks on several pages. So I don't think it's a straw man at
> > all.
> Hmm, what do you mean by "a lot of IO happening while holding content
> locks on several pages"? We might XLogInsert() but there shouldn't be
> any buffer I/O going on at that point.

That's my primary complain with how the code is structured right
now. Right now we potentially perform IO while holding exclusive content
locks, often multiple ones even. When acquiring target pages for undo,
currently always already hold the table page exclusive locked, and if
there's more than one buffer for undo, we'll also hold the previous
buffers locked. And acquiring a buffer will often have to write out a
dirty buffer to the OS, and a lot of times that will then also require
the kernel to flush data out. That's imo an absolute no-go for the
general case.

> If there is, I think that should be redesigned. We should collect
> buffer pins first, without locking. Then lock. Then write. Or maybe
> lock-and-write, but only after everything's pinned.

Right. It's easy enough to do that for the locks on undo pages
themselves. The harder part is the content lock on the "table page" - we
don't accurately know how many undo buffers we will need, without
holding the table lock (or preventing modifications in some other

I tried to outline the problem and potential solutions in more detail

> The fact of calling XLogInsert() while holding buffer locks is not
> great, but I don't think it's any worse here than in any other part of
> the system, because the undo buffers aren't going to be suffering
> concurrent access from any other backend, and because there shouldn't
> be more than a few of them.

Yea. That's obviously undesirable, but also fundamentally required at
least in the general case. And it's not at all specific to undo.

> [ WAL logging protocol ]

> I didn't intend to suggest that you don't want this to be documented.
> What I intended to suggest was that you seem to want to deviate from
> the documented rules, and it seems to me that we shouldn't do that
> unless we change the rules first, and I don't know what you think the
> rules should be or why those rules are safe.

IDK. We have at least five different places that at the very least bend
the rules - but with a comment explaining why it's safe in the specific
case. Personally I don't really think the generic guideline needs to
list every potential edge-case.

> > I think what primarily makes me concerned is that it's not clear to me
> > what guarantees that discard is the only reason for the block to
> > potentially be missing. I contrast to most other similar cases where WAL
> > replay simply re-creates the objects when trying to replay an action
> > affecting such an object, here we simply skip over the WAL logged
> > operation. So if e.g. the entire underlying UNDO file got lost, we
> > neither re-create it with valid content, nor error out. Which means we
> > got to be absolutely sure that all undo files are created in a
> > persistent manner, at their full size. And that there's no way that data
> > could get lost, without forcing us to perform REDO up to at least the
> > relevant point again.
> I think the crucial question for me here is the extent to which we're
> cross-checking against the discard pointer. If we're like, "oh, this
> undo data isn't on disk any more, it must've already been discarded,
> let's ignore the write," that doesn't sound particularly great,
> because files sometimes go missing.


> But, if we're like, "oh, we dirtied this undo buffer but now that undo
> has been discarded so we don't need to write the data back to the
> backing file," that seems fine. The discard pointer is a fully
> durable, WAL-logged thing; if it's somehow wrong, we have got huge
> problems anyway.

There is some cross-checking against the discard pointer while reading,
but it's not obvious for me that there is in all places. In particularly
for insertions. UndoGetBufferSlot() itself doesn't have a crosscheck
afaict, and I don't see anything in InsertPreparedUndo() either. It's
possible that somehow it's indirectly guaranteed, but if so it'd be far
from obvious enough.

> > While it appears that we always WAL log the undo extension, I am not
> > convinced the recovery interlock is strong enough. For one
> > UndoLogDiscard() unlinks segments before WAL logging their removal -
> > which means if we crash after unlink() and before the
> > XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in
> > practice we might be fine, because there ought to be nobody still
> > referencing that UNDO - but I don't think that's actually guaranteed as
> > is).
> Hmm, that sounds a little worrying. I think there are two options
> here: unlike what we do with buffers, where we can use buffer locking
> etc. to make the insertion of the WAL record effectively simultaneous
> with the changes to the data pages, the removal of old undo files has
> to happen either before or after XLogInsert(). I think "after" would
> be better.


> > Nor do I see where we're updating minRecoveryLocation when
> > replaying a XLOG_UNDOLOG_DISCARD, which means that a restart during
> > recovery could be stopped before the discard has been replayed, leaving
> > us with wrong UNDO, but allowing write acess. Seems we'd at least need a
> > few more XLogFlush() calls.
> That sounds like a problem, but it seems like it might be better to
> make sure that minRecoveryLocation gets bumped, rather than adding
> XLogFlush() calls.

XLogFlush() so far is the way to update minRecoveryLocation:

* During REDO, we are reading not writing WAL. Therefore, instead of
* trying to flush the WAL, we should update minRecoveryPoint instead. We
* test XLogInsertAllowed(), not InRecovery, because we need checkpointer
* to act this way too, and because when it tries to write the
* end-of-recovery checkpoint, it should indeed flush.
if (!XLogInsertAllowed())
UpdateMinRecoveryPoint(record, false);

I don't think there's currently any other interface available to redo
functions to update minRecoveryLocation. And we already use XLogFlush()
for that purpose in numerous redo routines.


Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2019-08-20 16:06:06 Re: Global temporary tables
Previous Message Sergei Kornilov 2019-08-20 15:20:10 Re: Change ereport level for QueuePartitionConstraintValidation