Quick Links

Re: POC: Cleaning up orphaned files using undo logs

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POC: Cleaning up orphaned files using undo logs
Date:	2019-05-13 18:05:59
Message-ID:	CA+TgmoZNkkyOGOoQTsjWbptywHG58=YOn9XyBxLynq+JONo96g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I have removed some of the globals and also improved some comments.

I don't like the discard_lock very much. Perhaps it's OK, but I hope
that there are better alternatives. One problem with Thomas Munro
pointed out to me in off-list discussion is that the discard_lock has
to be held by anyone reading undo even if the undo they are reading
and the undo that the discard worker wants to discard are in
completely different parts of the undo log. Somebody could be trying
to read an undo page written 1 second ago while the discard worker is
trying to discard an undo page written to the same undo log 1 hour
ago. Those things need not block each other, but with this design
they will. Another problem is that we end up holding it across an
I/O; there's precedent for that, but it's not particularly good
precedent. Let's see if we can do better.

My first idea was that we should just make this the caller's problem
instead of handling it in this layer. Undo is retained for committed
transactions until they are all-visible, and the reason for that is
that we presume that nobody can be interested in the data for MVCC
purposes unless there's a snapshot that can't see the results of the
transaction in question. Once the committed transaction is
all-visible, that's nobody, so it should be fine to just discard the
undo any time we like. That won't work with the existing zheap code,
which currently sometimes follows undo chains for transactions that
are all-visible, but I think that's a problem we should fix rather
than something we should force the undo layer to support. We'd still
need something kinda like the discard_lock for aborted transactions,
though, because as soon as you release the buffer lock on a table
page, the undo workers could apply all the undo to that page and then
discard it, and then you could afterwards try to look up the undo
pointer which you had retrieved from that page and stored in
backend-local memory. One thing we could probably do is make that a
heavyweight lock on the XID itself, so if you observe that an XID is
aborted, you have to go get this lock in ShareLock mode, then recheck
the page, and only then consult the undo; discarding the undo for an
aborted transaction would require AccessExclusiveLock on the XID.
This solution gets rid of the LWLock for committed undo; for aborted
undo, it avoids the false sharing and non-interruptibility that an
LWLock imposes.

But then I had what I think may be a better idea. Let's add a new
ReadBufferMode that suppresses the actual I/O; if the buffer is not
already present in shared_buffers, it allocates a buffer but returns
it without doing any I/O, so the caller must be prepared for BM_VALID
to be unset. I don't know what to call this, so I'll call it
RBM_ALLOCATE (leaving room for possible future variants like
RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer
would go like this:

1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the
relevant buffer.
2. Check whether the buffer precedes the discard horizon for that undo
log stored in shared memory.
3. If so, use the ForgetBuffer() code we have in the zheap branch to
deallocate the buffer and stop here. The undo is not available to be
read, whether it's still physically present or not.
4. Otherwise, if the buffer is not valid, call ReadBufferExtended
again, or some new function, to make it so. Remember to release all
of our pins.

The protocol for discarding an undo buffer would go like this:

1. Advance the discard horizon in shared memory.
2. Take a cleanup lock on each buffer that ought to be discarded.
Remember the dirty ones and forget the others.
3. WAL-log the discard operation.
4. Revisit the dirty buffers we remembered in step 2 and forget them.

The idea is that, once we've advanced the discard horizon in shared
memory, any readers that come along later are responsible for making
sure that they never do I/O on any older undo. They may create some
invalid buffers in shared memory, but they'll hopefully also get rid
of them if they do, and if they error out for some reason before doing
so, that buffer should age out naturally. So, the discard worker just
needs to worry about buffers that already exist. Once it's taken a
cleanup lock on each buffer, it knows that there are no I/O operations
and in fact no buffer usage of any kind still in progress from before
it moved the in-memory discard horizon. Anyone new that comes along
will clean up after themselves. We postpone forgetting dirty buffers
until after we've successfully WAL-logged the discard, in case we fail
to do so.

With this design, we don't add any new cases where a lock of any kind
must be held across an I/O, and there's also no false sharing.
Furthermore, unlike the previous proposal, this will work nicely with
something like old_snapshot_threshold. The previous design relies on
undo not getting discarded while anyone still cares about it, but
old_snapshot_threshold, if applied to zheap, would have the express
goal of discarding undo while somebody still cares about it. With
this design, we could support old_snapshot_threshold by having undo
readers error out in step #2 if the transaction is committed and not
visible to our snapshot but yet the undo is discarded. Heck, we can
do that anyway as a safety check, basically for free, and just tailor
the error message depending on whether old_snapshot_threshold is such
that the condition is expected to be possible.

While I'm kvetching, I can't help noticing that undoinsert.c contains
functions both for inserting undo and also for reading it, which seems
like a loose end that needs to be tied up somehow. I'm mildly
inclined to think that we should rename the file to something more
generic (e.g. undoaccess.h) rather than splitting it into two files
(e.g. undoinsert.c and undoread.c). Also, it looks to me like you
need to go through what is currently undoinsert.h and look for stuff
that can be made private to the .c file. I don't see why thing like
MAX_PREPARED_UNDO need to be exposed at all, and for things like
PreparedUndoSpace it seems like it would suffice to just do 'struct
PreparedUndoSpace; typedef struct PreparedUndoSpace
PreparedUndoSpace;' in the header and put the actual 'struct
PreparedUndoSpace { ... };' definition in the .c file. And
UnlockReleaseUndoBuffers has a declaration but no longer has a
definition, so I think that can go away too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: POC: Cleaning up orphaned files using undo logs at 2019-05-12 06:15:03 from Dilip Kumar

Responses

Re: POC: Cleaning up orphaned files using undo logs at 2019-05-15 10:28:30 from Amit Kapila
Re: POC: Cleaning up orphaned files using undo logs at 2019-05-15 14:44:18 from Dilip Kumar

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2019-05-13 18:15:10	Re: PANIC :Call AbortTransaction when transaction id is no normal
Previous Message	Tom Lane	2019-05-13 17:53:17	Re: SQL-spec incompatibilities in similar_escape() and related stuff