Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-20 09:02:18
Message-ID: CA+hUKG+MpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

Aside from code changes based on review (and I have more to come of
those), the attached experimental patchset (also at
https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol
that, I hope, allows for better concurrency, reliability and
readability, and removes a bunch of TODO notes about questionable
interlocking. However, I'm not quite done figuring out if the bufmgr
interaction is right and will be manageable on the undoaccess side, so
I'm hoping to get some feedback, not asking for anyone to rebase on
top of it yet.

Previously, there were two LWLocks used to make sure that all
discarding was prevented while anyone was reading or writing data in
any part of an undo log, and (probably more problematically) vice
versa. Here's a new approach that removes that blocking:

1. Anyone is allowed to try to read or write data at any UndoRecPtr
that has been allocated, through the buffer pool (though you'd usually
want to check it with UndoRecPtrIsDiscarded() first, and only rely on
the system I'm describing to deal with races).

2. ReadBuffer() might return InvalidBuffer. This can happen for a
cache miss, if the smgrread implementation wants to indicate that the
buffer has been discarded/truncated and that is expected (md.c won't
ever do that, but undofile.c can).

3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
unpinned buffers, and marks as BM_DISCARDED any that happen to be
pinned right now, so they can't be immediately invalidated. Such
buffers are never written back and are eligible for reuse on the next
clock sweep, even if they're written into by a backend that managed to
do that when we were trying to discard.

4. In order to make this work, I needed to track an extra offset
'begin' that says what physical storage exists. So [begin, end) give
you the range of physical undo space (that is, files that exist on
disk) and [discard, insert) give you the range of active data within
it. There are now four offsets per log in shm and in the
pg_stat_undo_logs view.

5. Separating begin from discard allows the WAL logging for
UndoLogDiscard() to do filesystem actions before logging, and other
effects after logging, which have several nice properties if you work
through the various crash scenarios.

This allowed a lot of direct UndoLogSlot access and locking code to be
removed from undodiscard.c and undoaccess.c, because now they can just
proceed as normal, as long as they are prepared to give up whenever
the buffer manager tells them the buffer they're asking for has
evaporated. Once they've pinned a buffer, they don't need to care if
it becomes (or already was) BM_DISCARDED; attempts to dirty it will be
silently ignore and eventually it'll be reclaimed. It also gets rid
of 'oldest_data', which was another scheme tagging along behind the
discard pointer.

So now I'd like to get feedback on the sanity of this scheme. I'm not
saying it doesn't have bugs right now -- I've been trying to figure
out good ways to test it and I'm not quite there yet -- but the
concept. One observation I have is that there were already code paths
in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
the potentially different discard timing for DO vs REDO. I think
that's a point in favour of this scheme, but I can see that it's
inconvenient to have to deal with InvalidBuffer whenever you read.

Some other changes in this patch set:

1. There is a superuser-only procedure pg_force_discard_undo(logno)
that can discard on command. This can be used to get a system
unwedged if rollback actions are failing. For example, if you insert
an elog(ERROR, "boo!") into the smgr_undo() and then roll back a table
creation, you'll see a discard worker repeatedly reporting the error,
and pg_stat_undo_logs will show that the undo log space never gets
freed. This can be fixed with CALL pg_force_discard_undo(<logno>).

2. There is a superuser-only testing-only procedure
pg_force_switch_undo(logno) that can be used to force a transaction is
currently writing to that log number to switch to a new one, as if it
hit the end of the undo log (the 1TB address space within each undo
log). This is good for exercising code that eg rolls back stuff
spread over two undo logs.

3. When I was removing oldest_data from UndoLogSlot, I started
wondering why wait_fxmin was in there, as it was almost the last
reason why discard worker code needed to know about slots. Since we
currently have only a single discard worker, and no facility for
coordinating more than one discard worker, I think its bookkeeping
might as well be backend local. Here I made the stupidest change that
would work: a hash table to hold per-logno wait_fxmin. I'm not
entirely sure what data structure we really want for this -- it's all
a bit brute force right now. Thoughts?

I pulled in the latest code from undoprocessing as of today, and I
might be a bit confused about "Defect and enhancement in multi-log
support" some of which I have squashed into the make undolog patch.
BTW undoprocessing builds with initialized variable warnings in xact.c
on clang today.

--
Thomas Munro
https://enterprisedb.com

Attachment Content-Type Size
undo-20190820.tgz application/x-gzip 171.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-08-20 10:15:45 Re: Fixing typos and inconsistencies
Previous Message Heikki Linnakangas 2019-08-20 08:59:55 Re: understand the pg locks in in an simple case