Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-10 16:36:14
Message-ID: CA+TgmoZ8AOC0h=j9WEkOV_FVZFW54Q00aVDXMZTJp9jeWZNv=Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> As of now, after we finish executing the rollback actions, the entry
> from the hash table is removed. Now, at a later time (when queues are
> full and we want to insert a new entry) when we access the queue entry
> (to check whether we can remove it) corresponding to the removed hash
> table entry, will it be safe to access it? The hash table entry might
> have been freed or would have been reused as some other entry by the
> time we try to access it.

Hmm, yeah, that's a problem. I think we could possibly fix this by
having the binary heaps just store a FullTransactionId rather than a
pointer to the RollBackHashEntry. Then, if you get a
FullTransactionId from the binary heap, you just do a hash table
lookup to find the RollBackHashEntry instead of accessing it directly.
If it doesn't exist, then you can just discard the entry: it's for
some old transaction that's no longer relevant.
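In code form, roughly the following (RollbackHT and the payload fields
are placeholders, and storing the XID in a Datum leans on the 64-bit
caveat below):

#include "postgres.h"
#include "access/transam.h"
#include "lib/binaryheap.h"
#include "utils/hsearch.h"

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;     /* hash key */
    /* ... start/end undo pointers, dbid, etc. ... */
} RollbackHashEntry;

static HTAB *RollbackHT;            /* keyed by full_xid */

/*
 * Pop requests until we find one whose hash table entry still exists.
 * Stale heap entries, i.e. XIDs whose undo work has already been
 * processed and whose hash entries are gone, are simply discarded.
 */
static RollbackHashEntry *
GetNextRollbackRequest(binaryheap *heap)
{
    while (!binaryheap_empty(heap))
    {
        FullTransactionId fxid;
        bool        found;
        RollbackHashEntry *entry;

        fxid.value = DatumGetUInt64(binaryheap_remove_first(heap));
        entry = (RollbackHashEntry *)
            hash_search(RollbackHT, &fxid, HASH_FIND, &found);
        if (found)
            return entry;
    }
    return NULL;
}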

However, there are a few problems with that idea. One is that I see
that you've made the hash table keyed by full_xid + start_urec_ptr
rather than just full_xid, so if the queues just point to an XID, it's
not enough to find the hash table entry. The comment claims that this
is necessary because "in the same transaction, there could be rollback
requests for both logged and unlogged relations," but I don't
understand why that means we need start_urec_ptr in the hash table
key. It would seem more natural to me to have a single entry that
covers both the logged and the unlogged undo for that transaction.

(Incidentally, I don't think it's correct that RollbackHashEntry
starts with FullTransactionId full_xid + UndoRecPtr start_urec_ptr
declared separately; I think it should start with RollbackHashKey -
although if we change the key back to just a FullTransactionId then we
don't need to worry separately about fixing this issue.)
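For the record, if we do keep the two-part key, the layout dynahash
expects is something like this, so that what it hashes and compares is
exactly the declared key (UndoRecPtr here comes from the undo patch
set, not core):

typedef struct RollbackHashKey
{
    FullTransactionId full_xid;
    UndoRecPtr  start_urec_ptr;
} RollbackHashKey;

typedef struct RollbackHashEntry
{
    RollbackHashKey key;        /* must be first, and must match the
                                 * keysize passed to hash_create() */
    /* ... payload fields ... */
} RollbackHashEntry;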

Another problem is that on a 64-bit system, we can pass a
FullTransactionId by value, but on a 32-bit system we can't. That's
awkward, because if we can't pass the XID by value, then we're back to
needing a separately-allocated structure for the queue entries, which
I was really hoping to avoid.
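Just to spell out why: a Datum is only pointer-sized, so a helper like
this (hypothetical name) is a plain value copy on 64-bit but not on
32-bit:

static inline Datum
FullTransactionIdToDatum(FullTransactionId fxid)
{
    /*
     * Fine on 64-bit: the uint64 fits in the Datum.  On a 32-bit build,
     * UInt64GetDatum has to return a pointer to palloc'd storage (the
     * !USE_FLOAT8_BYVAL case), which is exactly the separate allocation
     * we're trying to avoid.
     */
    return UInt64GetDatum(U64FromFullTransactionId(fxid));
}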

A second possible approach to this problem is to just reset all the
binary heaps (using binaryheap_reset) whenever we insert a new entry
into the hash table, and rebuild them the next time they're needed by
reinserting all of the current entries in the hash table. That might
be too inefficient. You can insert a bunch of things in a row without
re-heaping, and you can dequeue a bunch of things in a row without
re-heaping, but if they alternate you'll re-heap a lot. I don't know
whether that costs enough to worry about; it might be fine.
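The rebuild itself is cheap and simple, something like this (reusing
the RollbackHashEntry layout from the earlier sketch), done lazily the
first time a heap is consulted after a reset:

static void
RebuildRollbackHeap(binaryheap *heap, HTAB *rollback_ht)
{
    HASH_SEQ_STATUS status;
    RollbackHashEntry *entry;

    binaryheap_reset(heap);
    hash_seq_init(&status, rollback_ht);
    while ((entry = (RollbackHashEntry *) hash_seq_search(&status)) != NULL)
        binaryheap_add_unordered(heap, PointerGetDatum(entry));
    binaryheap_build(heap);     /* heapify once, O(n) */
}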

A third possible approach is to allocate a separate array whose
entries are reused, and to maintain a freelist of entries from that
array. All the real data is stored in this array, and the binary
heaps and hash table entries just point to it. When the freelist is
empty, the next allocation scans all the binary heaps and removes any
pointers to inactive entries; it then puts all inactive entries back
onto the freelist. This is more complex than the previous approach,
and it doesn't totally avoid re-heaping, because removing pointers to
inactive entries from the binary heaps will necessitate a re-heap on
next access. However, if the total capacity of the data structures is
large compared to the number of entries actually in use, which will
usually be true, we'll have to re-heap much less often, because we
only have to do it when the number of allocations exhausts
*everything* on the free-list, rather than after every allocation.
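A sketch of that, with made-up names; RemoveInactiveEntriesFromHeaps()
is the part that walks the heaps dropping pointers to !in_use slots:

typedef struct RollbackRequest
{
    bool        in_use;
    FullTransactionId full_xid;
    /* ... undo pointers and other payload ... */
    int         next_free;      /* freelist link; -1 terminates */
} RollbackRequest;

static RollbackRequest *RequestArray;   /* RequestArraySize slots */
static int  RequestArraySize;
static int  FreeHead = -1;

static void RemoveInactiveEntriesFromHeaps(void);   /* made-up name */

static RollbackRequest *
AllocRollbackRequest(void)
{
    RollbackRequest *req;

    if (FreeHead == -1)
    {
        int     i;

        /*
         * Freelist exhausted.  Purge stale pointers from every binary
         * heap (forcing a re-heap on their next access), then thread
         * all inactive slots back onto the freelist.
         */
        RemoveInactiveEntriesFromHeaps();
        for (i = RequestArraySize - 1; i >= 0; i--)
        {
            if (!RequestArray[i].in_use)
            {
                RequestArray[i].next_free = FreeHead;
                FreeHead = i;
            }
        }
        if (FreeHead == -1)
            return NULL;        /* every slot genuinely in use */
    }

    req = &RequestArray[FreeHead];
    FreeHead = req->next_free;
    req->in_use = true;
    return req;
}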

A fourth possible approach is to enhance the simplehash mechanism to
allow us to do cleanup when an item to which there might still be
residual pointers is reused. We could allow some code supplied by the
definer of an individual simplehash implementation to be executed
inside SH_INSERT, just at the point where we're going to set an entry's
status to SH_STATUS_IN_USE. What we'd do is add a flag to the structure
indicating whether there might be deferred cleanup work for that
entry. Maybe it would be called something like 'bool processed' and
set when we process the undo work for that entry. If, when we're
about to reuse an entry, that flag is set, then we go scan all the
binary heaps and remove all entries for which that flag is set. And
then we unset the flag for all of those entries. Like the previous
approach, this is basically a refinement of the second approach in
that it tries to avoid re-heaping too often. Here, instead of
re-heaping once we've been through the entire free-list, we'll re-heap
when we (more or less randomly) happen to reuse a hash table entry
whose flag is set, but we avoid it when we happen to snag a hash table
entry whose flag is clear. This is probably
less efficient at avoiding re-heaping than the previous approach, but
it avoids a separately-allocated data structure, which is nice.
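To be clear about what I'm imagining here: none of this exists in
simplehash.h today, and SH_ON_REUSE is invented, but it would look
something like this:

#include "postgres.h"
#include "access/transam.h"
#include "utils/hashutils.h"    /* for hash_any() */

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;     /* key */
    char        status;             /* simplehash bookkeeping */
    bool        processed;          /* undo actions already executed? */
    /* ... payload ... */
} RollbackHashEntry;

static void PurgeProcessedEntriesFromHeaps(void);   /* made-up name;
                                                     * also clears the
                                                     * 'processed' flags */

/*
 * Hypothetical hook, to be invoked by SH_INSERT just before it flips an
 * entry to SH_STATUS_IN_USE.  If the slot we're about to recycle was
 * processed, there may be residual heap pointers to it (and to others
 * like it), so purge them all.
 */
#define SH_ON_REUSE(tb, entry) \
    do { \
        if ((entry)->processed) \
            PurgeProcessedEntriesFromHeaps(); \
    } while (0)

#define SH_PREFIX rollback
#define SH_ELEMENT_TYPE RollbackHashEntry
#define SH_KEY_TYPE FullTransactionId
#define SH_KEY full_xid
#define SH_HASH_KEY(tb, key) \
    DatumGetUInt32(hash_any((const unsigned char *) &(key), \
                            sizeof(FullTransactionId)))
#define SH_EQUAL(tb, a, b) FullTransactionIdEquals(a, b)
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"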

Broadly, you are correct to point out that you need to avoid chasing
stale pointers, and there are a bunch of ways to accomplish that:
approach #1 avoids using real pointers, and the rest just make sure
that any stale pointers don't stick around long enough to cause any
harm. There are probably also several other totally realistic
alternatives, and I don't know for sure what is best, or how much it
matters.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
