Re: POC: Cleaning up orphaned files using undo logs

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-05 20:16:51
Message-ID: CA+TgmoZ5g7UzMvM_42YMG8nbhOYpH+u5OMMnePJkYtT5HWotUw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> [ new patches ]

I took a look at 0012 today, Amit's patch for extending the binary
heap machinery, and 0013, Amit's patch for "Infrastructure to register
and fetch undo action requests."

I don't think that binaryheap_allocate_shm() is a good design. It
presupposes that we want to store the binary heap as its own chunk of
shared memory allocated via ShmemInitStruct(), but we might want to do
something else, like embed in another structure, store it in a DSM or
DSA, etc., and this function can't do any of that. I think we should
have something more like:

extern Size binaryheap_size(int capacity);
extern void binaryheap_initialize(binaryheap *, int capacity,
                                  binaryheap_comparator compare, void *arg);

Then the caller can do something like:

sz = binaryheap_size(capacity);
bh = ShmemInitStruct(name, sz, &found);
if (!found)
    binaryheap_initialize(bh, capacity, comparator, whatever);

If it wants to get the memory in some other way, it just needs to
initialize bh differently; the rest is the same. Note that there is
no need, in this design, for binaryheap_size/initialize to make use of
"shared" memory. They could equally well be used on backend-local
memory. They do not need to care. You just provide the memory, and
they do their thing.
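
To be concrete, here's roughly what I'm imagining -- just a sketch,
assuming the existing binaryheap layout where bh_nodes is a flexible
array member:

Size
binaryheap_size(int capacity)
{
    /* header plus per-node Datum array */
    return add_size(offsetof(binaryheap, bh_nodes),
                    mul_size(sizeof(Datum), capacity));
}

void
binaryheap_initialize(binaryheap *heap, int capacity,
                      binaryheap_comparator compare, void *arg)
{
    /* caller has provided binaryheap_size(capacity) bytes */
    heap->bh_space = capacity;
    heap->bh_size = 0;
    heap->bh_has_heap_property = true;
    heap->bh_compare = compare;
    heap->bh_arg = arg;
}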

I wasn't very happy about binaryheap_nth(), binaryheap_remove_nth(),
and binaryheap_remove_nth_unordered() and started looking at how they
are used to try to see if there might be a better way. That led me to
look at 0013. Unfortunately, I find it really hard to understand what
this code is actually doing. There's a lot of redundant and
badly-written stuff in here. As a general principle, if you have two
or three data structures of some particular type, you don't write a
separate family of functions for manipulating each one. You write one
function for each operation, and you pass the particular copy of the
data structure with which you are working as an argument.

In the lengthy section of macro definitions at the top of
undorequest.c, we have macros InitXidQueue, XidQueueIsEmpty,
GetXidQueueSize, GetXidQueueElem, GetXidQueueTopElem,
GetXidQueueNthElem, and SetXidQueueElem. Several of these are used in
only one place or are not used anywhere at all; those should be
removed altogether and inlined into the single call site if there is
one. Then, after this, there is a matching set of macros,
InitSizeQueue, SizeQueueIsEmpty, GetSizeQueueSize, GetSizeQueueElem,
GetSizeQueueTopElem, GetSizeQueueNthElem, and SetSizeQueueElem. Many
of these macros are exactly the same as the previous set of macros
except that they operate on a different queue, which as I mentioned in
the previous paragraph, is not a good design. It leads to extensive
code duplication.

Look, for example, at RemoveOldElemsFromSizeQueue and
RemoveOldElemsFromXidQueue. They are basically identical except for
s/Size/Xid/g and s/SIZE/XID/g, but you can't unify them easily because
they are calling different functions. However, if you didn't have one
function called GetSizeQueueSize and another called GetXidQueueSize,
but just had a pointer to the relevant binary heap, then both
functions could just call binaryheap_empty() on it, which would be
better style, use fewer macros, generate less machine code, and be
easier to read. Ideally, you'd get to the point where you could just
have one function rather than two, and pass the queue upon which it
should operate as an argument. There seems to be a good deal of this
kind of duplication in this file and it really needs to be cleaned up.
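
In other words, roughly this, where UndoRequestIsOld() is a
hypothetical stand-in for whatever staleness test the real code needs
to apply:

static void
RemoveOldElemsFromQueue(binaryheap *queue)
{
    /* discard stale entries from the front of the queue */
    while (!binaryheap_empty(queue) &&
           UndoRequestIsOld(binaryheap_first(queue)))
        (void) binaryheap_remove_first(queue);
}

One function, any queue.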

Now, one objection to the above line of attack is that the different
queues actually contain different types of elements. Apparently, the XID
queue contains elements of type UndoXidQueue and the size queue
contains elements of type UndoSizeQueue. It is worth noting here that
these are bad type names, because they sound like they are describing
a type of queue, but it seems that they are actually describing an
element in the queue. However, there are two larger problems:

1. I don't think we should have three different kinds of objects for
each of the three different queues. It seems like it would be much
simpler and easier to just have one kind of object that stores all the
information we need (full_xid, start_urec_ptr, dbid, request_size,
next_retry_at, error_ocurred_at) and use that everywhere. You could
object that this would increase the storage space requirement, but it
wouldn't be enough to make any real difference and it probably would
be well worth it for the avoidance of complexity.

2. However, I don't think we should have a separate request object for
each queue anyway. We should insert pointers to the same objects in
all the relevant queues (either size + XID, or else error). So instead
of having three sets of objects, one for each queue, we'd just have
one set of objects and point to them with as many as two pointers.
We'd therefore need LESS memory than we're using today, because we
wouldn't have separate arrays for XID, size, and error queue elements.

In fact, it seems to me that we shouldn't have any such thing as
"queue entries" at all. The queues should just be pointing to
RollbackHashEntry *, and we should add all the fields there that are
present in any of the "queue entry" structures. This would use less
memory still.
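
Concretely, something like this -- the field list is just the union of
what's mentioned above, and the exact types are guesses on my part:

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;        /* hash key */
    UndoRecPtr      start_urec_ptr;
    Oid             dbid;
    Size            request_size;
    TimestampTz     next_retry_at;
    TimestampTz     error_ocurred_at;  /* [sic] spelling from the patch */
    bool            active;            /* see discussion below */
} RollbackHashEntry;

Each binary heap would then just hold PointerGetDatum(entry) for the
relevant entries.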

I also think we should be using simplehash rather than dynahash. I'm
not sure that I would really say that simplehash is "simple," but it
does have a nicer API and simpler memory management. There's just a
big old block of memory, and there's no incremental allocation. That
keeps things simple for the code that wants to go through the queues
and remove dangling pointers. I think that the way this should work
is that each RollbackHashEntry * should contain a field "bool active."
Then:

1. When we pull an item out of one of the binary heaps, we check the
active flag. If it's clear, we ignore the entry and pull the next
item. If it's set, we clear the flag and process the item, so that
if it's subsequently pulled from the other queue it will be ignored
(see the sketch below).

2. If a binary heap is full when we need to insert into it, we can
iterate over all of the elements and throw away any that are !active.
They've already been dequeued and processed from some other queue, so
they're not "really" in this queue any more, even though we haven't
gone to the trouble of actually kicking them out yet.
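
In code, point 1 might look about like this -- again just a sketch,
assuming queue entries are RollbackHashEntry pointers as above, and
with GetNextRollbackRequest as a name I'm making up:

static RollbackHashEntry *
GetNextRollbackRequest(binaryheap *queue)
{
    while (!binaryheap_empty(queue))
    {
        RollbackHashEntry *rh = (RollbackHashEntry *)
            DatumGetPointer(binaryheap_remove_first(queue));

        if (rh->active)
        {
            /* claim it, so that the other queues will skip it */
            rh->active = false;
            return rh;
        }
        /* else: already dequeued via some other queue; discard */
    }
    return NULL;
}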

On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork()
and it passes the depth argument down to GetRollbackHashKeyFromQueue,
which then does binaryheap_nth() on the relevant queue. Note that
this function is another place that ends up duplicating code
because of the questionable decision to have separate types of queue
entries for each different queue; otherwise, it could probably just
take the binary heap into which it's peeking as an argument instead of
having three different cases. But that's not the main point here. The
main point is that it calls a function for whichever type of queue
we've got and gets some kind of queue entry using binaryheap_nth().
But binaryheap_nth(whatever, 2) does not give you the third-smallest
element in the binary heap. It gives you the third entry in the
array, which may or may not have the heap property, but even if it
does, the third element could be huge. Consider this binary heap:

0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006

This satisfies the binary heap property, because the element at
position n is always smaller than the elements at positions 2n+1 and
2n+2 (assuming 0-based indexing). But if you want to look at the
smallest three elements in the heap, you can't just look at indexes
0..2. The second-smallest element must be at index 1 or 2, but it
could be either place. The third-smallest element could be the other
of 1 and 2, or it could be either child of the smaller one, so there
are three places it might be. In general, a binary heap is not a good
data structure for finding the smallest N elements of a collection
unless N is 1, and what's going to happen with what you've got here is
that we'll sometimes prioritize an item that would not have been
pulled from the queue for a long time over one that would have
otherwise been processed much sooner. I'm not sure that's a
show-stopper, but it doesn't seem good, and the current patch doesn't
seem to have any comments justifying it, or at least not in the places
nearby to where this is actually happening.
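
The only way a binary heap will hand you the k smallest items in order
is by actually popping them and then pushing back whatever you didn't
consume, along these lines (PeekSmallest is a made-up name):

static int
PeekSmallest(binaryheap *queue, Datum *out, int k)
{
    int     n = 0;

    /* pop the k smallest entries, in order: O(k log n) */
    while (n < k && !binaryheap_empty(queue))
        out[n++] = binaryheap_remove_first(queue);

    /* push them all back; the copies in out[] remain valid */
    for (int i = 0; i < n; i++)
        binaryheap_add(queue, out[i]);

    return n;
}

That's considerably more expensive than binaryheap_nth(), but unlike
binaryheap_nth() it gives the right answer.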

I think there are more problems here, too. Let's suppose that we
fixed the problem described in the previous paragraph somehow, or
decided that it won't actually make a big difference and just ignored
it. Suppose further that we have N active databases which are
generating undo requests. Luckily, we happen to also have N undo
workers available, and let's suppose that as of a certain moment in
time there is exactly one worker in each database. Think about what
will happen when one of those workers goes to look for the next undo
request. It's likely that the first request in the queue will be for
some other database, so it's probably going to have to peek ahead to
find a request for the database to which it's connected -- let's just
assume that there is one. How far will it have to peek ahead? Well,
if the requests are uniformly distributed across databases, each
request has a 1-in-N chance of being the right one. I wrote a little
Perl program to estimate the probability that we won't find the next
request for our database within 10 requests as a function of the
number of databases:

1 databases => failure chance with 10 lookahead is 0.00%
2 databases => failure chance with 10 lookahead is 0.10%
3 databases => failure chance with 10 lookahead is 1.74%
4 databases => failure chance with 10 lookahead is 5.66%
5 databases => failure chance with 10 lookahead is 10.74%
6 databases => failure chance with 10 lookahead is 16.18%
7 databases => failure chance with 10 lookahead is 21.45%
8 databases => failure chance with 10 lookahead is 26.31%
9 databases => failure chance with 10 lookahead is 30.79%
10 databases => failure chance with 10 lookahead is 34.91%
11 databases => failure chance with 10 lookahead is 38.58%
12 databases => failure chance with 10 lookahead is 41.85%
13 databases => failure chance with 10 lookahead is 44.91%
14 databases => failure chance with 10 lookahead is 47.69%
15 databases => failure chance with 10 lookahead is 50.12%
16 databases => failure chance with 10 lookahead is 52.34%
17 databases => failure chance with 10 lookahead is 54.53%
18 databases => failure chance with 10 lookahead is 56.39%
19 databases => failure chance with 10 lookahead is 58.18%
20 databases => failure chance with 10 lookahead is 59.86%
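
For what it's worth, under the uniformity assumption there's also a
closed form: the failure chance is just (1 - 1/N)^10. A few lines of C
reproduce the table above to within simulation noise:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    for (int n = 1; n <= 20; n++)
        printf("%d databases => failure chance with 10 lookahead is %.2f%%\n",
               n, pow(1.0 - 1.0 / n, 10) * 100.0);
    return 0;
}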

Assuming my script (attached) doesn't have a bug, with only 8
databases, there's better than a 1-in-4 chance that we'll fail to find
the next entry for the current database within the lookahead window.
That's bad, because then the worker will be sitting around waiting
when it should be doing stuff. Maybe it will even exit, even though
there's work to be done, and even though all the other databases have
their own workers already. You can construct way worse examples than
this one, too: imagine that there are two databases, each with a
worker, and one has 99% of the requests and the other one has 1% of
the requests. It's really unlikely that there's going to be an entry
for the second database within the lookahead window. And note that
increasing the window doesn't really help either: you just need more
databases than the size of the lookahead window, or even almost as
many as the lookahead window, and things are going to stop working
properly.

On the other hand, suppose that you have 10 databases and one undo
worker. One database is pretty active and generates a continuous
stream of undo requests at exactly the same speed we can process them.
The others all have 1 pending undo request. Now, what's going to
happen is that you'll always find the undo request for the current
database within the lookahead window. So, you'll never exit. But
that means the undo requests in the other 9 databases will just sit
there for all eternity, because there's no other worker to process
them. On the other hand, if you had 11 databases, there's a good
chance it would work fine, because the new request for the active
database would likely be outside the lookahead window, and so you'd
find no work to do and exit, allowing a worker to be started up in
some other database. It would in turn exit and so on and you'd clear
the backlog for the other databases at least for a while, until you
picked the active database again. Actually, I haven't looked at the
whole patch set, so perhaps there is some solution to this problem
contemplated somewhere, but I consider this argument to be pretty good
evidence that a fixed lookahead distance is probably the wrong thing.

The right things to do about these problems probably need some
discussion, but here's the best idea I have off-hand: instead of just
having 3 binary heaps (size, XID, error), have N+1 "undo work trackers",
each of which contains 3 binary heaps (size, XID, error). Undo work
tracker #0 contains all requests that are not assigned to any other
undo work tracker. Each of undo work trackers #1..N contains all the
requests for one particular database, but they all start out unused.
Before launching an undo worker for a particular database, the
launcher must check whether it has an undo work tracker allocated to
that database. If not, it allocates one and moves all the work for
that database out of tracker #0 and into the newly-allocated tracker.
If there are none free, it must first deallocate an undo work tracker,
moving any remaining work for that tracker back into tracker #0. With
this approach, there's no need for lookahead, because every worker is
always pulling from a queue that is database-specific, so the next
entry is always guaranteed to be relevant. And you choose N to be
equal to the number of workers, so that even if every worker is in a
separate database there will be enough trackers for all workers to
have one, plus tracker #0 for whatever's left.
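
As a data structure, I'm imagining something like this (all names
invented for illustration):

typedef struct UndoWorkTracker
{
    Oid         dbid;   /* InvalidOid for tracker #0, or while unused */
    binaryheap *xid_queue;
    binaryheap *size_queue;
    binaryheap *error_queue;
} UndoWorkTracker;

/*
 * trackers[0] holds all unassigned requests; trackers[1..N] get
 * allocated to individual databases on demand, with N equal to the
 * number of undo workers.
 */
static UndoWorkTracker *trackers;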

There still remains the problem of figuring out when a worker should
terminate to allow for new workers to be launched, which is a fairly
complex problem that deserves its own discussion, but I think this
design helps. At the very least, you can see whether tracker #0 is
empty. If it is, you might still want to rebalance workers between
databases, but you don't really need to worry about databases getting
starved altogether, because you know that you can run a worker for
every database that has any pending undo. If tracker #0 is non-empty
but you have unused workers, you can just allocate trackers for the
databases in tracker #0 and move stuff over there to be processed. If
tracker #0 is non-empty and all workers are allocated, you are going
to need to ask one of them to exit at some point, to avoid starvation.
I don't know exactly what the algorithm for that should be; I do have
some ideas. I'm not going to include them in this email though,
because this email is already long and I don't have time to make it
longer right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
lookahead-failure-chance.pl text/x-perl-script 376 bytes
