Re: POC: Cleaning up orphaned files using undo logs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-08 10:57:44
Message-ID: CAA4eK1+YraWu84HvfKmejO0us0VvdziAEBv5W6cQGzOgw-+KzQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > [ new patches ]
>
> I took a look at 0012 today, Amit's patch for extending the binary
> heap machinery, and 0013, Amit's patch for "Infrastructure to register
> and fetch undo action requests."
>

Thanks for looking into the patches.

> I don't think that binaryheap_allocate_shm() is a good design. It
> presupposes that we want to store the binary heap as its own chunk of
> shared memory allocated via ShmemInitStruct(), but we might want to do
> something else, like embed in another structure, store it in a DSM or
> DSA, etc., and this function can't do any of that. I think we should
> have something more like:
>
> extern Size binaryheap_size(int capacity);
> extern void binaryheap_initialize(binaryheap *, int capacity,
> binaryheap_comparator compare, void *arg);
>
> Then the caller can do something like:
>
> sz = binaryheap_size(capacity);
> bh = ShmemInitStruct(name, sz, &found);
> if (!found)
> binaryheap_initialize(bh, capacity, comparator, whatever);
>
> If it wants to get the memory in some other way, it just needs to
> initialize bh differently; the rest is the same. Note that there is
> no need, in this design, for binaryheap_size/initialize to make use of
> "shared" memory. They could equally well be used on backend-local
> memory. They do not need to care. You just provide the memory, and
> they do their thing.
>

I didn't have other use cases in mind, and I think to some extent this
argument holds true for the existing binaryheap_allocate as well. If
we want to make it more generic, shouldn't we also change the existing
binaryheap_allocate to use this new model, so that the whole binary
heap allocation API becomes more generic?
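
Something like the following is what I have in mind in binaryheap.c (a
minimal sketch, assuming the binaryheap struct keeps its current
flexible-array layout from src/include/lib/binaryheap.h):

Size
binaryheap_size(int capacity)
{
    return offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
}

void
binaryheap_initialize(binaryheap *heap, int capacity,
                      binaryheap_comparator compare, void *arg)
{
    heap->bh_space = capacity;
    heap->bh_compare = compare;
    heap->bh_arg = arg;
    heap->bh_size = 0;
    heap->bh_has_heap_property = true;
}

/* The existing palloc-based version then becomes a thin wrapper. */
binaryheap *
binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
{
    binaryheap *heap = (binaryheap *) palloc(binaryheap_size(capacity));

    binaryheap_initialize(heap, capacity, compare, arg);
    return heap;
}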

..
..
>
> Now, one objection to the above line of attack is the different queues
> actually contain different types of elements. Apparently, the XID
> queue contains elements of type UndoXidQueue and the size queue
> contains elements of type UndoSizeQueue. It is worth noting here that
> these are bad type names, because they sound like they are describing
> a type of queue, but it seems that they are actually describing an
> element in the queue. However, there are two larger problems:
>
> 1. I don't think we should have three different kinds of objects for
> each of the three different queues. It seems like it would be much
> simpler and easier to just have one kind of object that stores all the
> information we need (full_xid, start_urec_ptr, dbid, request_size,
> next_retry_at, error_ocurred_at) and use that everywhere. You could
> object that this would increase the storage space requirement,

Yes, this was the reason to keep them separate, but I see your point.

> but it
> wouldn't be enough to make any real difference and it probably would
> be well worth it for the avoidance of complexity.
>

Okay, will give it a try and see if it can avoid some code complexity.
Along with this, I will investigate your other suggestions related to
code improvements as well.
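
For the single object, something like this rough sketch is what I have
in mind (the field types are my assumption based on the fields you
listed; UndoRecPtr comes from the undo-log patches):

typedef struct UndoRequestInfo
{
    FullTransactionId full_xid;     /* transaction whose actions to undo */
    UndoRecPtr  start_urec_ptr;     /* start of the undo to be applied */
    Oid         dbid;               /* database the request belongs to */
    uint64      request_size;       /* amount of undo to process */
    TimestampTz next_retry_at;      /* when to retry a failed request */
    TimestampTz error_ocurred_at;   /* time of the last failed attempt */
} UndoRequestInfo;

All three queues (XID, size, and error-retry) could then store entries
of this one type instead of three separate ones.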

>
> On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork()
> and it passes the depth argument down to GetRollbackHashKeyFromQueue,
> which then does binaryheap_nth() on the relevant queue. Note that
> this function is another place that ends up duplicating code
> because of the questionable decision to have separate types of queue
> entries for each different queue; otherwise, it could probably just
> take the binary heap into which it's peeking as an argument instead of
> having three different cases. But that's not the main point here. The
> main point is that it calls a function for whichever type of queue
> we've got and gets some kind of queue entry using binaryheap_nth().
> But binaryheap_nth(whatever, 2) does not give you the third-smallest
> element in the binary heap. It gives you the third entry in the
> array, which may or may not have the heap property, but even if it
> does, the third element could be huge. Consider this binary heap:
>
> 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006
>
> This satisfies the binary heap property, because the element at
> position n is always smaller than the elements at positions 2n+1 and
> 2n+2 (assuming 0-based indexing). But if you want to look at the
> smallest three elements in the heap, you can't just look at indexes
> 0..2. The second-smallest element must be at index 1 or 2, but it
> could be either place. The third-smallest element could be the other
> of 1 and 2, or it could be either child of the smaller one, so there
> are three places it might be. In general, a binary heap is not a good
> data structure for finding the smallest N elements of a collection
> unless N is 1, and what's going to happen with what you've got here is
> that we'll sometimes prioritize an item that would not have been
> pulled from the queue for a long time over one that would have
> otherwise been processed much sooner.
>

You are right that it won't be the nth-smallest element from the
queue, but we don't actually need that here. The peeking logic is not
to find the next prioritized element but to check whether we can find
some element for the same database in the next few entries, so as to
avoid frequent undo worker restarts.
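
To make the intent concrete, the logic is roughly as below (a
simplified sketch; binaryheap_nth() is from the 0012 patch, the other
names are illustrative, and I'm assuming the queue stores pointers to
the request structs as Datums):

static UndoRequestInfo *
PeekQueueForMyDatabase(binaryheap *queue, int peek_depth)
{
    for (int n = 0; n < peek_depth && n < queue->bh_size; n++)
    {
        /* nth array entry, *not* the nth-smallest element */
        UndoRequestInfo *rinfo =
            (UndoRequestInfo *) DatumGetPointer(binaryheap_nth(queue, n));

        if (rinfo->dbid == MyDatabaseId)
            return rinfo;       /* found a request for our database */
    }

    return NULL;                /* nothing nearby; worker may exit */
}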

> I'm not sure that's a
> show-stopper, but it doesn't seem good, and the current patch doesn't
> seem to have any comments justifying it, or at least not in the places
> nearby to where this is actually happening.
>

I agree that we should add more comments explaining this.

> I think there are more problems here, too. Let's suppose that we
> fixed the problem described in the previous paragraph somehow, or
> decided that it won't actually make a big difference and just ignored
> it. Suppose further that we have N active databases which are
> generating undo requests. Luckily, we happen to also have N undo
> workers available, and let's suppose that as of a certain moment in
> time there is exactly one worker in each database. Think about what
> will happen when one of those workers goes to look for the next undo
> request. It's likely that the first request in the queue will be for
> some other database, so it's probably going to have to peek ahead to
> find a request for the database to which it's connected -- let's just
> assume that there is one. How far will it have to peek ahead? Well,
> if the requests are uniformly distributed across databases, each
> request has a 1-in-N chance of being the right one. I wrote a little
> Perl program to estimate the probability that we won't find the next
> request for our database within 10 requests as a function of the
> number of databases:
>
> 1 databases => failure chance with 10 lookahead is 0.00%
> 2 databases => failure chance with 10 lookahead is 0.10%
> 3 databases => failure chance with 10 lookahead is 1.74%
> 4 databases => failure chance with 10 lookahead is 5.66%
> 5 databases => failure chance with 10 lookahead is 10.74%
> 6 databases => failure chance with 10 lookahead is 16.18%
> 7 databases => failure chance with 10 lookahead is 21.45%
> 8 databases => failure chance with 10 lookahead is 26.31%
> 9 databases => failure chance with 10 lookahead is 30.79%
> 10 databases => failure chance with 10 lookahead is 34.91%
> 11 databases => failure chance with 10 lookahead is 38.58%
> 12 databases => failure chance with 10 lookahead is 41.85%
> 13 databases => failure chance with 10 lookahead is 44.91%
> 14 databases => failure chance with 10 lookahead is 47.69%
> 15 databases => failure chance with 10 lookahead is 50.12%
> 16 databases => failure chance with 10 lookahead is 52.34%
> 17 databases => failure chance with 10 lookahead is 54.53%
> 18 databases => failure chance with 10 lookahead is 56.39%
> 19 databases => failure chance with 10 lookahead is 58.18%
> 20 databases => failure chance with 10 lookahead is 59.86%
>
> Assuming my script (attached) doesn't have a bug, with only 8
> databases, there's better than a 1-in-4 chance that we'll fail to find
> the next entry for the current database within the lookahead window.
>

This is a good test scenario, but I think it doesn't account for the
fact that there are multiple queues and we peek into each one.
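
For a single queue, your numbers closely match the closed form
(1 - 1/N)^10, i.e. the chance that none of the ten peeked entries
belongs to our database; with three queues (XID, size, and
error-retry) peeked one after another, the effective miss chance
should be lower, although the queues are not fully independent since
they contain the same requests. A quick check of the closed form:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* single-queue probability that all 10 peeked entries miss */
    for (int n = 1; n <= 20; n++)
        printf("%2d databases => failure chance with 10 lookahead is %.2f%%\n",
               n, pow(1.0 - 1.0 / n, 10) * 100.0);
    return 0;
}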

> That's bad, because then the worker will be sitting around waiting
> when it should be doing stuff. Maybe it will even exit, even though
> there's work to be done, and even though all the other databases have
> their own workers already.
>

I think we should first try an actual program that can test such a
scenario on the undo patches before reaching any conclusion. I or one
of my colleagues will work on this and report back the results.

> You can construct way worse examples than
> this one, too: imagine that there are two databases, each with a
> worker, and one has 99% of the requests and the other one has 1% of
> the requests. It's really unlikely that there's going to be an entry
> for the second database within the lookahead window.
>

I am not sure that is the case, because as soon as the request from
the other database gets prioritized (say because its XID becomes
older) and comes up as the first request in one of the queues, the
undo worker will exit (provided it has worked for some threshold time,
say 10s, in that database) and allow the request from the other
database to be processed.

> And note that
> increasing the window doesn't really help either: you just need more
> databases than the size of the lookahead window, or even almost as
> many as the lookahead window, and things are going to stop working
> properly.
>
> On the other hand, suppose that you have 10 databases and one undo
> worker. One database is pretty active and generates a continuous
> stream of undo requests at exactly the same speed we can process them.
> The others all have 1 pending undo request. Now, what's going to
> happen is that you'll always find the undo request for the current
> database within the lookahead window. So, you'll never exit.
>

Following the logic given above, I think here also the worker will
exit as soon as the request from another database gets prioritized.

> But
> that means the undo requests in the other 9 databases will just sit
> there for all eternity, because there's no other worker to process
> them. On the other hand, if you had 11 databases, there's a good
> chance it would work fine, because the new request for the active
> database would likely be outside the lookahead window, and so you'd
> find no work to do and exit, allowing a worker to be started up in
> some other database.
>

As explained above, I think it will work the same way for both 10 and
11 databases. Note that we don't always try to look ahead; we look
ahead only when we have not worked on the current database for some
threshold amount of time.
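
To make the intended behavior concrete, it is roughly like the
following sketch (the constant and the helper are hypothetical names,
not the patch's; TimestampDifferenceExceeds() is the existing utility
from utils/timestamp.h):

#define UNDO_WORKER_QUANTUM_MS  10000       /* threshold time: 10s */

    if (TimestampDifferenceExceeds(worker_started_at,
                                   GetCurrentTimestamp(),
                                   UNDO_WORKER_QUANTUM_MS))
    {
        /*
         * We have worked on this database for the threshold time.  If
         * the highest-priority request in any queue now belongs to
         * another database, exit so that a worker can be started
         * there; otherwise keep processing this database's requests.
         */
        if (TopRequestIsForOtherDatabase())
            proc_exit(0);
    }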

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
