Re: should there be a hard-limit on the number of transactions pending undo?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: should there be a hard-limit on the number of transactions pending undo?
Date: 2019-07-29 19:11:53
Message-ID: CAH2-WzngCCL6nidtrcoTjta2GZP7t-OLkK=6US0-kBCUuXjbGw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 29, 2019 at 9:35 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I mean, essentially any well-designed framework intended for any sort
> of task whatsoever is going to have a design center where one can
> foresee that it will work well, and then as a result of working well
> for the thing for which it was designed, it will also work well for
> other things that are sufficiently similar. So, I think you're
> correct, but I also don't think that's really saying very much.

I agree that it's quite unclear how important this is. I don't
necessarily think it matters if zheap doesn't do that well with GIN
indexes. I think it's probably going to be useful to imagine how GIN
indexing might work for zheap because it clarifies the strengths and
weaknesses of your design. It's perfectly fine for there to be
weaknesses, provided that they are well understood.

> The
> trick is to figure out whether and how the ideas you have could be
> generalized with reasonable effort to handle other cases, and that's
> easier with some projects than others. I think when it comes to UNDO,
> it's actually really hard.

I agree.

> Unfortunately, we haven't had many takers so far -- thanks for chiming
> in.

I don't have the ability to express my general concerns here in a very
crisp way. This is complicated stuff. Thanks for tolerating the
hand-wavy nature of my feedback about this.

> I don't really understand your comments about GIN. My simplistic
> understanding of GIN is that it's not very different from btree in
> this regard.

GIN is quite similar to btree from a Postgres point of view -- GIN is
simply a btree that is good at storing duplicates (and has higher
level infrastructure to make things like FTS work). So I'd say that
your understanding is fairly complete, at least as far as traditional
Postgres goes. But if we imagine a system in which we have to roll
back in indexes, it's quite a different story. See my remarks to
Thomas just now about that.
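
To make that concrete, here is a toy sketch in C (invented layouts,
nothing like the actual on-disk formats) of the difference: nbtree
stores a full entry per (key, TID) pair, while GIN stores each key once
with a posting list of every matching TID:

#include <stdint.h>

/* Invented layouts for illustration -- not PostgreSQL's real formats */

typedef struct ItemPointer      /* a heap TID: (block, offset) */
{
    uint32_t block;
    uint16_t offset;
} ItemPointer;

/* nbtree-style: one full index entry per (key, TID) pair, so ten
 * duplicates of a key mean ten entries */
typedef struct BTreeEntry
{
    int64_t     key;
    ItemPointer tid;
} BTreeEntry;

/* GIN-style: the key appears once, followed by a posting list of all
 * matching TIDs -- possibly hundreds for a common lexeme */
typedef struct GinEntry
{
    int64_t     key;
    int         ntids;
    ItemPointer tids[];         /* flexible array member: posting list */
} GinEntry;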

> Suppose we insert a row, and then the insert aborts;
> suppose also that the index wants to use UNDO. In the case of a btree
> index, we're going to go insert an index entry for the new row; upon
> abort, we should undo the index insertion by removing that index tuple
> or at least marking it dead. Unless a page split has happened,
> triggered either by the insertion itself or by subsequent activity,
> this puts the index in a state that is almost perfectly equivalent to
> where we were before the now-aborted transaction did any work. If a
> page split has occurred, trying to undo the index insertion is going
> to run into two problems. One, we probably can't undo the page split,
> so the index will be logically equivalent but not physically
> equivalent after we get rid of the new tuple. Two, if the page split
> happened after the insertion of the new tuple rather than at the same
> time, the index tuple may not be on the page where we left it.

Actually, page splits are the archetypal case where undo cannot
restore the original physical state. In general, we cannot expect the
undo process to reverse page splits. Undo might be able to merge the
pages back together, but then again it might not. Most likely that
won't look terribly different from the page merging that can follow
deletes that commit.

Some other systems have something called "system transactions" for
things like page splits. These don't need to have their commit record
flushed synchronously, and they run in the foreground of the xact that
needs to split the page. That way, rollback doesn't have to concern
itself with rolling back things that are pretty much impossible to
roll back, like page splits.
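
Here is a minimal sketch of that control flow in C, with invented
names -- this isn't how zheap or any real system spells it. The split
runs inside a short-lived system transaction that commits on its own,
so the user transaction's undo chain never contains the split:

#include <stdbool.h>
#include <stdio.h>

typedef struct Xact { const char *who; } Xact;

static Xact *
system_xact_begin(void)
{
    static Xact sx = { "system" };

    return &sx;
}

static void
system_xact_commit(Xact *sx)
{
    /* the commit record does not need a synchronous WAL flush */
    printf("commit %s xact (async flush is fine)\n", sx->who);
}

static bool
page_has_room(int page)
{
    (void) page;
    return false;               /* force a split, for the demo */
}

static void
split_page(Xact *x, int page)
{
    printf("%s xact splits page %d\n", x->who, page);
}

static void
log_insert_with_undo(Xact *x, int page, long key)
{
    /* only this step lands in the user xact's undo chain */
    printf("%s xact inserts key %ld on page %d (undoable)\n",
           x->who, key, page);
}

static void
btree_insert(Xact *ux, int page, long key)
{
    if (!page_has_room(page))
    {
        Xact *sx = system_xact_begin();

        split_page(sx, page);   /* structural change owned by sx */
        system_xact_commit(sx); /* durable independently of ux */
    }
    log_insert_with_undo(ux, page, key);
}

int
main(void)
{
    Xact ux = { "user" };

    btree_insert(&ux, 7, 42);
    /* if ux aborts now, undo removes key 42 but never un-splits page 7 */
    return 0;
}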

> Now, my mental model of a GIN index is that you go find N>=0 index
> keys inside each value and do basically the same thing as you would
> for a btree index for each one of them. Therefore it seems to me,
> possibly stupidly, that you're going to have basically the same
> problems, except each problem will now potentially happen up to N
> times instead of up to 1 time. I assume here that in either case -
> GIN or btree - you would tentatively record where you left the tuple
> that now needs to be zapped and that you can jump to that place
> directly to try to zap it. Possibly those assumptions are bad and
> maybe that's where you're seeing a concurrency/deadlock problem; if
> so, a more detailed explanation would be very helpful.

Imagine a world in which zheap cannot just create a new TID (or HOT
chain) for the same logical tuple, which is something that I believe
should be an important goal for zheap (again, see my remarks to
Thomas). Simplicity for rollbacks in access methods like GIN demands
that you lock the entire index tuple, which may point to hundreds of
logical rows (or TIDs, since they have a 1:1 correspondence with
logical rows in this imaginary world). Rolling back with more granular
locking seems very hard, for the same reason that rolling back a page
split would be very hard -- you cannot possibly keep enough bookkeeping
information to make that work sanely in the face of concurrent
insertions that may themselves commit or abort unpredictably. It seems
necessary to bake concurrency control into rollback at the index
access method level in order to get significant benefits from a design
like zheap.
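
To illustrate the coarse-locking point with a toy posting-list layout
(made up for this sketch -- real GIN is nothing like this): one lock
covers the whole entry, so undoing a single aborted TID is a simple
re-find-and-remove, whereas per-TID locking would need the kind of
bookkeeping I'm arguing is impossible to maintain:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>

typedef struct { uint32_t block; uint16_t offset; } Tid;

/*
 * One GIN-style entry: a single key whose posting list may point to
 * hundreds of logical rows.  The one lock guards every TID below.
 */
typedef struct
{
    int64_t         key;
    pthread_mutex_t lock;
    int             ntids;
    Tid             tids[256];
} GinEntry;

/*
 * Undo an aborted insertion of 'tid'.  With the entire entry locked,
 * concurrent inserters (which may themselves commit or abort) are
 * simply excluded, and no per-TID bookkeeping is needed.
 */
static void
gin_undo_insert(GinEntry *e, Tid tid)
{
    pthread_mutex_lock(&e->lock);
    for (int i = 0; i < e->ntids; i++)
    {
        if (e->tids[i].block == tid.block &&
            e->tids[i].offset == tid.offset)
        {
            memmove(&e->tids[i], &e->tids[i + 1],
                    (e->ntids - i - 1) * sizeof(Tid));
            e->ntids--;
            break;
        }
    }
    pthread_mutex_unlock(&e->lock);
}

int
main(void)
{
    GinEntry e = { .key = 42, .ntids = 3,
                   .tids = { {1, 1}, {1, 2}, {2, 1} } };

    pthread_mutex_init(&e.lock, NULL);
    gin_undo_insert(&e, (Tid) { 1, 2 });
    printf("posting list now has %d TIDs\n", e.ntids);
    return 0;
}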

Now, maybe zheap should be permitted to not work particularly well
with GIN, while teaching btree to take advantage of the common case
where we can roll everything back, even in indexes (so zheap behaves
much more like heapam when you have a GIN index, which is hopefully
not that common). That could be a perfectly reasonable restriction.
But ISTM that you need to make heap TIDs completely stable for the
case that zheap is expected to excel at. You also need to teach nbtree
to take advantage of this by rolling back if and when it's safe to do
so (when we know that heap TIDs are stable for the indexed table).

In general, the only way that rolling back changes to indexes can work
is by making heap TIDs completely stable. Any design for rollback in
nbtree that allows there to be multiple entries for the same logical
row in the index seems like a disaster to me. Are you really going to
put forwarding information in the index that mirrors what has happened
in the table?

> To me, based on my more limited knowledge of indexing, I'm not really
> seeing a concurrency/deadlock issue, but I do see that there's going
> to be a horrid efficiency problem if page splits are common.

I'm not worried about rolling back page splits. That seems to present
us with exactly the same issues as rolling back in GIN indexes
reliably (i.e. problems that are practically impossible to solve, or
at least don't seem worth solving).

> This seems to be a very
> general problem with making undo and indexes work nicely together:
> almost any index type has to sometimes move tuples around to different
> pages, which makes finding them a lot more expensive than re-finding a
> heap tuple.

Right. That's why undo is totally logical in indexes. And it's why you
cannot expect to roll back page splits.
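
To show what "totally logical" means here, a sketch with an invented
flat layout (real nbtree pages look nothing like this): undo re-finds
the entry by (key, TID) and marks it dead, rather than trusting any
physical location remembered at insert time -- and that re-find is only
well-defined if the heap TID is stable:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    int64_t  key;
    uint32_t tid;               /* heap row this entry points at */
    bool     dead;
} IndexEntry;

/*
 * Logical undo of an aborted index insertion: binary search on the key,
 * then scan its duplicates for our TID and mark the entry dead.  A
 * (page, offset) recorded at insert time would be useless -- a page
 * split may have moved the entry since.  The re-find is what makes
 * index undo more expensive than heap undo, where the TID itself still
 * locates the tuple.
 */
static bool
index_undo_insert(IndexEntry *entries, int n, int64_t key, uint32_t tid)
{
    int lo = 0,
        hi = n;

    while (lo < hi)             /* find the first entry with this key */
    {
        int mid = lo + (hi - lo) / 2;

        if (entries[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    for (; lo < n && entries[lo].key == key; lo++)
    {
        if (entries[lo].tid == tid)
        {
            entries[lo].dead = true;
            return true;
        }
    }
    return false;               /* already cleaned up by someone else */
}

int
main(void)
{
    IndexEntry ix[] = {
        { 10, 100, false },
        { 20, 200, false },
        { 20, 201, false },
        { 30, 300, false },
    };

    if (index_undo_insert(ix, 4, 20, 201))
        printf("entry (20, 201) marked dead\n");
    return 0;
}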

> I think that most of the above is a bit of a diversion from the
> original topic of the thread. I think I see the connection you're
> making between the two topics: the more likely undo application is to
> fail, the more worrying a hard limit is, and deadlocks are a way for
> undo application to fail, and if that's likely to be common when undo
> is applied to indexes, then undo failure will be common and a hard
> limit is bad.

This is an awkward thing to discuss, because it involves so many
interrelated moving parts. And because I know that I could easily miss
quite a bit about the zheap design. Forgive me if I've hijacked the
thread.

> However, I think the solution to that problem is
> page-at-a-time undo: if a foreground process needs to modify a page with
> pending undo, and if the modification it wants to make can't be done
> sensibly unless the undo is applied first, it should be prepared to
> apply that undo itself - just for that page - rather than wait for
> somebody else to get it done. That's important not only for deadlock
> avoidance - though deadlock avoidance is certainly a legitimate
> concern - but also because the change might be part of some gigantic
> rollback that's going to take an hour, and waiting for the undo to hit
> all the other pages before it gets to this one will make users very
> sad.

It's something that users in certain other systems (though certainly
not all other systems) have had to live with for some time.
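
Here is a toy sketch of the page-at-a-time scheme quoted above
(invented structures -- not the actual undo patch's API): before making
its own change, the foreground process applies whatever undo is pending
on the one page it needs, rather than waiting for a background rollback
to reach it:

#include <stdio.h>

#define MAX_UNDO 8

typedef struct
{
    int pending_undo[MAX_UNDO]; /* undo record ids touching this page */
    int npending;
    int nlive;                  /* stand-in for the page's contents */
} Page;

static void
apply_one_undo_record(Page *p, int undo_id)
{
    printf("applying undo record %d to this page only\n", undo_id);
    p->nlive--;                 /* e.g. remove an aborted insertion */
}

/*
 * The foreground process never waits on whoever owns the big rollback:
 * it applies the pending undo for this page itself, then proceeds.  No
 * waiting means no deadlock, and no hour-long stall behind a gigantic
 * abort that hasn't reached this page yet.
 */
static void
modify_page(Page *p)
{
    while (p->npending > 0)
        apply_one_undo_record(p, p->pending_undo[--p->npending]);

    p->nlive++;                 /* now make our own change */
    printf("page modified; live tuples = %d\n", p->nlive);
}

int
main(void)
{
    Page p = { .pending_undo = { 101, 102 }, .npending = 2, .nlive = 5 };

    modify_page(&p);
    return 0;
}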

SQL Server 2019 has something called "instantaneous transaction
rollback", which seems to make SQL Server optionally behave a lot more
like Postgres [1], apparently with many of the same disadvantages as
Postgres. I agree that there is probably a middle way that more or
less has the advantages of both approaches. I don't really know what
that should look like, though.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf
--
Peter Geoghegan
