From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: should there be a hard-limit on the number of transactions pending undo?
Date: 2019-07-19 22:47:23
Message-ID: CAH2-Wzk06ypb40z3B8HFiSsTVg961=E0=uQvqARJgT8_4QB2Mg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 19, 2019 at 12:52 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Right, that's definitely a big part of the concern here, but I don't
> really believe that retaining locks is absolutely required, or even
> necessarily desirable. For instance, suppose that I create a table,
> bulk-load a whole lotta data into it, and then abort. Further suppose
> that by the time we start trying to process the undo in the
> background, we can't get the lock. Well, that probably means somebody
> is performing DDL on the table.

I believe that the primary reason why certain other database systems
retain locks until rollback completes (or release their locks in
reverse order, as UNDO processing progresses) is that application code
will often repeat exactly the same actions on receiving a transient
error, until the action finally completes successfully. Just like with
serialization failures, or with manually implemented UPSERT loops that
must sometimes retry. This is why UNDO is often (or always) processed
synchronously, blocking progress of the client connection as its xact
rolls back.
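
For example, the statement stream that a typical retry loop sends on a
serialization failure looks something like this (hypothetical "accounts"
table):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    -- ERROR:  could not serialize access due to concurrent update
    ROLLBACK;
    -- the application immediately re-issues the identical statements,
    -- expecting to take exactly the locks it just gave up:
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    COMMIT;

The retry begins the instant the client sees the error, which is why it
matters whether the failed attempt's rollback work is already finished
(or its locks are still held) at that point.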

Obviously these other systems could easily hand off the work of
rolling back the transaction to an asynchronous worker process, and
return success to the client that encounters an error (or asks to
abort/roll back) almost immediately. I have to imagine that they
haven't implemented this straightforward optimization because it makes
sense that the cost of rolling back the transaction is primarily borne
by the client that actually rolls back. And, as I said, because a lot
of application code will immediately retry on failure, and that retry
must not deadlock with an asynchronous rollback process.
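
To sketch the hazard (hypothetical single-column table t, and assuming a
design in which the aborting transaction's locks are released before its
undo has been applied):

    -- session 1: the abort returns to the client immediately; undo for t
    -- is queued for a background worker, and the locks are released
    BEGIN;
    INSERT INTO t SELECT generate_series(1, 10000000);
    ROLLBACK;

    -- the client retries the identical work right away, writing to the
    -- same pages that the undo worker is still cleaning up:
    BEGIN;
    INSERT INTO t SELECT generate_series(1, 10000000);
    COMMIT;

Whether the retry simply waits behind the undo worker, interleaves with
it, or deadlocks against it comes down to which locks the worker takes,
and in what order.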

> If they just did LOCK TABLE or ALTER
> TABLE SET STATISTICS, we are going to need to execute that same undo
> once the DDL is complete. However, if the DDL is DROP TABLE, we're
> going to find that once we can get the lock, the undo is obsolete, and
> we don't need to worry about it any more. Had we made it 100% certain
> that the DROP TABLE couldn't go through until the undo was performed,
> we could avoid having to worry about the undo having become obsolete
> ... but that's hardly a win. We're better off allowing the drop and
> then just chucking the undo.
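
Restating the scenario you describe in concrete terms (hypothetical
names throughout):

    CREATE TABLE t (c int);

    -- the aborting transaction: its bulk-loaded rows need undo
    BEGIN;
    COPY t FROM '/tmp/lots_of_data';
    ROLLBACK;           -- undo is queued for background processing

    -- another session gets the lock before the undo worker does; either
    -- the undo must still run once this completes:
    ALTER TABLE t ALTER COLUMN c SET STATISTICS 1000;
    -- ... or the pending undo becomes obsolete:
    DROP TABLE t;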

I'm sure that there are cases like that. And I'm pretty sure that at
least one of the other database systems that I'm thinking of isn't as
naive as I suggest, though I'm not sure of the specifics. The classic
approach is to retain the locks, even though that sucks in some cases.
That doesn't mean that you have to do it that way, but it's probably a
good idea to present your design in a way that compares and contrasts
with the classic approach.

I'm pretty sure that this is related to the way in which other systems
retain coarse-grained locks when bitmap indexes are used, even though
that makes them totally unusable with OLTP apps. It seems like it
would help users a lot if their bitmap indexes didn't come with that
problem, but it's a price that they continue to have to pay.

> Point being - there's at least some chance that the operations which
> block forward progress also represent progress of another sort.

That's good, provided that there isn't observable lock starvation. I
don't think that you need to eliminate the theoretical risk of lock
starvation. It deserves careful, ongoing consideration, though. It's
difficult to codify exactly what I have in mind, but I can give you an
informal definition now: It's probably okay if there is the occasional
implementation-level deadlock because the user got unlucky once.
However, it's not okay for there to be *continual* deadlocks because
the user got unlucky just once. Even if the user had *extraordinarily*
bad luck that one time. In short, my sense is that it's never okay for
the system as a whole to "get stuck" in a deadlock or livelock loop.
Actually, it might even be okay if somebody had a test case that
exhibits "getting stuck" behavior, provided the test case is very
delicate, and looks truly adversarial (i.e. it goes beyond being
extraordinarily unlucky).

I know that this is all pretty hand-wavy, and I don't expect you to
have a definitive response. These are some high level concerns that I
have, that may or may not apply to what you're trying to do.

--
Peter Geoghegan
