Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date: 2013-09-14 22:27:10
Message-ID: CAM3SWZQuzPfb+BsuiJQFWErvqx8qqNLuBqn_0xZjEcPRmbz6Ug@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I mean, if we do the promise tuple
>> thing, and there are multiple unique indexes, what happens when an
>> inserter needs to block pending the outcome of another transaction?
>> They had better go clean up the promise tuples from the other unique
>> indexes that they're trying to insert into, because they cannot afford
>> to hold value locks for a long time, no matter how they're
>> implemented.
>
> As Andres already pointed out, this is not correct.

While skipping that cleanup isn't incorrect, doing it certainly would
be useful for preventing deadlocks and unnecessary contention. In a
world where people expect either an insert or an update, we ought to
try and reduce contention across multiple unique indexes. I can
understand why that doesn't matter today, though - if you're going to
insert duplicates indifferent to whether or not there will be
conflicts, that's a kind of abuse, and not worth optimizing for - it
seems probable that most such transactions will commit. However, it
seems much less probable that most upserters will insert. People may
well wish to upsert all the time where an insert is hardly ever
necessary, which is one reason why I have doubts about the other
proposals.

Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance, so I think it's hard to question this on correctness
grounds. Even if waiters are released in FIFO order, there is no
reason to assume that the first waiter will win the race with a
second. Most obviously, the second waiter may never get the chance to
block on the same xid at all (so it's not really a waiter at all), and
may still be able to insert, if the blocking xact aborts after the
second "waiter" starts its descent but before it checks uniqueness.
All this even though the second "waiter" may have arrived minutes
after the first.

What I'm talking about here is really unlikely to result in lock
starvation, because the original waiter typically gets to observe the
other waiter go through, and that's reason enough to give up entirely.
Now, it's kind of weird that the original waiter will still end up
blocking on the xid that caused it to wait in the first instance. So
there should be more thought put into that, like remembering the xid
and only waiting on it on a retry, or some similar scheme. Maybe you
could contrive a scenario where this causes lock starvation, but I
suspect you could do the same thing for the present btree insertion
code.
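
Roughly the kind of control flow I mean, as a sketch - every helper
here is a hypothetical stand-in, not anything in the patch or in core:

/* Sketch only: none of these helpers exist anywhere. */
typedef unsigned int TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

extern TransactionId check_unique_and_get_blocker(void);
extern void release_value_locks(void);
extern void wait_for_xact_to_finish(TransactionId xid);
extern void do_insertion(void);

static void
upsert_insert_path(void)
{
    TransactionId remembered = InvalidTransactionId;

    for (;;)
    {
        TransactionId blocker = check_unique_and_get_blocker();

        if (blocker == InvalidTransactionId)
        {
            /* no conflicting xact in flight; go ahead and insert */
            do_insertion();
            return;
        }

        /* never sleep while still holding the value locks */
        release_value_locks();

        /*
         * First time around, just remember who blocked us and retry
         * immediately -- the conflict may already be resolved, or
         * another waiter may have gone through, in which case we want
         * the update path instead.  Only if we come back and find the
         * same xact still in the way do we actually wait on it.
         */
        if (blocker == remembered)
            wait_for_xact_to_finish(blocker);
        else
            remembered = blocker;
    }
}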

> Just to add to
> what he said, we already have long-lasting value locks in the form of
> SIREAD locks. SIREAD can exist at different levels of granularity, but
> one of those levels is index-page-level granularity, where they have
> the function of guarding against concurrent insertions of values that
> would fall within that page, which just so happens to be the same
> thing you want to do here. The difference between those locks and
> what you're proposing here is that they are implemented differently.
> That is why those were acceptable and this is not.

As the implementer of this patch, I'm obligated to put some checks in
unique index insertion that everyone has to care about. There is no
way around that. Complexity issues aside, I think that an argument
could be made for this approach *reducing* the impact on concurrency
relative to the other approaches, provided there aren't too many
unique indexes to deal with, which is the case the vast majority of
the time. I mean, those other approaches necessitate doing so much
more with *exclusive* locks held. Like inserting, maybe doing a page
split, WAL-logging, all with the lock held, and then either updating
in place or killing the promise tuple, and WAL-logging that, with an
exclusive lock held the second time around. Plus searching for
everything twice. I think that frequently killing all of those
broken-promise tuples could have deleterious effects on concurrency
and/or index bloat of the kind only remedied by REINDEX. Do you update
the free space map too? More exclusive locks! Or if you leave it up to
VACUUM (and just set the xid to InvalidTransactionId, which is still
extra work), autovacuum has to care about a new *class* of bloat -
index-only bloat. Plus lots of dead duplicates are bad for performance
in btrees generally.

> As here, that is way more expensive than
> simply grabbing and holding a share-lock on the page. But we get a
> number of important benefits out of it. The backend remains
> interruptible while the tuple is locked, the protocol for granting
> locks is FIFO to prevent starvation, we don't suppress page eviction
> while the lock is held, we can simultaneously lock arbitrarily large
> numbers of tuples, and deadlocks are detected and handled cleanly. If
> those requirements were negotiable, we would surely have negotiated
> them away already, because the performance benefits would be immense.

False equivalence. We only need to lock as many unique index *values*
(not tuples) as are proposed for insertion per slot (which can be
reasonably bound), and only for an instant. Clearly it would be
totally unacceptable if tuple-level locks made backends
uninterruptible indefinitely. Of course, this is nothing like that.

>> If the value locks were made interruptible through some method, such
>> as the promise tuples approach, does that really make deadlocking
>> acceptable?
>
> Yes. It's not possible to prevent all deadlocks. It IS possible to
> make sure that they are properly detected and that precisely one of
> the transactions involved is rolled back to resolve the deadlock.

You seem to have misunderstood me here, or perhaps I was unclear. I'm
referring to deadlocks that cannot really be predicted or analyzed by
the user at all - see my comments below on insertion order.

>> "I don't think non-interruptibility is a problem"? Really, do you think
>> that this kind of inflammatory rhetoric helps anybody? I said nothing
>> of the sort. I recall saying something about an engineering trade-off.
>> Of course I value interruptibility.
>
> I don't see what's inflammatory about that statement.

The fact that you stated, without any qualification, that I don't
think non-interruptibility is a problem, obviously.

> Interruptibility is not a nice-to-have that we
> can trade away from time to time; it's essential and non-negotiable.

I seem to recall you saying something about the Linux kernel and their
attitude to interruptibility. Yes, interruptibility is not just a
nice-to-have; it is essential. However, without dismissing your other
concerns, I have yet to hear a convincing argument as to why anything
I've done here is going to make any difference to interruptibility
that would be appreciable to any human. So far it's been a
slippery-slope argument that could equally well be used to argue
against some facet of almost any substantial patch ever proposed. I
just don't think that marginally regressing interruptibility is
*necessarily* sufficient justification for rejecting an approach
outright. FYI, *that's* how I value interruptibility generally.

>> In contrast, what I've proposed here is in general quite unlikely to
>> result in any I/O for the duration of the time the locks are held.
>> Only writers will be blocked. And only those inserting into a narrow
>> range of values around the btree leaf page. Much of the work that even
>> those writers need to do will be unimpeded anyway; they'll just block
>> on attempting to acquire an exclusive lock on the first btree leaf
>> page that the value they're inserting could be on.
>
> Sure, but you're talking about broadening the problem from the guy
> performing the insert to everybody who might be trying to do an insert
> that hits one of the same unique-index pages.

In general, that isn't much worse than just blocking the value
directly. The number of possible values that could also be blocked is
quite low, and the chances of it actually mattering that those
additional values are locked, in the still-small window in which the
buffer locks are held, are generally fairly low, particularly on
larger tables where there is naturally a large number of possible
distinct values. I will concede, however, that the impact on inserters
that want to insert a non-locked value that belongs on the locked page
or its child might be worse, but it's already a problem that inserted
index tuples can all end up on the same page, if not to the same
extent.

> Instead of holding one
> buffer lock, the guy performing the insert is now holding as many
> buffer locks as there are indexes. That's a non-trivial issue.

Actually, as many buffer locks as there are *unique* indexes. It might
be a non-trivial issue, but this whole problem is decidedly
non-trivial, as I'm sure we can all agree.

> For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
> you'll error out. In fact, if you get the number of indexes exactly
> right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
> panic the whole system.

Oh, come on. We can obviously engineer a solution to that problem. I
don't think I've ever seen a table with close to 100 *unique* indexes;
4 or 5 is a very high number. If we just raised an error when someone
tried to do this with more than 10 unique indexes, I would guess that
we'd get exactly zero complaints about it.
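
To be concrete, all I have in mind is a hard cap along these lines, as
a fragment inside the insertion path - the constant name, its value,
and num_unique_indexes are invented for illustration; the only real
requirement is staying comfortably below MAX_SIMUL_LWLOCKS:

/* Illustration only; not from the patch. */
#define MAX_LOCKED_UNIQUE_INDEXES   10

    if (num_unique_indexes > MAX_LOCKED_UNIQUE_INDEXES)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot insert with value locks on more than %d unique indexes",
                        MAX_LOCKED_UNIQUE_INDEXES)));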

> Oh, and if different backends load the index list in different orders,
> because say the system catalog gets vacuumed between their respective
> relcache loads, then they may try to lock the indexes in different
> orders and cause an undetected deadlock.

Undetected deadlock is really not much worse than detected deadlock
here. Either way, it's a bug. And it's something that any kind of
implementation will need to account for. It's not okay to
*unpredictably* deadlock, in a way that the user has no control over.
Today, someone can do an analysis of their application and eliminate
deadlocks if they need to. That might not be terribly practical much
of the time, but it can be done. It certainly is practical to do it in
a localized way. I wouldn't like to compromise that.

So yes, you're right that I need to control for this sort of thing
better than in the extant patch, and in fact this was discussed fairly
early on. But it's an inherent problem.
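
One fairly obvious way to control for it is to always take the
per-index value locks in a canonical order, say ascending index OID,
no matter what order the relcache happened to return the index list
in. A minimal standalone sketch of that idea - lock_index_value() is a
made-up stand-in, not anything in the patch:

#include <stdlib.h>

typedef unsigned int Oid;               /* stand-in for the real typedef */

extern void lock_index_value(Oid indexoid); /* hypothetical value-lock call */

static int
oid_cmp(const void *a, const void *b)
{
    Oid     oa = *(const Oid *) a;
    Oid     ob = *(const Oid *) b;

    return (oa < ob) ? -1 : (oa > ob) ? 1 : 0;
}

/*
 * Acquire the per-unique-index value locks in ascending OID order.  Two
 * backends that both do this can never lock the same pair of indexes in
 * opposite orders, which removes this particular undetected-deadlock
 * risk regardless of relcache ordering.
 */
static void
lock_unique_indexes_in_order(Oid *indexoids, int nindexes)
{
    int     i;

    qsort(indexoids, nindexes, sizeof(Oid), oid_cmp);

    for (i = 0; i < nindexes; i++)
        lock_index_value(indexoids[i]);
}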

> And, drifting a bit further off-topic, even to get as far as you have,
> you've added overhead to every lwlock acquisition and release, even
> for people who never use this functionality.

If you look at the code, you'll see that I've made very modest
modifications to LWLockRelease only. I would be extremely surprised if
the overhead was not lost in the noise; I expect it would be
completely impossible to detect through any conventional benchmark.
These are the same kind of very modest changes that were made for
LWLockAcquireOrWait(), and you said nothing about those at the time,
despite the fact that you now appear to think that whole effort was
largely a waste of time.

> That's true but somewhat misleading. Textually most of the function
> holds the buffer lock, but heap_prepare_insert(),
> CheckForSerializableConflictIn(), and RelationGetBufferForTuple(), and
> XLogWrite() are the parts that do substantial amounts of computation,
> and only the last of those happens while holding the buffer lock.

I've already written modifications so that I don't have to do
heap_prepare_insert() with the locks held. There is no reason to call
CheckForSerializableConflictIn() with the additional locks held
either; after all, "For a heap insert, we only need to check for
table-level SSI locks". As for RelationGetBufferForTuple(), yes, the
majority of the time it will have to do very little work before
acquiring an exclusive lock, because it's going to get its target
block from the last place a heap tuple was inserted.
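
To spell the intended ordering out, here's a rough sketch only - not
the actual patch - with every helper a hypothetical stand-in for the
steps named above:

extern void prepare_tuple(void);           /* toast/fill headers; no locks */
extern void check_table_level_ssi(void);   /* table-level SSI check; no locks */
extern void take_value_locks(void);        /* short-lived unique-index value locks */
extern void lock_target_heap_buffer(void); /* usually the cached last-insert block */
extern void insert_and_wal_log(void);      /* the actual critical-section work */
extern void release_all_locks(void);

static void
heap_insert_ordering_sketch(void)
{
    /* Everything that can happen before any of the extended locks: */
    prepare_tuple();
    check_table_level_ssi();

    /* Only now take the short-lived value locks and the buffer lock: */
    take_value_locks();
    lock_target_heap_buffer();

    /* The only work that actually has to happen under the locks: */
    insert_and_wal_log();

    release_all_locks();
}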

--
Peter Geoghegan
