Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date: 2013-11-19 13:13:51
Message-ID: 528B640F.50601@vmware.com
Lists: pgsql-hackers

On 19.11.2013 02:37, Peter Geoghegan wrote:
> On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> * It should be usable and perform well for both large batch updates and
>> small transactions.
>
> I think that that's a secondary goal, a question to be considered but
> perhaps deferred during this initial effort. I agree that it certainly
> is important.

Ok. Which use case are you targeting during this initial effort, batch
updates or small OLTP transactions?

>> Anything else I'm missing?
>
> I think so, yes. I'll add:
>
> * Should not deadlock unreasonably.
>
> If the UPDATE case is to work and perform almost as well as a regular
> UPDATE, that must mean that it has essentially the same
> characteristics as plain UPDATE. In particular, I feel fairly strongly
> that it is not okay for upserts to deadlock with each other unless the
> possibility of each transaction locking multiple rows (in an
> inconsistent order) exists.

Agreed.

>> What about exclusion constraints? I'd like to see this work for them as
>> well. Currently, exclusion constraints are checked after the tuple is
>> inserted, and you abort if the constraint was violated. We could still
>> insert the heap and index tuples first, but instead of aborting on
>> violation, we would kill the heap tuple we already inserted and retry. There
>> are some complications there, like how to wake up any other backends that
>> are waiting to grab a lock on the tuple we just killed, but it seems doable.
>
> I agree that it's at least doable.
>
>> That would, however, perform badly and leave garbage behind if there are
>> duplicates. A refinement of that would be to first check for constraint
>> violations, then insert the tuple, and then check again. That would avoid
>> the garbage in most cases, but would perform much more poorly when there are
>> no duplicates, because it needs two index scans for every insertion. A
>> further refinement would be to keep track of how many duplicates there have
>> been recently, and switch between the two strategies based on that.
>
> Seems like an awful lot of additional mechanism.

Not really. Once you have the code in place to do the
kill-inserted-tuple dance on a conflict, all you need to add is an extra
index search before it. And once you have that, it's not hard to add
some kind of heuristic to decide whether to do the pre-check or skip it.
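
Something like this, just to illustrate the idea (the counters and all
the names here are made up; nothing like this exists in either patch):

#include <stdbool.h>

/* Hypothetical per-relation counters. */
typedef struct UpsertStats
{
    unsigned long attempts;     /* recent insertion attempts */
    unsigned long conflicts;    /* how many of those hit a duplicate */
} UpsertStats;

/*
 * Decide whether to do the extra index search before inserting. When
 * duplicates have been common recently, the pre-check saves us from
 * inserting heap and index tuples that we'd only have to kill again;
 * when they've been rare, it's pure overhead and we can skip it.
 */
static bool
should_precheck(const UpsertStats *stats)
{
    if (stats->attempts < 100)
        return true;    /* too little history; pre-check to be safe */

    /* pre-check if more than ~10% of recent attempts conflicted */
    return stats->conflicts * 10 > stats->attempts;
}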

>> That cost of doing two scans could be alleviated by using markpos/restrpos
>> to do the second scan. That is presumably cheaper than starting a whole new
>> scan with the same key. (markpos/restrpos don't currently work for non-MVCC
>> snapshots, so that'd need to be fixed, though)
>
> Well, it seems like we could already use a "pick up where you left
> off" mechanism for regular btree index tuple insertions into unique
> indexes -- after all, we don't do that today when we block pending the
> outcome of another transaction (one that inserted a duplicate that we
> need to conclusively know has or has not committed). The fact that
> this doesn't already exist leaves me less than optimistic about the
> prospect of making it work to facilitate a scheme such as the one you
> describe here. (Today, we can only catch a committed version of a
> tuple that would make ours a duplicate with a fresh index scan,
> started *after* waiting for the other transaction to commit/abort at
> the end of our original scan.) So we're already pretty naive about
> this, even though it would pay not to be.

We just haven't bothered to optimize for the case where you have to
wait; that's going to be slow anyway. Also, after sleeping, the
insertion position might have moved far to the right if many insertions
happened during the sleep, so it might be best to start a new scan
anyway.

> Making something like markpos work for the purposes of an upsert
> implementation seems not only hard, but also like a possible
> modularity violation. Are we not unreasonably constraining the
> implementation going forward? My patch respects the integrity of the
> am abstraction, and doesn't really add any knowledge to the core
> system about how amcanunique index methods might go about implementing
> the new "amlock" method. The core system worries a bit about the "low
> level locks" (as it naively refers to value locks), and doesn't
> consider that it has the right to hold on to them for more than an
> instant, but that's about it. Plus we don't have to worry about
> whether something does or does not work for a certain snapshot type
> with my approach, because as with the current unique index btree
> coding, it operates at a lower level than that, and does not need to
> consider visibility as such.
>
> The markpos and restrpos am methods are only called for regular index
> (and index-only) scans, which don't need to worry about things that
> are not visible. Of course, upsert needs to worry about
> invisible-but-conclusively-live things. This seems much harder, and
> basically implies value locking of some kind, if I'm not mistaken. So
> have you really gained anything?

I probably shouldn't have mentioned markpos/restrpos, you're right that
it's not a good idea to conflate that with index insertion.
Nevertheless, some kind of an API for doing a duplicate-key check prior
to insertion, and remembering the location for the actual insert later,
seems sensible. It's certainly no more of a modularity violation than
the value-locking scheme you're proposing.

What I'm thinking of is a new indexam function; let's call it
"pre-insert". The pre-insert function checks for any possible unique-key
violations, just like insertion, but doesn't modify the index. As an
optimization, it can also remember the position where the insertion
would later go, and return an opaque token representing that position.
That token can be passed to the insert function later, which can use it
to quickly re-find the insert position. In other words, it's very
similar to the index_lock function you're proposing, but it doesn't keep
the page locked.
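
To make that concrete, the API could look roughly like this. Just a
sketch: the names, the token contents and the stand-in types are all
invented for illustration, not taken from either patch:

typedef unsigned int BlockNumber;
typedef unsigned short OffsetNumber;

/* Stand-ins for the real PostgreSQL types, to keep the sketch
 * self-contained. */
typedef struct RelationData *Relation;
typedef struct ItemPointerData ItemPointerData;

/*
 * Opaque token returned by the pre-insert call: enough information to
 * re-find the insertion point quickly, plus a way to detect that the
 * page changed underneath us, since no lock is held in between.
 */
typedef struct PreInsertToken
{
    BlockNumber   blkno;    /* page the insertion would go to */
    OffsetNumber  offnum;   /* position within that page */
    unsigned long page_lsn; /* to detect concurrent modifications */
} PreInsertToken;

/*
 * Check for conflicts without modifying the index. Returns false and
 * sets *conflict if a conflicting entry exists; otherwise fills *token
 * so that the later insertion can re-find the position cheaply.
 */
extern bool index_preinsert(Relation index, const void *values,
                            ItemPointerData *conflict,
                            PreInsertToken *token);

/*
 * Do the actual insertion, starting from the remembered position. Must
 * re-check for conflicts, because the page was not kept locked.
 */
extern bool index_insert_from_token(Relation index, const void *values,
                                    ItemPointerData *heap_tid,
                                    const PreInsertToken *token);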

>> And that detour with exclusion constraints takes me back to the current
>> patch :-). What if you implemented the unique check in a similar fashion too
>> (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
>> and mark the position. Then do the insertion to that position. If the
>> insertion fails because of a duplicate key (which was inserted after we did
>> the first scan), mark the heap tuple as dead, and start over. The indexam
>> changes would be quite similar to the changes you made in your patch, but
>> instead of keeping the page locked, you'd only hold a pin on the target page
>> (if even that). The first indexam call would check that the key doesn't
>> exist, and remember the insert position. The second call would re-find the
>> previous position, and insert the tuple, checking again that there really
>> wasn't a duplicate key violation. The locking aspects would be less scary
>> than your current patch.
>>
>> I'm not sure if that would perform as well as your current patch. I must
>> admit your current approach is pretty optimal performance-wise. But I'd like
>> to see it, and that would be a solution for exclusion constraints in any
>> case.
>
> I'm certainly not opposed to making something like this work for
> exclusion constraints. Certainly, I want this to be as general as
> possible. But I don't think that it needs to be a blocker, and I don't
> think we gain anything in code footprint by addressing that by being
> as general as possible in our approach to the basic concurrency issue.
> After all, we're going to have to repeat the basic pattern in multiple
> modules.

Well, I don't know what to say. I *do* have a hunch that we'd gain much
in code footprint by making this general. I don't understand what
pattern you'd need to repeat in multiple modules.

Here's a patch, implementing a rough version of the scheme I'm trying to
explain. It's not as polished as yours, but it ought to be enough to
evaluate the code footprint and performance. It doesn't make any changes
to the indexam API, and it works the same with exclusion constraints and
unique constraints. As it stands, it doesn't leave bloat behind, except
when a concurrent insertion with a conflicting key happens between the
first "pre-check" and the actual insertion. That should be rare in practice.

What have you been using to performance test this?

> With exclusion constraints, we'd have to worry about a single slot
> proposed for insertion violating (and therefore presumably obliging us
> to lock) every row in the table. Are we going to have a mechanism for
> spilling a tid array potentially sized in gigabytes to disk (relating
> to just one slot proposed for insertion)? Is it principled to have
> that one slot project out rejects consisting of (say) the entire
> table? Is it even useful to lock multiple rows if we can't really
> update them, because they'll overlap each other when all updated with
> the one value?

Hmm. I think what you're referring to is the case where you try to
insert a row that violates an exclusion constraint in a way that
conflicts with a large number of existing tuples. For example, if you
have a calendar application with a constraint that two reservations
must not overlap, and you try to insert a new reservation that covers,
say, a whole decade.

That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we
would need to handle a large TID array. Or maybe we can arrange it so
that the tuples are locked as we scan them, without having to collect
them all in a large array.

(the attached patch only locks the first existing tuple that conflicts;
that needs to be fixed)
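
Locking the tuples as we scan them might look something like this;
again just a sketch with invented names:

#include <stdbool.h>

typedef struct RelationData *Relation;
typedef struct ConflictScanData *ConflictScan;
typedef unsigned long Tid;

/* Stand-in helpers around the exclusion-constraint conflict scan. */
extern ConflictScan begin_conflict_scan(Relation rel, const void *newrow);
extern bool next_conflicting_tid(ConflictScan scan, Tid *tid);
extern void lock_tuple_for_update(Relation rel, Tid tid);
extern void end_conflict_scan(ConflictScan scan);

/*
 * Lock every existing row that conflicts with the proposed row, one at
 * a time as the scan produces them. Memory use stays constant no
 * matter how many rows conflict, instead of accumulating a TID array
 * that could grow to gigabytes.
 */
void
lock_conflicting_rows(Relation rel, const void *newrow)
{
    ConflictScan scan = begin_conflict_scan(rel, newrow);
    Tid tid;

    while (next_conflicting_tid(scan, &tid))
        lock_tuple_for_update(rel, tid);

    end_conflict_scan(scan);
}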

RETURNING REJECTS is not an issue here, as that just returns the
rejected rows we were about to insert, not the existing rows in the table.

- Heikki

Attachment Content-Type Size
insert_on_dup-kill-on-conflict-1.patch text/x-diff 75.5 KB
