Re: INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
Date: 2015-03-18 18:41:51
Message-ID: 5509C6EF.5040606@iki.fi
Lists: pgsql-hackers

On 03/18/2015 06:23 PM, Robert Haas wrote:
> On Tue, Mar 17, 2015 at 6:27 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Tue, Mar 17, 2015 at 12:11 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> I'm still not sure the way the speculative locking works is the best
>>> approach. Instead of clearing xmin on super-deletion, a new flag on the heap
>>> tuple seems more straightforward. And we could put the speculative insertion
>>> token in t_ctid, instead of stashing it in the PGPROC array. That would
>>> again seem more straightforward.
>>
>> I see the appeal of that approach. What concerns me about that
>> approach is that it makes it the problem of the inserter to confirm
>> its insertion, even though overwhelmingly, speculative insertions
>> succeed (the race window is small due to the pre-check). The current
>> speculative token is legitimately a very short lived thing, a thing
>> that I hesitate to write to disk at all. So our optimistic approach to
>> inserting speculatively loses its optimism, which I don't like, if you
>> know what I mean.
>
> Modifying a shared buffer is not the same as writing to disk.

AFAICS, there is no need to go and clear the tag after the insert has
completed.

Here's what I had in mind: the inserter tags the tuple with the
speculative insertion token, by storing the token in the t_ctid field.
If the inserter needs to super-delete the tuple, it sets xmax like in a
regular deletion, but also sets another flag to indicate that it was a
super-deletion.
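
To make that concrete, the inserter side would look roughly like this.
This is only a sketch: the token counter and all of the helper names
(SpeculativeInsertionNextToken, HeapTupleHeaderSetSpeculativeToken,
SpeculativeInsertionLockAcquire, HeapTupleHeaderSetSuperDeleted) are made
up for illustration, they are not existing code:

/*
 * Sketch only; the names are invented for illustration.  The token is
 * just a backend-local counter; it only needs to distinguish speculative
 * insertions made by this backend.
 */
static uint32 speculativeInsertionToken = 0;

static uint32
SpeculativeInsertionNextToken(void)
{
	if (++speculativeInsertionToken == 0)	/* 0 is reserved for "no token" */
		speculativeInsertionToken = 1;
	return speculativeInsertionToken;
}

	/* at insertion time, before the tuple becomes visible to anyone: */
	token = SpeculativeInsertionNextToken();
	HeapTupleHeaderSetSpeculativeToken(heaptup->t_data, token);
	SpeculativeInsertionLockAcquire(token);	/* released once the insertion
											 * has completed, or after
											 * super-deletion */

	/* on super-deletion: set xmax like a regular delete, plus the flag */
	HeapTupleHeaderSetXmax(heaptup->t_data, xid);
	HeapTupleHeaderSetSuperDeleted(heaptup->t_data);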

When another backend inserts, and notices that it has a potential
conflict with the first tuple, it tries to acquire a hw-lock on the
token. In most cases, the inserter has long since completed the
insertion, and the acquisition succeeds immediately; you still have to
check, though, because the token is not cleared on a completed insertion. It
would be slightly faster for the second backend if the inserter actually
removed the token after the insertion has completed: it wouldn't need to
check the lock if the tuple was not tagged in the first place. But
that'd be just a tiny optimization, for the rare case that there is a
conflict, and surely not worth it.
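
In code, the second backend's conflict path would be something like the
following (again, the helper names are made up, not existing code):

	/* Sketch only; helper names are invented for illustration. */
	if (HeapTupleHeaderIsSpeculative(tuple->t_data) &&
		TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple->t_data)))
	{
		uint32		token = HeapTupleHeaderGetSpeculativeToken(tuple->t_data);

		/*
		 * Block until the inserter releases the lock on the token, i.e.
		 * until it has either finished the insertion or super-deleted the
		 * tuple.  If the insertion completed long ago, the lock is free and
		 * this returns immediately; we just can't know that without trying,
		 * because the token is never cleared from t_ctid.
		 */
		SpeculativeInsertionLockWait(token);

		/* re-fetch the tuple and re-check whether it was super-deleted */
	}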

Hmm, I just realized a little fly in that ointment: if the inserter
inserts enough tuples to wrap around the token counter (4 billion?) in a
single transaction, another backend that inserts could confuse a new
speculative insertion with one that completed a long time ago. The risk
seems small enough that we could just accept it. I don't think anything
particularly bad would happen on a false positive here. Or we could use
all 48 bits available in the t_ctid to push it truly into the realm of
ain't-gonna-happen. Or we could use (relfilenode, block, token) as the
lock tag for the SpeculativeInsertionLock.
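
For that last variant, the tag fits in the generic LOCKTAG struct, along
the lines of the following (LOCKTAG_SPECULATIVE_INSERTION and the macro
are made up; only the LOCKTAG fields themselves exist today):

/* Sketch only; this lock tag type doesn't exist. */
#define SET_LOCKTAG_SPECULATIVE_INSERTION(locktag, relfilenode, blkno, token) \
	((locktag).locktag_field1 = (relfilenode), \
	 (locktag).locktag_field2 = (blkno), \
	 (locktag).locktag_field3 = (token), \
	 (locktag).locktag_field4 = 0, \
	 (locktag).locktag_type = LOCKTAG_SPECULATIVE_INSERTION, \
	 (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)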

Regarding the physical layout: We can use a magic OffsetNumber value
above MaxOffsetNumber to indicate that the t_ctid field stores a token
rather than a regular ctid value. And another magic t_ctid value to
indicate that a tuple has been super-deleted. The token and the
super-deletion flag are quite ephemeral; they are not needed after the
inserting transaction has completed, so it's nice not to consume the
valuable infomask bits for these things. Those states are conveniently
not possible on an updated tuple, which is when we would need the t_ctid
field for its current purpose.
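
In code, the encoding could be something like this (the constant values
and macros are invented for illustration; ip_blkid and ip_posid are the
existing fields of ItemPointerData):

/* Sketch only; none of these exist today. */

/* magic offset numbers, above MaxOffsetNumber, never valid in a real ctid */
#define SpecTokenOffsetNumber		0xfffe	/* t_ctid holds a token */
#define SuperDeletedOffsetNumber	0xfffd	/* tuple was super-deleted */

#define HeapTupleHeaderIsSpeculative(tup) \
	((tup)->t_ctid.ip_posid == SpecTokenOffsetNumber)

#define HeapTupleHeaderSetSpeculativeToken(tup, token) \
	ItemPointerSet(&(tup)->t_ctid, (token), SpecTokenOffsetNumber)

#define HeapTupleHeaderGetSpeculativeToken(tup) \
	(AssertMacro(HeapTupleHeaderIsSpeculative(tup)), \
	 BlockIdGetBlockNumber(&(tup)->t_ctid.ip_blkid))

#define HeapTupleHeaderSetSuperDeleted(tup) \
	ItemPointerSet(&(tup)->t_ctid, InvalidBlockNumber, SuperDeletedOffsetNumber)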

>>> If I'm reading this correctly, if there are multiple indexes that match the
>>> unique index inference specification, we pick the cheapest one. Isn't that
>>> unstable? Two backends running the same INSERT ON CONFLICT statement might
>>> pick different indexes, and the decision might change over time as the table
>>> is analyzed. I think we should have a more robust rule. Could we easily just
>>> use all matching indexes?
>>
>> Robert feels pretty strongly that there should be a costing aspect to
>> this. Initially I was skeptical of this, but just did what he said all
>> the same. But I've since come around to his view entirely because we
>> now support partial unique indexes.
>
> I think Heikki's concern is something different, although I am not
> altogether up to speed on this and may be confused. The issue is:
> suppose that process A and process B are both furiously upserting into
> the same table with the same key column but, because they have
> different costing parameters, process A consistently chooses index X
> and process B consistently chooses index Y. In that situation, will
> we deadlock, livelock, error out, bloat, or get any other undesirable
> behavior, or is that A-OK?

Right, that's what I had in mind.

- Heikki
