Re: INSERT ... ON CONFLICT UPDATE and logical decoding

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: INSERT ... ON CONFLICT UPDATE and logical decoding
Date: 2015-03-04 02:48:15
Message-ID: CAM3SWZT7sSR26+XFLHhaso6KqRTOUWthOBVJxA517X1ioY+WgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 3, 2015 at 3:13 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> toast_save_datum() is called with a heap_insert() call before heap
>> insertion for the tuple proper. We're relying on the assumption that
>> if there is no immediate super deletion record, things are fine. We
>> cannot speculatively insert into catalogs, BTW.
>
> I fail to see what prohibits e.g. another catalog modifying transaction
> to commit inbetween and distribute a new snapshot.

SnapBuildDistributeNewCatalogSnapshot()/REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT
does look like a problematic case. It's problematic specifically
because it involves some xact queuing a change to *some other* xact -
we cannot assume that this won't happen between WAL-logging within
heap_insert() and heap_delete(). Can you think of any other such
cases?

I think that REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID should be fine.
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID is not an issue either. In
both cases I believe my assumption does not break because there won't
be writes to system catalogs between the relevant heap_insert() and
heap_delete() calls, which are tightly coupled, and also because
speculative insertion into catalogs is unsupported. That just leaves
the non-"*CHANGE_INTERAL_* "(regular DML) cases, which should also be
fine.

As for SnapBuildDistributeNewCatalogSnapshot(), I'm open to
suggestions. Do you see any opportunity around making assumptions
about heavyweight locking making the interleaving of some
REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT change between a
REORDER_BUFFER_CHANGE_INTERNAL_INSERT change and a
REORDER_BUFFER_CHANGE_INTERNAL_DELETE change? AFAICT, that's all this
approach really needs to worry about (or the interleaving of something
else not under the control of speculative insertion - doesn't have to
be a REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT, that's just the most
obvious problematic case).

>> Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case
>> next, in the case that we need to care about (when the tuple was super
>> deleted immediately afterwards)?
>
> It's irrelevant whether you care about it or not. Your
> ReorderBufferIterTXNNext() consumes a change that needs to actually be
> processed. It could very well be something important like a new
> snapshot.

But it is actually processed, in the next iteration (we avoid calling
ReorderBufferIterTXNNext() at the top of the loop if it has already
been called for that iteration as part of peeking ahead). AFAICT all
that is at issue is the safety of one particular assumption I've made:
that it is safe to assume that there will never be a super deletion in
the event of not seeing a super deletion change immediately following
a speculative insertion change within some xact when decoding. It's
still going to be processed if it's something else. The implementation
will, however, fail to consolidate the speculative insertion change
and super deletion change if my assumption is ever faulty.

This whole approach doesn't seem to be all that different to how there
is currently coordination within ReorderBufferCommit() between
TOAST-related changes, and changes to the relation proper. In fact,
I've now duplicated the way the IsToastRelation() case performs
"dlist_delete(&change->node)" in order to avoid chunk reuse in the
event of spilling to disk. Stress testing by decreasing
"max_changes_in_memory" to 1 does not exhibit broken behavior, I
believe (although that does break the test_decoding regression tests
on the master branch, FWIW). Also, obviously I have not considered the
SnapBuildDistributeNewCatalogSnapshot() case too closely.

I attach an updated patch that I believe at least handles the
serialization aspects correctly, FYI. Note that I assert that
REORDER_BUFFER_CHANGE_INTERNAL_DELETE follows a
REORDER_BUFFER_CHANGE_INTERNAL_INSERT, so if you can find a way to
break the assumption it should cause an assertion failure, which is
something to look out for during testing.

>> While what I have here *is* very ugly, I see no better alternative. By
>> doing what you suggest, we'd be finding a special case way of safely
>> violating the general "WAL log-before-data" rule.
>
> No, it won't. We don't WAL log modifications that don't need to persist
> in a bunch of places. It'd be perfectly fine to start upsert with a
> 'acquiring a insertion lock' record that pretty much only contains the
> item pointer (and a FPI if necessary to prevent torn pages). And then
> only log the full new record once it's sure that the whole thing needs
> to survive. Redo than can simply clean out the size touched by the
> insertion lock.

That seems like a lot of work in the general case to not do something
that will only very rarely need to occur anyway. The optimistic
approach of value locking scheme #2 has a small race that is detected
(which necessitates super deletion), but that will only very rarely be
required. Also, there is value in making super deletion (and
speculative insertion) as close as possible to regular deletion and
insertion.

--
Peter Geoghegan

Attachment Content-Type Size
0004-Logical-decoding-support-for-ON-CONFLICT-UPDATE.patch text/x-patch 8.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-03-04 02:49:21 Re: Bootstrap DATA is a pita
Previous Message Kouhei Kaigai 2015-03-04 02:41:58 Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)