Re: Serializable snapshot isolation patch

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Davis" <pgsql(at)j-davis(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serializable snapshot isolation patch
Date: 2010-10-18 18:26:07
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

First off, thanks for the review! I know that it's a lot of work,
and I do appreciate it.

Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> * Trivial stuff:
> I get a compiler warning:
> indexfsm.c: In function *RecordFreeIndexPage*:
> indexfsm.c:55: warning: implicit declaration of function
> *PageIsPredicateLocked*

Oops. Fixed on my git repo now.

> * Open issues, as I see it:
> 1. 2PC and SSI don't mix (this may be a known issue, because
> there's not really any code in the current implementation to deal
> with 2PC):

> Looks like we need to track information about prepared
> transactions in shared memory. I think you'll need to keep the
> information in the 2PC state file as well, so that it can be
> rebuilt after a crash or restart. It all looks solvable at first
> glance, but it looks like it might be some work.

Argh! I didn't anticipate an interaction with prepared
transactions, probably because I've never used them or taken a close
look at them. I'll take a look.

> 2. I think there's a GiST bug (illustrating with PERIOD type):

> Can you clarify your design for GiST and the interaction with
> page-level locks? It looks like you're making some assumption
> about which pages will be visited when searching for conflicting
> values which doesn't hold true. However, that seems odd, because
> even if the value is actually inserted in one transaction, the
> other doesn't seem to find the conflict. Perhaps the bug is
> simpler than that? Or perhaps I have some kind of odd bug in
> PERIOD's gist implementation?

My assumptions for GiST were that:

(1) A search for a matching value could bail out at any level in the
tree; there is no requirement for the search to proceed to the leaf
level to get a negative index probe.

(2) An index insertion which would cause a row to be found if an
earlier search was repeated must modify some page which was read by
the earlier search.

(3) Because of the above, it is necessary and sufficient to acquire
an SIRead lock all pages visited in a search of a GiST index, and
flag a conflict on insertion into a locked page at any level of the

That logic still seems sound to me, so if someone else sees a flaw
in it, please point it out. Assuming that logic is sound, I'll poke
around to see where the flaw in implementation may be. If you have
a full self-contained test case to demonstrate the failure here,
could you send it to me?

> Also, it appears to be non-deterministic, to a degree at least, so
> you may not observe the problem in the exact way that I do.

Yeah, I have tested this area without seeing the failure , so that
self-contained example would be a big help.

> 3. Limited shared memory space to hold information about committed
> transactions that are still "interesting". Relevant thread:
> It's a challenging problem, however, and the current solution is
> less than ideal.

I'd go further than that and say that it clearly needs to be fixed.
The scale of the issue was somewhat disguised in my testing when I
was using the hash table for managing acquisition of shared memory
for what was essentially an unordered list. Now that I've made it a
proper list with a hard limit on entries, the problem is pretty easy
to provoke, and just reserving space for A Very Large Number Of
Entries is clearly not an adequate solution. :-(

> Idle transactions can mean that all new serializable transactions
> fail until the idle transactions start to terminate. I don't like
> that very much, because people expect to have to retry
> serializable transactions, but retrying here has no real hope
> (except that some time has elapsed, and maybe the other
> transaction decided to commit).


> Does it make sense to kill the existing transactions that are
> holding everything up, rather than the new transaction? Or would
> that just confuse matters more? This does not necessarily
> guarantee that progress can be made, either, but intuitively it
> seems more likely.

Canceling an old transaction to allow new transactions to begin
*might* be better than the current situation, but I think we can and
should do better. The best I have been able to come up with is in
this post:

There's a fair amount of work there, so I was hoping for some
confirmation that this combination of steps was both sane and had
some chance of being considered sufficient and acceptable before
diving in to code it. I also *really* hope to add the "SERIALIZABLE
READ ONLY DEFERRABLE" mode so that pg_dump and other read-only
transactions don't push things into a state where the rollback rate

> 4. A few final details:
> a. We should probably have a compatibility GUC that makes
> SERIALIZABLE equal to REPEATABLE READ. My opinion is that this
> should be only for compatibility, and should default to "off"
> (i.e. SSI code enabled) either in 9.1 or soon after.

I'm inclined to agree with you, with the strong preference for a 9.1
setting of off. This was previously discussed, and there were
people who felt that we should avoid a "behavior-changing" GUC like
that, so I didn't add it. It wouldn't be hard to put it in, if
that's the community consensus.

> b. Docs.
> * Questions:
> 1. For TargetTagIsCoveredBy(), why is it impossible for the
> covering tag to have an offset?

Because a "covering" lock is a lock at a coarser granularity than
the "covered" lock, which includes what the finer-grained lock
covers. Since the tuple level is as fine as we take it, and we only
use the offset for tuple locks, a covering lock would not have that
set. (We also check for duplicate locks.) I'll expand that comment
a bit, since it obviously wasn't clear enough.

> 2. The explanation for SerializablePredicateLockListLock is a
> little confusing. It makes it sound like other processes can't
> walk the list, but they can modify it?

That's a bit unconventional, I admit, but it was a big part of
bringing the benchmarks for a fully cached, read-only transaction
load down to running just 1.8% longer than REPEATABLE READ (which is
the same as current SERIALIZABLE). I'm open to other locking
strategies, or an explanation of where this one doesn't work, but
I'd want to benchmark carefully. Since you took it to mean the
right thing, but found that surprising enough to doubt that it was
accurate, do you have any particular suggestions about how to
clarify it? (Maybe just, "Yeah, that's unconventional, but it works
and is faster than the alternatives we've tried."?)

> * Summary
> Great patch!


> I didn't make it through the patch in as much detail as I would
> have liked, because the theory behind it is quite complex and it
> will take longer for me to absorb.

Yeah, there are some mind-bending ideas in there. If I recall
correctly, Dan said he spent one week reading the academic papers on
which it was based and another week reading the patch before he felt
comfortable starting to code the parts he wrote. I spent a lot more
time than that reading papers, and I still didn't quite have my
head around some of the concepts until I went back to some of
Fekete's work which led up to this innovation. While many of the
concepts go back to the '80s and '90s, I found Fekete's work
"clicked" with me better than some of the earlier, more theoretical
work because he was looking at real-world examples of how these
manifest in actual, working production software and what techniques
could be used to identify and mitigate the problems. Sometimes the
theory is hard for me to properly digest without such concrete

> But the implementation looks good and the use case is very
> important.

It certainly is for our shop, and I've heard from others who are
very interested in it. I think, that it's generally of more
benefit to "big shops" than situations where you have just a few
programmers coding against a schema with just a few dozen tables;
although it's also somewhat a matter of style: I would rather deal
with *one* issue (serialization failures) in one place than scatter
defensive code all over my application codebase, even on a moderate

Anyway, given the issues raised and the date, it seems reasonable to
mark this as "Returned with Feedback" for CF management purposes.
I'll get fixes out as soon as I can.


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2010-10-18 18:26:32 Re: create tablespace fails silently, or succeeds improperly
Previous Message Bruce Momjian 2010-10-18 18:20:02 Re: create tablespace fails silently, or succeeds improperly