SPGist "triple parity" concept doesn't work

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org, hailong(dot)li(at)qunar(dot)com
Subject: SPGist "triple parity" concept doesn't work
Date: 2013-06-06 21:46:45
Message-ID: 5829.1370555205@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been looking into the problem reported at
http://www.postgresql.org/message-id/519A5917.40704@qunar.com
and what I find is that we have spgist insertion operations deadlocking
against each other because one is descending from page A to page B while
the other descends from page B to page A. According to the README file,
the "triple parity" page selection algorithm is supposed to prevent
that:

: While descending the tree, the insertion algorithm holds exclusive lock on
: two tree levels at a time, ie both parent and child pages (parent and child
: pages can be the same, see notes above). There is a possibility of deadlock
: between two insertions if there are cross-referenced pages in different
: branches. That is, if inner tuple on page M has a child on page N while
: an inner tuple from another branch is on page N and has a child on page M,
: then two insertions descending the two branches could deadlock. To prevent
: deadlocks we introduce a concept of "triple parity" of pages: if inner tuple
: is on page with BlockNumber N, then its child tuples should be placed on the
: same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 3.
: This rule guarantees that tuples on page M will have no children on page N,
: since (M+1) mod 3 != N mod 3.

That would work fine as long as the invariant is maintained accurately.
However, there are at least two cases where the existing code fails to
maintain the invariant:

1. In spgAddNodeAction, if the enlarged inner tuple doesn't fit on the
current page anymore, we do this:

/*
* obtain new buffer with the same parity as current, since it will be
* a child of same parent tuple
*/
current->buffer = SpGistGetBuffer(index,
GBUF_INNER_PARITY(current->blkno),
...

That's fine as long as the parent tuple wasn't also on the current page.
If it was on the current page, we end up re-downlinking the parent to a
page having the same parity it has, not one more as it should be.

I tried to fix this like so:

/*
* get a new buffer that has the right parity to store a child of
* the current tuple's parent
*/
current->buffer = SpGistGetBuffer(index,
GBUF_INNER_PARITY(parent->blkno + 1),
...

but that just moves the problem somewhere else: the link from the parent
to the new inner tuple is now guaranteed to follow the parity rules, but
the downlinks leading out of it don't follow them anymore.

2. In spgSplitNodeAction, we split an inner tuple into a "prefix" tuple
that replaces that inner tuple, and a "postfix" tuple that contains the
same downlinks the original tuple did. That's fine as long as we can
fit the postfix tuple on the same page. If we can't, we assign it to a
page that's one parity level below the current page, and then its
outgoing links violate the parity rules. (Keeping the postfix tuple
on the current page wouldn't make things better, since we'd still
violate the parity rules with respect to either the incoming or outgoing
links of the prefix tuple, if it has to go to another page.)

With a few repetitions of either of these cases, and some bad luck
about placement of the new tuples, you get into situations where two
pages each contain downlinks leading to the other; and then a deadlock
is just a matter of time.

I don't immediately see any good way to fix this. I think the "triple
parity" rule as it stands is hopelessly broken, but I don't know what
to replace it with, even granting that we don't need to maintain on-disk
compatibility. (We'd have to tell people to reindex SPGist indexes
anyway, because of the risk that they already contain circular links;
so switching to a new layout rule doesn't seem to add any more pain.)
Or we could try to modify the insertion algorithm so it doesn't lock
two levels of the tree at once, but that seems pretty risky.

Thoughts?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2013-06-06 22:28:10 Re: Freezing without write I/O
Previous Message Andres Freund 2013-06-06 21:38:34 Re: Hard limit on WAL space used (because PANIC sucks)