On Mon, Apr 04, 2005 at 07:08:11PM +0100, Simon Riggs wrote:
> IIRC there is not another major system that spills locks to disk and
> there's a big reason: performance is very poor. Other systems accept
> some limitations in order to avoid that. Oracle takes locks and holds
> them within the block itself, which then need to have block cleanup
> performed on them next time they're read. DB2 has a lock table which
> doesn't grow in size or spill to disk, ever.
We can have the in-memory hash table grow to be "enough for most
applications"; say, have it store 10000 locks without resorting to
spilling. As a secondary storage there would be the in-memory buffers
for the spill area; comfortably another 5000 locks. Because there is no
need to have the lock table persist across system crashes (except in
case we implement 2PC), there's no need to fsync() or XLog this
information. So you would not touch disk until you have acquired 15000
locks or so. This doesn't really address your concern, because for
spilling logic to work we may need to evict any transaction's lock, not
just our own. Thus we can't guarantee that only the heavy locker's
locks are spilled (spilt?)
Now, I talked to Bruce via IM the other day and he doesn't like the idea
of using the regular lock manager to lock tuples. It's too heavy, he
says. So he proposed using "phantom Xids" instead. (Some of you may
remember the phantom CommandIds back in the subtransactions day; these
are quite different, so they have none of the problems the others had.)
This idea is explained below.
In that chat, we also commented on a third possible solution: using a
lock manager process, instead of using shared memory for locks. I'm not
sure if this is really workable; there needs to be some means of
communication beyond simple signals for this to be effective, I think.
Bruce didn't seems very fond of this idea; at least he seemed way more
attracted to the phantoms.
So, we now have four ideas for solving the shared-row locking
A. Using the lock manager to lock tuples.
This requires some sort of spill-to-disk logic.
B. Not using the lock manager to lock tuples.
This can be done by using Bruce's Phantom Xids idea.
C. Using a hybrid approach.
This can be done using the unused space in heap pages, as proposed by
Paul Tillotson, and falling back to the lock manager. Thus we still
need some spilling logic. I'm still not clear how would this work
given the current page layout and XLog requirements.
D. Using a lock manager process. With a separate process, we don't need
any of this, because it can use the amount of memory it sees fit. We
may have some communicational overhead which may make the idea
unfeasible. But of course, we would retain a shared memory area for
"caching" (so we wouldn't need to touch the process until we are holding
lots of locks.) Does anyone thinks this is an idea worth pursuing? Or
rather, does anyone see huge showstoppers here?
Using Phantom Xids
The idea here is to use an approach similar to what we use now: mark the
tuples with an Xid when it is locked. A phantom Xid is a sort-of Xid,
with multiple real Xids associated to it. So we mark the tuple with the
regular Xid the first time the share lock is acquired; if a second
transaction wants to lock the tuple, it creates a new phantom Xid which
"contains" the original Xid in the tuple and its own Xid, insert it into
the phantom Xid table, and mark the tuple with that as Xmax.
Thus, the operations we need to implement this are:
1. Create new phantom Xid with two real Xids
2. Find the phantom Xid which has some combination of real Xids
3. Add a real Xid to a phantom Xid
4. Cleanup the whole thing
Bruce proposes using two hash tables for this: one with the phantom Xid
as the key, and other with XOR(real xids) as key. Thus we could implement
1, 2 and 3 easily. Cleanup would happen when somebody else visits the
phantom Xid and finds it populated with non-running transactions (it can
clean the Xmax marking as well, if the phantom Xid ends up empty), and
by VACUUM so that we don't have problems with Xid wraparound.
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"La vida es para el que se aventura"
In response to
pgsql-hackers by date
|Next:||From: Joshua D. Drake||Date: 2005-04-04 20:32:39|
|Subject: Re: [HACKERS] plPHP in core?|
|Previous:||From: Andrew Dunstan||Date: 2005-04-04 20:17:59|
|Subject: Re: [HACKERS] plPHP in core?|