From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Shared row locking
Date: 2004-12-17 00:54:14
Message-ID: 20041217005414.GA6180@surnet.cl

Hi,

I've been thinking about how to do shared row locking. These are some
very preliminary ideas on the issue. Please comment, particularly if
any part of it sounds unworkable or too incomplete.

There are several problems to be solved here: the grammar, the internal
SelectStmt representation, how to store and share the info between
backends, how to clean up at transaction end, and how to clean up at
backend crash.

The Grammar
===========

The SQL spec does not say anything in this respect (that I can find).
It only talks about "FOR UPDATE" and "FOR READ ONLY". However, because
the FK code uses SPI to do the locking, we definitely have to expose
the functionality through SQL. So I think we need a new clause, which
I propose to be "FOR SHARE".

The Parser and SelectStmt
=========================

The parser currently uses the for_update_clause and
opt_for_update_clause nonterminals. I assume it's best to change them
to a new locking_clause, which can in turn be a for_update_clause or a
new for_share_clause.

SelectStmt currently has a forUpdate field (a List of the to-be-locked
tables, or an empty list meaning all of them). We could simply add
another list, say forShare, or use a common list plus a flag saying
which of the two it is. I prefer adding a new list. (Same with the
Query node.)
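
To make that concrete, here is a minimal sketch of the "second list"
alternative against an 8.0-era parsenodes.h; the forShare field name is
mine and only illustrates the idea:

/*
 * Sketch only: NodeTag and List come from nodes/nodes.h and
 * nodes/pg_list.h; all other fields of SelectStmt are elided.
 * The clause itself would be written e.g.
 *     SELECT ... FROM tab WHERE ... FOR SHARE;
 */
typedef struct SelectStmt
{
    NodeTag     type;
    /* ... targetList, fromClause, whereClause, etc. ... */
    List       *forUpdate;  /* FOR UPDATE: to-be-locked tables, or
                             * empty list meaning all of them */
    List       *forShare;   /* proposed FOR SHARE: same convention */
} SelectStmt;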

How to Store the Info
=====================

This is the really interesting part. I have two ideas: one is mine
(btree) and the other is Bruce's (bitmap).

Using a B-tree
--------------
When a backend wants to lock a tuple, it sets a bit in the tuple's
infomask. Then it inserts into a btree in a special tablespace, using
RelationId-BlockNumber-OffsetNumber as the key and
BackendId-TransactionId as the value; actually, an array with a single
element containing those two values.
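
For concreteness, a sketch of that entry layout (the struct names are
mine; Oid, BlockNumber, OffsetNumber, BackendId and TransactionId are
the usual system types):

typedef struct LockTupleKey
{
    Oid             relid;      /* RelationId of the table */
    BlockNumber     blkno;      /* heap block of the tuple */
    OffsetNumber    offnum;     /* line pointer within the block */
} LockTupleKey;

typedef struct LockTupleLocker
{
    BackendId       backendId;  /* backend holding the shared lock */
    TransactionId   xid;        /* its (top-level) transaction */
} LockTupleLocker;

typedef struct LockTupleValue
{
    int             nlockers;   /* length of the array */
    LockTupleLocker lockers[1]; /* VARIABLE LENGTH ARRAY */
} LockTupleValue;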

When a backend wants to lock a tuple that is already locked, it goes
to the btree and inserts itself into the array. To do this, it inserts
a new index item (the enlarged array) and deletes the previous one. No
other backend can try to insert simultaneously (which would cause an
ugly race condition), because we hold an exclusive lock on the tuple's
heap page's buffer.

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

When someone wants to know whether the tuple is locked (to mark it FOR
UPDATE, or to delete it), it checks the infomask. If that says it's
locked, it goes on to check the btree. If the array contains only
BackendId-TransactionId pairs whose transactions are no longer
running, then the tuple is not locked and can be deleted/marked (and
the btree entry can be cleaned up). Otherwise, it will have to wait,
using XactLockTableWait, for the first transaction in the array that
is still running. We can be sure that no one will try to share-lock
the tuple while we check the btree, because we hold an exclusive lock
on the tuple's heap page's buffer.
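
In code the check could look more or less like this (a sketch over the
LockTupleValue layout above; LockerIsRunning() is the hypothetical
BackendId check discussed in the next paragraph, XactLockTableWait is
the existing routine):

/*
 * Returns true if no locker is left; otherwise reports the first
 * still-running transaction, which the caller should wait for with
 * XactLockTableWait() (presumably after releasing the buffer lock).
 */
static bool
TupleSharersGone(LockTupleValue *val, TransactionId *wait_xid)
{
    int     i;

    for (i = 0; i < val->nlockers; i++)
    {
        if (LockerIsRunning(val->lockers[i].backendId,
                            val->lockers[i].xid))
        {
            *wait_xid = val->lockers[i].xid;
            return false;
        }
    }

    /* nobody running: delete/mark the tuple, clean up the btree entry */
    return true;
}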

Note that to check whether a transaction is running we need to lock
SInvalLock. To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, but
only look at the one entry we need. Another possibility would be to
use the stock TransactionIdIsInProgress and save the extra 4 bytes of
storage.
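
Something like this toy model, where BackendCurrentXid() is made up
and stands for looking at just that one procState entry (SInvalLock,
LWLockAcquire/LWLockRelease and TransactionIdEquals are existing
names):

static bool
LockerIsRunning(BackendId backendId, TransactionId xid)
{
    TransactionId   curxid;

    LWLockAcquire(SInvalLock, LW_SHARED);
    /* look only at shmInvalBuffer->procState[] for this backend */
    curxid = BackendCurrentXid(backendId);
    LWLockRelease(SInvalLock);

    /* still locked only if that backend is still on the same xact */
    return TransactionIdEquals(curxid, xid);
}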

At server restart, the btree is created empty (or just deleted). There
is one btree per database.

Using a Bitmap
--------------
First we create a counter, the shared-lock row counter. Then we create
a file like pg_clog, in which each counter slot has a bit for every
backend. When we want to share-lock a row, we increment the counter,
put that counter value on the row, and set our backend's bit in the
new file. We also store that counter value in our backend-local
memory. On commit we go through that local list and reset all our bits
for those counters. When a slot's bits are all zero, it can be
recycled, as we do with pg_clog.
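
To make the addressing concrete, a small sketch of how a backend's bit
for a given counter value could be located and flipped, pg_clog style
(the names and page size are illustrative; MaxBackends and BackendId
are the existing ones):

#define SHLOCK_PAGE_SIZE    8192    /* bytes per page, as in pg_clog */

/* locate the bit for (counter value, backend) in the shared-lock file */
static void
shlock_locate(uint32 ctr, BackendId backendId,
              int *pageno, int *byteno, int *bitno)
{
    uint64  bitpos = (uint64) ctr * MaxBackends + (backendId - 1);

    *pageno = bitpos / (SHLOCK_PAGE_SIZE * 8);
    *byteno = (bitpos % (SHLOCK_PAGE_SIZE * 8)) / 8;
    *bitno = bitpos % 8;
}

/* set the bit when taking the shared lock, clear it at commit */
static void
shlock_set_bit(char *page, int byteno, int bitno, bool set)
{
    if (set)
        page[byteno] |= (1 << bitno);
    else
        page[byteno] &= ~(1 << bitno);
}

With this layout each counter slot is MaxBackends bits wide, which is
where the max_connections question below comes from.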

Problems and random comments
============================

There is a possibility of starvation if somebody wants to lock a tuple
exclusively and shared lockers keep arriving. Not sure how to solve
this.

The wakeup mechanism is not discussed ... is there a special need for
something beyond what we can do with XactLockTable{Insert,Wait}?

Thanks to tablespaces, it's very easy to create special Relations that
can be handled by the standard buffer and storage managers, etc.

The btree idea:
- does not need crash recovery. Maybe we could use a stripped-down
version of nbtree. This could cause a maintainability nightmare.

- can't hold more than 300 or so simultaneous lockers (because the
value length is limited to 1/3 of a page). I doubt this is a real
problem.

- could have problems (excessive storage requirements) in the long run
because of empty or almost-empty pages.

The bitmap idea:
- seems to have lower overhead

- can use the same lazy cleanup mechanism described for the btree idea
(in which case we don't need the list in local memory).

- What happens in the presence of large max_connections settings? Is
this a real problem?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Oh, oh, the Galacian girls will do it for pearls,
And those of Arrakis for water! But if you seek ladies
Who burn like flames, try a daughter of Caladan! (Gurney Halleck)
