WAL dirty-buffer management bug

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WAL dirty-buffer management bug
Date: 2006-03-30 18:51:56
Message-ID: 16678.1143744716@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I added some notes to src/backend/access/transam/README explaining the
protocol for executing a WAL-logged action:

---

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next two steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location. For instance,

recptr = XLogInsert(rmgr_id, info, rdata);

PageSetLSN(dp, recptr);
PageSetTLI(dp, ThisTimeLineID);

5. END_CRIT_SECTION()

6. Unlock and write the buffer(s):

LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
WriteBuffer(buffer);

(Note: WriteBuffer doesn't really "write" the buffer anymore, it just marks it
dirty and unpins it. The write will not happen until a checkpoint occurs or
the shared buffer is needed for another page.)

---

This is pretty much what heapam and btree currently do, but on looking
at it I think it's got a problem: we really ought to mark the buffer
dirty before releasing the critical section. Otherwise, if there's an
elog(ERROR) before the WriteBuffer call is reached, the backend would go
on about its business, and we'd have changes in a disk buffer that isn't
marked dirty. The changes would be uncommitted, presumably, because of
the error --- but nonetheless this could result in inconsistency down
the road. One example scenario is:
1. We insert a tuple at, say, index 3 on a page.
2. elog after making the XLOG entry, but before WriteBuffer.
3. page is later discarded from shared buffers; since it's not
marked dirty, it'll just be dropped without writing it.
4. Later we need to insert another tuple in same table, and
we again choose index 3 on this page as the place to put it.
5. system crash leads to replay from WAL.
Now we'll have two different WAL records trying to insert tuple 3.
Not good.

While there's not much chance of an elog in just LockBuffer/WriteBuffer,
some of the code paths in btree do quite a bit of stuff before they
write the buffer (this was probably done to avoid multiple entries to
bufmgr, back when the BufMgrLock was a severe source of contention).
So I think there's a nontrivial risk of index corruption from this
coding practice.

I'm thinking we should change the code and the README to specify that
you must mark the buffer dirty before you can END_CRIT_SECTION().
Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonah H. Harris 2006-03-30 19:28:55 Re: pg_class catalog question...
Previous Message Heikki Linnakangas 2006-03-30 18:38:26 Re: Index vacuum improvements