We have got a serious problem with pg_clog/WAL synchronization

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Subject: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-10 16:17:11
Message-ID: 29031.1092154631@sss.pgh.pa.us
Lists: pgsql-hackers

While investigating Satoshi Okada's bug report here
http://archives.postgresql.org/pgsql-hackers/2004-08/msg00510.php
I realized that it actually represents a crash-safety risk that has
existed since 7.2.

<lecture>
Allow me to refresh your memory about the principles of write-ahead
logging. The one that everyone remembers is "a WAL entry must hit disk
before any of the data changes it describes". But there is a different
constraint that must also be met, which is "a checkpoint must flush all
data changes of preceding WAL entries". In detail, a checkpoint does:

1. Note the current end-of-WAL position (where the next WAL entry will
be made). This is the checkpoint's "REDO" pointer.

2. Flush all dirty buffers to disk, and fsync all changes.

3. Add a checkpoint record to WAL, and flush it to disk.

(Note that when there is concurrent activity, other WAL records may be
added to WAL between steps 1 and 3, so that the checkpoint record's
physical location is later than its REDO pointer. This is okay. The
added records are logically after the checkpoint, even though physically
located before it in the WAL data.)

If we now suffer a crash, log replay will be executed starting at the
REDO point. Since records before the REDO point will not be replayed,
it is critical that the "flush" operations in step 2 have written all
the effects of such records to disk.
</lecture>
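
To make the three checkpoint steps and the replay rule concrete, here is a
minimal toy sketch in Python. It is not PostgreSQL code: the ToyDB class, its
method names, and the list-as-WAL representation are all made up for
illustration, and the REDO pointer is simply an index into the list.

```python
# Toy model of a checkpoint and crash recovery. All names are illustrative
# assumptions; nothing here is taken from the PostgreSQL source.

class ToyDB:
    def __init__(self):
        self.wal = []      # durable log: ("set", key, value) or ("checkpoint", redo)
        self.buffers = {}  # dirty in-memory pages, lost on crash
        self.disk = {}     # durable data files

    def write(self, key, value):
        # WAL rule: the log record is durable before the data change it describes.
        self.wal.append(("set", key, value))
        self.buffers[key] = value

    def checkpoint(self):
        redo = len(self.wal)                   # step 1: note end-of-WAL (REDO pointer)
        self.disk.update(self.buffers)         # step 2: flush all dirty buffers
        self.buffers.clear()
        self.wal.append(("checkpoint", redo))  # step 3: add the checkpoint record

    def crash_and_recover(self):
        self.buffers.clear()                   # a crash loses everything not on disk
        redo = 0                               # replay from the start if no checkpoint
        for rec in self.wal:
            if rec[0] == "checkpoint":
                redo = rec[1]                  # keep the most recent REDO pointer
        for rec in self.wal[redo:]:            # records before REDO are NOT replayed
            if rec[0] == "set":
                self.disk[rec[1]] = rec[2]
        return redo


db = ToyDB()
db.write("a", 1)      # logged before the checkpoint's REDO point
db.checkpoint()       # must flush "a", since replay will skip its record
db.write("b", 2)      # logged after REDO, recovered by replay
redo = db.crash_and_recover()
```

In this run the record for "a" sits before the REDO point, so it survives
the crash only because step 2 flushed it; "b" survives only because replay
starts at REDO and re-applies its record.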

Satoshi-san's bug report shows a way to cause the system to sometimes
violate this constraint. In particular, what I saw was a transaction
commit WAL record that was just before the REDO pointer of a checkpoint,
but the pg_clog status update for it had not been flushed to disk by the
checkpoint. The reason this is possible is that RecordTransactionCommit
first writes the commit record, and fsyncs it, and only then goes and
makes the pg_clog status update in shared memory. There is thus a
window for a checkpoint to start, note its REDO point (after the
commit), and flush the current contents of pg_clog buffers out to disk
before the transaction has updated its state in pg_clog.
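
The interleaving can be reproduced in a small single-threaded toy model.
This is only a sketch under simplifying assumptions: the function names
(commit_write_wal, commit_update_clog, and so on) are hypothetical, the two
halves of the commit are split apart so the checkpoint can be run between
them by hand, and recovery is handed the REDO pointer directly.

```python
# Toy model of the commit/checkpoint race. Names and data structures are
# illustrative assumptions, not PostgreSQL source code.

wal = []        # durable WAL records
clog_mem = {}   # pg_clog pages in shared memory (lost on crash)
clog_disk = {}  # pg_clog contents as last flushed to disk

def commit_write_wal(xid):
    wal.append(("commit", xid))        # commit record written and fsynced

def commit_update_clog(xid):
    clog_mem[xid] = "committed"        # status update in shared memory only

def checkpoint():
    redo = len(wal)                    # step 1: establish the REDO pointer
    clog_disk.update(clog_mem)         # step 2: flush current pg_clog buffers
    wal.append(("checkpoint", redo))   # step 3: write the checkpoint record
    return redo

def crash_and_recover(redo):
    clog_mem.clear()                   # shared memory is gone after a crash
    for rec in wal[redo:]:             # replay starts at the REDO pointer
        if rec[0] == "commit":
            clog_disk[rec[1]] = "committed"

xid = 100
clog_mem[xid] = "in progress"

commit_write_wal(xid)       # first half of the commit
redo = checkpoint()         # checkpoint runs entirely inside the window
commit_update_clog(xid)     # second half of the commit, after the flush
crash_and_recover(redo)

status_after_crash = clog_disk[xid]
```

Here the commit record is durable and lies before the REDO pointer, yet
status_after_crash is "in progress": replay never revisits the record, and
the flushed pg_clog page predates the status update.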

This has been broken since the original design of pg_clog in 7.2 :-(.
I fear I have to take the blame for it.

(Just to add insult to injury: if you enable commit_delay then the sleep
occurs during the window of vulnerability.)

What I am thinking of doing to fix the problem is to introduce
a new LWLock that RecordTransactionCommit will take a shared lock on
before writing its WAL record, and not release until it has updated
pg_clog. Checkpoint start will acquire the lock exclusively just long
enough to do its step 1 (establish REDO point). This is slightly
annoying since it means one more high-traffic lock to grab during
commit, but I don't see any other good solution. We will certainly have
to back-patch this into 7.4 and I suppose we should think about issuing
new 7.3 and 7.2 releases as well.
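
As a rough sketch of the locking protocol (not a patch), the following
Python model uses a hand-rolled shared/exclusive lock as a stand-in for an
LWLock. The class SharedExclusiveLock, the name commit_lock, and the Event
used to force the checkpoint to start inside the window are all assumptions
made for this illustration; the sleep merely widens the window, much as
commit_delay would.

```python
import threading
import time

class SharedExclusiveLock:
    """Minimal shared/exclusive lock; a hypothetical stand-in for an LWLock."""
    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0
        self._exclusive = False

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._shared:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

wal, clog_mem, clog_disk = [], {100: "in progress"}, {}
commit_lock = SharedExclusiveLock()   # hypothetical name for the new lock
wal_written = threading.Event()       # test aid: marks the start of the window
result = {}

def record_transaction_commit(xid):
    commit_lock.acquire_shared()      # taken before writing the WAL record
    try:
        wal.append(("commit", xid))   # commit record written and fsynced
        wal_written.set()             # the checkpoint now tries to start
        time.sleep(0.05)              # widen the window on purpose
        clog_mem[xid] = "committed"   # pg_clog update, still under the lock
    finally:
        commit_lock.release_shared()

def checkpoint():
    wal_written.wait()                # start inside the commit's window
    commit_lock.acquire_exclusive()   # blocks until no commit is in the window
    try:
        redo = len(wal)               # step 1 only is done under the lock
    finally:
        commit_lock.release_exclusive()
    clog_disk.update(clog_mem)        # step 2: flush pg_clog buffers
    wal.append(("checkpoint", redo))  # step 3: write the checkpoint record
    result["redo"] = redo

threads = [threading.Thread(target=record_transaction_commit, args=(100,)),
           threading.Thread(target=checkpoint)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the checkpoint cannot establish its REDO point while any commit
holds the lock in shared mode, every commit record before REDO already has
its pg_clog update in shared memory by the time step 2 flushes. Commits
still run concurrently with each other, since they only take the lock in
shared mode. This toy lock makes no attempt at fairness, so a steady stream
of commits could starve the checkpoint.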

regards, tom lane
