Re: Disaster!

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Disaster!
Date: 2004-01-26 18:23:09
Message-ID: 200401261823.i0QIN9l14874@candle.pha.pa.us
Lists: pgsql-hackers


Excellent analysis. Thanks. Are there any other cases like this?

---------------------------------------------------------------------------

Tom Lane wrote:
> Okay ... Chris was kind enough to let me examine the WAL logs and
> postmaster stderr log for his recent problem, and I believe that
> I have now achieved a full understanding of what happened. The true
> bug is indeed somewhere else than slru.c, and we would not have found
> it if slru.c had had less-paranoid error checking.
>
> The WAL log shows that checkpoints were happening every five minutes
> up to 2004-01-23 10:13:10, but no checkpoint completion record appears
> after that. However, the system remained up, with plenty of activity,
> until 10:45:24, when it was finally taken down by a panic. The last
> transaction commit records in the WAL log are
>
> commit: 14286807 at 2004-01-23 10:45:23
> commit: 14286811 at 2004-01-23 10:45:24
> commit: 14286814 at 2004-01-23 10:45:24
> commit: 14286824 at 2004-01-23 10:45:24
> commit: 14286825 at 2004-01-23 10:45:24
> commit: 14286836 at 2004-01-23 10:45:24
> commit: 14286838 at 2004-01-23 10:45:24
> commit: 14286850 at 2004-01-23 10:45:24
> commit: 14286851 at 2004-01-23 10:45:24
>
> Over in the postmaster log, the first sign of trouble is
>
> Jan 23 10:18:07 canaveral postgres[20039]: [879-1] LOG: could not close temporary statistics file "/usr/local/pgsql/data/global/pgstat.tmp.20035": No space left on device
>
> and there is a steady stream of transactions failing with out-of-space
> errors over the next half hour, but none of the failures are worse than
> a transaction abort. Finally we see
>
> Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
> Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
> Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
> Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
> Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
> Jan 23 10:45:24 canaveral postgres[20035]: [5-1] LOG: server process (PID 57237) was terminated by signal 6
> Jan 23 10:45:24 canaveral postgres[20035]: [6-1] LOG: terminating any other active server processes
>
> after which the postmaster's recovery attempts fail, as Chris already
> detailed. (Note: the reference to "transaction 0" is not significant;
> that just happens because SimpleLruWritePage doesn't have a specific
> transaction number to blame its write failures on.)
>
> Those are the observed facts; what's the interpretation? I think it
> shows that Postgres is pretty darn robust, actually. We were able to
> stay up and do useful work for quite a long time with zero free space;
> what's more, we lost no transactions that were successfully committed.
> The data was successfully stored in preallocated WAL space. (If things
> had gone on this way for a while longer, we would have panicked for lack
> of WAL space, but Chris was actually not anywhere near there; he'd only
> filled about two WAL segments in the half hour of operations.) Note
> also that checkpoints were attempted several times during that interval,
> and they all failed gracefully --- no panic, no incorrect WAL update.
>
> But why did this panic finally happen? The key observation is that
> the first nonexistent page of pg_clog was the page beginning with
> transaction 14286848. Neither this xact nor the following one has any
> commit or abort record in WAL, but we do see entries for 14286850 and
> 14286851. It is also notable that there is no WAL entry for extension
> of pg_clog to include this page --- normally a WAL entry is made each
> time a page of zeroes is added to pg_clog. My interpretation of the
> sequence of events is:
>
> Transaction 14286848 started, and since it was the first for its pg_clog
> page, it tried to do ZeroCLOGPage() for that page (see ExtendCLOG). This
> required making room in the in-memory clog buffers, which required
> dumping one of the previously-buffered clog pages, which failed for lack
> of disk space, leading to this log entry:
>
> Jan 23 10:45:24 canaveral postgres[57237]: [17-1] ERROR: could not access status of transaction 0
> Jan 23 10:45:24 canaveral postgres[57237]: [17-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
> Jan 23 10:45:24 canaveral postgres[57237]: [18-1] WARNING: AbortTransaction and not in in-progress state
>
> (Note: page offset 147456 is the page two before the one containing xid
> 14286848. This page had been allocated in clog buffers but never yet
> successfully written to disk. Ditto for the page in between.) The next
> thing that happened was that transaction xids 14286849 and 14286850 were
> assigned (ie, those xacts started), and then 14286850 tried to commit.
> This again led to a failed attempt to write out a clog page, but this
> time the error was promoted to a panic because it happened inside the
> transaction commit critical section:
>
> Jan 23 10:45:24 canaveral postgres[57237]: [19-1] PANIC: could not access status of transaction 0
> Jan 23 10:45:24 canaveral postgres[57237]: [19-2] DETAIL: could not write to file "/usr/local/pgsql/data/pg_clog/000D" at offset 147456: No space left on device
>
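As a cross-check on the page arithmetic in the note above, here is a small
standalone sketch (illustrative code, not anything from clog.c; it just
assumes the usual 8 kB clog pages, two status bits per transaction, and 32
pages per segment file):

    #include <stdio.h>

    #define BLCKSZ              8192            /* assumed clog page size */
    #define CLOG_XACTS_PER_PAGE (BLCKSZ * 4)    /* 2 status bits per xact => 4 xacts per byte */
    #define PAGES_PER_SEGMENT   32              /* assumed pages per segment file */

    int
    main(void)
    {
        unsigned xids[] = {14286848, 14286850, 14286851};

        for (int i = 0; i < 3; i++)
        {
            unsigned xid = xids[i];
            unsigned page = xid / CLOG_XACTS_PER_PAGE;          /* global clog page number */
            unsigned seg = page / PAGES_PER_SEGMENT;            /* segment file number */
            unsigned off = (page % PAGES_PER_SEGMENT) * BLCKSZ; /* byte offset within that file */

            printf("xid %u -> page %u, segment %04X, offset %u, first on page: %s\n",
                   xid, page, seg, off,
                   (xid % CLOG_XACTS_PER_PAGE == 0) ? "yes" : "no");
        }
        return 0;
    }

Under those assumptions, xid 14286848 is exactly the first transaction of
clog page 436, which lives at offset 163840 in segment 000D, while offset
147456 in that same file is page 434, two pages earlier, matching the note
above; 14286850 and 14286851 also fall on page 436.
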
> The final commit record in WAL, from xid 14286851, must have come from a
> different backend that was able to get that far in its commit sequence
> before hearing the all-hands-abandon-ship signal from the postmaster.
> (AFAICT it was just chance that the same backend process was responsible
> for both 14286848 and 14286850. Presumably 14286849 was taken out by
> yet another backend that hadn't gotten as far as trying to commit.)
>
> After Chris had freed some disk space, WAL replay was able to create the
> clog page at offset 147456, because there was a clog-extension WAL entry
> for it within the WAL entries since the last successful checkpoint. It
> was also able to correctly fill that page using the transaction commit
> data in WAL. Likewise for the page after that. But when it got to the
> commit record for 14286850, the error checks in slru.c barfed because
> there was no such page, thus exposing the real problem: there wasn't a
> clog extension WAL record for that page.
>
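The invariant replay depends on here can be put in one line: a commit record
may only be applied if a clog-extension (zero-page) record for its clog page
appeared earlier in the WAL stream. A toy model of that check, purely to make
the failure mode concrete (this is not slru.c's actual code, and it ignores
pages that already existed on disk before the last checkpoint):

    #include <stdbool.h>
    #include <stdio.h>

    #define XACTS_PER_PAGE 32768        /* assumed: 8 kB page, 2 bits per xact */
    #define MAX_PAGES      1024

    static bool page_created[MAX_PAGES];    /* clog pages zero-extended so far in replay */

    /* Replay of a clog-extension ("zero page") WAL record. */
    static void
    replay_zeropage(unsigned page)
    {
        page_created[page] = true;
    }

    /* Replay of a commit record: its clog page must already exist. */
    static void
    replay_commit(unsigned xid)
    {
        unsigned page = xid / XACTS_PER_PAGE;

        if (!page_created[page])
        {
            printf("replay fails: commit of %u needs clog page %u, "
                   "but no extension record was ever logged for it\n", xid, page);
            return;
        }
        printf("commit of %u applied to clog page %u\n", xid, page);
    }

    int
    main(void)
    {
        replay_zeropage(434);    /* the page at offset 147456: extension was in WAL */
        replay_zeropage(435);    /* the page in between: extension was in WAL */
        replay_commit(14286838); /* lands on page 435: replays cleanly */
        replay_commit(14286850); /* lands on page 436: no extension record */
        return 0;
    }
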
> In short, the bug is in GetNewTransactionId(), which shorn of extraneous
> details looks like
>
> LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
>
> xid = ShmemVariableCache->nextXid;
>
> TransactionIdAdvance(ShmemVariableCache->nextXid);
>
> ExtendCLOG(xid);
>
> LWLockRelease(XidGenLock);
>
> and the correct fix is to swap the order of the TransactionIdAdvance and
> ExtendCLOG lines. Because these lines are out of order, a failure
> occurring down inside ExtendCLOG leaves the shared-memory copy of
> nextXid already advanced, and so subsequent transactions coming through
> this bit of code will see that they are not the first transaction in
> their page and conclude that they need not do any work to extend clog.
> With the operations in the correct order, ExtendCLOG failure will leave
> the counter unchanged, so every subsequent transaction will try to do
> ExtendCLOG and will fail until some disk space becomes available. (Note
> that this code is *not* in a critical section, indeed it's not yet
> inside a transaction at all, and so failure here does not mean a panic.)
>
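In other words, with the swap applied the sequence becomes (same simplified
shape as the excerpt above, still shorn of the real function's other details;
the comments are mine):

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    /*
     * Extend pg_clog (zero the new page and emit its WAL extension record)
     * BEFORE advancing the shared nextXid.  ExtendCLOG only does real work
     * for the first xid of a clog page, so if it fails here, leaving nextXid
     * unchanged means the next transaction through this code is still "first
     * on its page" and will retry the extension rather than assume the page
     * already exists.
     */
    ExtendCLOG(xid);

    TransactionIdAdvance(ShmemVariableCache->nextXid);

    LWLockRelease(XidGenLock);
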
> If you like you can think of the real problem with this code as being
> that it violates the basic WAL rule: "make a WAL entry before making
> permanent changes". nextXid isn't on disk, but it's persistent
> shared-memory state, and it mustn't be updated until any WAL entries
> associated with that action have been made.
>
> Any questions?
>
> regards, tom lane
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
