Bugs in TOAST handling, OID assignment and redo recovery

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Bugs in TOAST handling, OID assignment and redo recovery
Date: 2018-04-10 10:29:05
Message-ID: CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
Lists: pgsql-hackers

Hello,

One of our 2ndQuadrant support customers recently reported a sudden rush of
TOAST errors after a crash recovery, nearly causing an outage. Most errors
read like this:

ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While we could bring the cluster back to normal quickly using some
workarounds, I investigated this in more detail and identified two
long-standing bugs, one in TOAST handling and one in redo recovery.

Immediately after the crash recovery, the server started issuing OIDs that
conflicted with the OIDs issued just prior to the crash. The XLOG_NEXTOID
and buffer LSN interlock is supposed to guard against that, but a bug in
redo recovery breaks that protection. We shall see that in a minute.

When duplicate OIDs are issued, either because of a bug in redo recovery or
because of OID counter wraparound, GetNewOidWithIndex() is supposed to
protect us by first scanning the toast table to check whether the chunk_id
(toast value) already exists. If it does, a new OID is requested, and this
is repeated until a non-conflicting OID is found.
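
For context, the retry loop looks roughly like this (a paraphrase of
GetNewOidWithIndex() in catalog.c, shown only to make the SnapshotDirty
check explicit; exact details vary across branches):

/*
 * Rough paraphrase of GetNewOidWithIndex() from catalog.c -- not the
 * verbatim source.
 */
Oid
GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
{
    Oid         newOid;
    SnapshotData SnapshotDirty;
    SysScanDesc scan;
    ScanKeyData key;
    bool        collides;

    InitDirtySnapshot(SnapshotDirty);

    do
    {
        CHECK_FOR_INTERRUPTS();

        /* claim the next value from the shared OID counter */
        newOid = GetNewObjectId();

        /* is there already a row with this OID? (uses SnapshotDirty) */
        ScanKeyInit(&key,
                    oidcolumn,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(newOid));

        scan = systable_beginscan(relation, indexId, true,
                                  &SnapshotDirty, 1, &key);

        collides = HeapTupleIsValid(systable_getnext(scan));

        systable_endscan(scan);
    } while (collides);

    return newOid;
}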

This mostly works, except in one corner case. GetNewOidWithIndex() uses
SnapshotDirty to look for a conflicting OID. But this fails if the existing
toast tuple was recently deleted by another, already committed transaction.
If OldestXmin has not advanced enough, such a tuple is merely RECENTLY_DEAD,
and it fails the SnapshotDirty qualifications. So toast_save_datum() happily
goes ahead and inserts a conflicting toast tuple. The UNIQUE index on
<chunk_id, chunk_seq> does not complain because it too sees the other tuple
as RECENTLY_DEAD. At this point, we have two toast tuples with the same
<chunk_id, chunk_seq> - one of them RECENTLY_DEAD and the other LIVE. Later,
when toast_fetch_datum() and friends scan the toast table for the chunk_id
(toast value), they use SnapshotToast to retrieve the tuples. This special
snapshot doesn't do any visibility checks on the toast tuples, assuming that
since the main heap tuple is visible, the toast tuple(s) must be visible
too. Unfortunately, this snapshot also sees the RECENTLY_DEAD toast tuple
and thus returns duplicate toast tuples, leading to errors such as the one
above. The same issue can also lead to other kinds of errors in
tuptoaster.c.

ISTM that the right fix is to teach toast_save_datum() to check for the
existence of a duplicate chunk_id by scanning the toast table with the same
SnapshotToast that it later uses to fetch the tuples. We already do that in
the toast-rewrite case, but not for regular inserts. I propose to do it for
the regular path too, as done in the attached patch.
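
In outline, the regular insert path would then pick its OID the same way
the rewrite path already does, something along these lines (a sketch only;
the helper name follows the existing rewrite-path code and the attached
patch is the authoritative version):

/*
 * Sketch of the proposed toast_save_datum() change for the regular insert
 * path.  toastrel_valueid_exists() stands for an existence check that
 * scans the toast table with SnapshotToast, i.e. the same snapshot used
 * later by toast_fetch_datum().
 */
do
{
    toast_pointer.va_valueid =
        GetNewOidWithIndex(toastrel,
                           RelationGetRelid(toastidx),
                           (AttrNumber) 1);
} while (toastrel_valueid_exists(toastrel, toast_pointer.va_valueid));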

Duplicate OIDs may also be generated when the OID counter wraps around, and
that alone can cause this problem:

- OID counter wraps around
- Old toast tuple is deleted and the deleting transaction commits
- OldestXmin is held back by an open transaction
- The same OID is then inserted again into the toast table
- The toasted value is fetched before the table is vacuumed or the index
pointer is killed. This leads to duplicates being returned.

This is probably quite a corner case, but not impossible, given that we
already have reports of such problems. The problem gets accentuated after a
crash recovery or on a Hot standby, though, because of another, somewhat
unrelated bug which manifested in our case.

An interesting case to consider is the following sequence of events:

1. ONLINE checkpoint starts. We note down the nextOid counter, correctly
accounting for the cached values too. CreateCheckPoint() does this:

    LWLockAcquire(OidGenLock, LW_SHARED);
    checkPoint.nextOid = ShmemVariableCache->nextOid;
    if (!shutdown)
        checkPoint.nextOid += ShmemVariableCache->oidCount;
    LWLockRelease(OidGenLock);

2. The cached OIDs are then consumed, and a new range of OIDs is allocated,
generating an XLOG_NEXTOID WAL record.

3. More OIDs are consumed from the newly allocated range. Those OIDs make it
to disk only after the buffer LSN interlock is honoured, so the XLOG_NEXTOID
record gets flushed to disk as well.

4. The checkpoint finishes and writes an ONLINE checkpoint record with the
nextOid value saved in step 1.

5. CRASH

Redo recovery starts at some LSN written before step 1. It initialises
ShmemVariableCache->nextOid with the value stored in the checkpoint record
and starts replaying WAL records. When it sees the XLOG_NEXTOID record
generated at step 2, it correctly advances the OID counter to cover the
next range.
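
For reference, XLOG_NEXTOID replay in xlog_redo() does roughly this
(paraphrased, not the verbatim source; details vary by branch):

if (info == XLOG_NEXTOID)
{
    Oid         nextOid;

    /* believe the record exactly and advance the shared counter */
    memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
    LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
    ShmemVariableCache->nextOid = nextOid;
    ShmemVariableCache->oidCount = 0;
    LWLockRelease(OidGenLock);
}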

But while applying the ONLINE checkpoint record written at step 4, we do
this:

    /* ... but still treat OID counter as exact */
    LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
    ShmemVariableCache->nextOid = checkPoint.nextOid;
    ShmemVariableCache->oidCount = 0;
    LWLockRelease(OidGenLock);

So effectively we overwrite the nextOid counter and drag it backwards
again. If we don't see another XLOG_NEXTOID record before crash recovery
ends, we start the server with a stale value, generating duplicate OIDs
immediately after the restart.

I don't think this is an unusual case, and I'm surprised that we didn't
notice it before. Most likely the other protections in the system that
guard against wrapped-around OIDs masked the bug. The TOAST table, though,
escapes those protections in the specific case discussed above, which
brought the bug out into the open.

I think the right fix is to simply ignore the nextOid counter while
replaying an ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid to the value stored in this (or some previous)
checkpoint record when redo recovery started. As and when we see an
XLOG_NEXTOID record, we maintain the shared-memory counter correctly. If we
don't see any XLOG_NEXTOID record, the value we started with must be the
correct value. I see no problem even when the OID counter wraps around
during redo recovery; XLOG_NEXTOID replay should handle that correctly.
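
Concretely, the XLOG_CHECKPOINT_ONLINE branch of xlog_redo() would simply
stop touching the OID counter. A sketch of the intent (the attached patch
is the authoritative version):

else if (info == XLOG_CHECKPOINT_ONLINE)
{
    CheckPoint  checkPoint;

    memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));

    /* ... nextXid, multixact and other counters handled as before ... */

    /*
     * Proposed change (sketch): do NOT copy checkPoint.nextOid into
     * ShmemVariableCache->nextOid here.  The counter was initialised from
     * the checkpoint record redo started at, and every XLOG_NEXTOID record
     * replayed since then has kept it current; overwriting it from an
     * online checkpoint can only drag it backwards.
     */
}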

Curiously, neither REINDEX nor VACUUM will fix this until OldestXmin
advances enough that the RECENTLY_DEAD tuples are ignored by REINDEX or
removed by VACUUM. Once OldestXmin has advanced, a REINDEX or a VACUUM of
the toast table should clear these transient errors. Also, once manifested,
the errors stay until a REINDEX or VACUUM is performed.

We tested this at least as far back as 9.1 and the bugs exist there too. In
fact, these bugs have probably existed forever, though I did not check very
old releases.

Attached is a simple reproducer and a proposed fix to address both bugs. We
should consider backpatching it to all supported releases.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
pg_toast_corruption_v3.patch application/octet-stream 2.5 KB
toast_corruption.sh text/x-sh 3.3 KB
