Re: Bugs in TOAST handling, OID assignment and redo recovery

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bugs in TOAST handling, OID assignment and redo recovery
Date: 2018-04-10 13:37:55
Message-ID: d68b7715-d25b-7a0f-477d-6c4816a15ac0@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/04/18 13:29, Pavan Deolasee wrote:
> Hello,
>
> One of our 2ndQuadrant support customers recently reported a sudden rush of
> TOAST errors post a crash recovery, nearly causing an outage. Most errors
> read like this:
>
> ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn
>
> While we could bring back the cluster to normal quickly using some
> workarounds, I investigated this in more detail and identified two long
> standing bugs in TOAST as well as redo recovery.

Great detective work!

> ISTM that the right fix is to teach toast_save_datum() to check for
> existence of duplicate chunk_id by scanning the table with the same
> SnapshotToast that it later uses to fetch the tuples. We already do that in
> case of toast rewrite, but not for regular inserts. I propose to do that
> for regular path too, as done in the attached patch.

It would seem more straightforward to add a snapshot parameter to
GetNewOidWithIndex(), so that the this one caller could pass
SnapshotToast, while others pass SnapshotDirty. With your patch, you
check the index twice: first with SnapshotDirty, in
GetNewOidWithIndex(), and then with SnapshotToast, in the caller.

If I'm reading the rewrite case correctly, it's a bit different and
quite special. In the loop with GetNewOidWithIndex(), it needs to check
that the OID is unused in two tables, the old TOAST table, and the new
one. You can only pass one index to GetNewOidWithIndex(), so it needs to
check the second index manually. It's not because of the snapshot issue.
Although I wonder if we should be using SnapshotToast in that
GetNewOidWithIndex() call, too. I.e. if we should be checking both the
old and the new toast table with SnapshotToast.

> I think the right fix is to simply ignore the nextOid counter while
> replaying ONLINE checkpoint record. We must have already initialised
> ShmemVariableCache->nextOid
> to the value stored in this (or some previous) checkpoint record when redo
> recovery is started. As and when we see XLOG_NEXTOID record, we would
> maintain the shared memory counter correctly. If we don't see any
> XLOG_NEXTOID record, the value we started with must be the correct value. I
> see no problem even when OID wraps around during redo recovery.
> XLOG_NEXTOID should record that correctly.

Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the
value in the online checkpoint record is greater than
ShmemVariableCache->nextXid. But we don't have such a wraparound-aware
concept of "greater than" for OIDs. Ignoring the online checkpoint
record's nextOid value seem fine to me.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-04-10 13:46:32 Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.
Previous Message David Steele 2018-04-10 13:35:39 Re: [PATCH] Add missing type conversion functions for PL/Python