From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: maumau307(at)gmail(dot)com, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
Date: 2018-08-20 15:00:09
Message-ID: CAPpHfdvqWECmi6SWt8K3p16GtObpRgyAGuKzan4w2HGRoFiK=Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 18, 2018 at 11:49 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
> > Relation truncation throws away the page image in memory without ever
> > writing it to disk. Then, if the subsequent file truncate step fails,
> > we have a problem, because anyone who goes looking for that page will
> > fetch it afresh from disk and see the tuples as live.
>
> > There are WAL entries recording the row deletions, but that doesn't
> > help unless we crash and replay the WAL.
>
> > It's hard to see a way around this that isn't fairly catastrophic for
> > performance :-(.
>
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL. Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable. It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

We have had only two complaints of data corruption in nine years.  But I
suspect that in the vast majority of cases the truncation error either
didn't cause corruption or the corruption went unnoticed.  So, once we
introduce a PANIC here, we would get far more complaints.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely. PANIC'ing again will not
> do. I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.
>
> Or maybe just do that in the mainline case too? That is, if ftruncate
> fails, handle it by zeroing the undeletable pages and pressing on?

I've only just started really digging into this set of problems, but so
far this idea looks good to me.
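
Just to make that fallback concrete, here is a rough standalone sketch
(plain POSIX, nothing like the real smgr/md.c code; BLCKSZ, the file name
and the function name are only placeholders for illustration): if
ftruncate() fails, overwrite everything past the new length with zero
pages and fsync, so nothing stale can ever be read back as live.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* stand-in for PostgreSQL's page size */

/*
 * Try to truncate fd to new_len.  If that fails, fall back to overwriting
 * every page past new_len with zeros, so stale tuples can't reappear as
 * live when those pages are later read back from disk.
 */
static int
truncate_or_zero(int fd, off_t new_len)
{
    static const char zeroes[BLCKSZ];   /* all-zero page image */
    struct stat st;
    off_t       off;

    if (ftruncate(fd, new_len) == 0)
        return 0;               /* normal case */

    fprintf(stderr, "ftruncate failed: %s; zero-filling the tail instead\n",
            strerror(errno));

    if (fstat(fd, &st) != 0)
        return -1;

    /* Relation segments are whole pages, so step through page by page. */
    for (off = new_len; off < st.st_size; off += BLCKSZ)
    {
        if (pwrite(fd, zeroes, BLCKSZ, off) != BLCKSZ)
            return -1;          /* still failing: caller has to PANIC/retry */
    }

    /* Make sure the zero pages reach disk before declaring success. */
    return fsync(fd);
}

int
main(void)
{
    int     fd = open("/tmp/truncate_or_zero_demo", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;
    /* Pretend the relation segment has 4 pages and VACUUM keeps only 1. */
    if (ftruncate(fd, 4 * BLCKSZ) != 0 ||
        truncate_or_zero(fd, 1 * BLCKSZ) != 0)
        return 1;
    return close(fd);
}

In the replay path the idea would be the same: if XLOG_SMGR_TRUNCATE hits
the failure again, zero-fill the pages we couldn't delete and treat the
record as applied.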

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
