Re: Recovery inconsistencies, standby much larger than primary

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recovery inconsistencies, standby much larger than primary
Date: 2014-02-15 03:30:45
Message-ID: 31058.1392435045@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
>> Going over this I think this is still a potential issue:
>> On 31 Jan 2014 15:56, "Andres Freund" <andres(at)2ndquadrant(dot)com> wrote:
>>> I am not sure that explains the issue, but I think the redo action for
>>> truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just
>>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
>>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
>>> not delete individual segment files that are not needed anymore, right?
>>> If we crash in the midst of that a new mdtruncate() will be issued, but
>>> it will get a shorter value back from mdnblocks().

>> I'm not too familiar with md.c but my reading of the code is that we
>> truncate the files in reverse order?

> That's what I had assumed as well, but it doesn't look that way:

No, it's deleting forward.

We could probably fix things so it deleted backwards; it'd be a tad
tedious because the list structure isn't organized that way, but we
could do it. Not sure if that's good enough though. If you don't
want to assume the filesystem metadata is coherent after a crash,
we might have nonzero-size segments after zero-size ones, even if
the truncate calls had been issued in the right order.
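To make the failure mode concrete, here is a toy simulation in Python. It is
only a sketch under stated assumptions, not the md.c code: a relation is
modelled as a list of per-segment block counts, RELSEG_SIZE is shrunk to 4,
and visible_segments() and truncate_segments() are hypothetical names. The one
behavior it borrows from the discussion above is that the segment chain ends
at the first segment shorter than a full segment.

```python
RELSEG_SIZE = 4  # blocks per segment; tiny value chosen for illustration


def visible_segments(segs):
    """Number of segments reachable by a chain walk.

    Assumption: the walk stops at the first segment that is not full,
    so anything after a short segment is invisible.
    """
    n = 0
    for size in segs:
        n += 1
        if size < RELSEG_SIZE:
            break
    return n


def truncate_segments(segs, new_nblocks, crash_after=None, reverse=False):
    """Truncate (never delete) the visible segments down to new_nblocks.

    crash_after=N simulates a crash once N truncate calls have completed.
    Returns True if the loop finished, False if it "crashed".
    """
    order = range(visible_segments(segs))
    if reverse:
        order = reversed(order)
    calls = 0
    for segno in order:
        start = segno * RELSEG_SIZE
        keep = min(max(new_nblocks - start, 0), RELSEG_SIZE)
        if keep < segs[segno]:
            if crash_after is not None and calls == crash_after:
                return False  # simulated crash before this truncate
            segs[segno] = keep
            calls += 1
    return True


# Forward order: crash after the first truncate, then replay the record.
segs = [4, 4, 4, 4]
truncate_segments(segs, 2, crash_after=1)  # segs is now [2, 4, 4, 4]
truncate_segments(segs, 2)                 # replay sees only segment 0
print(segs)                                # [2, 4, 4, 4]
```

In this model the replay leaves three full segments behind, because the chain
walk no longer reaches them. Passing reverse=True to both calls ends with
[2, 0, 0, 0] instead, but only because the model assumes each truncate is
durable in the order issued, which is exactly the assumption questioned above.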

Another possibility is to keep opening and truncating files until
we don't find the next segment in sequence, looking directly at the
filesystem rather than at the mdfd chain. I don't think this would be
appropriate in normal operation, but we could do it if InRecovery
(and maybe even only if we don't think the database is consistent?).
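A rough illustration of the probing idea, again as a Python sketch rather than
a patch: it ignores the in-memory chain and keeps looking for the next segment
file on disk until one is missing. SEG_BYTES, segment_path() and
truncate_by_probing() are hypothetical names, sizes are in bytes for
simplicity, and the "base, base.1, base.2, ..." naming follows the usual
segment file layout.

```python
import os
import tempfile

SEG_BYTES = 8  # bytes per segment; tiny value chosen for illustration


def segment_path(base, segno):
    """Segment 0 is the bare file name; later segments get a .N suffix."""
    return base if segno == 0 else f"{base}.{segno}"


def truncate_by_probing(base, new_size):
    """Truncate every segment file that exists on disk, in sequence.

    Stops at the first missing segment number, regardless of how short
    the earlier segments are. Returns the number of segments examined.
    """
    segno = 0
    while os.path.exists(segment_path(base, segno)):
        path = segment_path(base, segno)
        keep = min(max(new_size - segno * SEG_BYTES, 0), SEG_BYTES)
        if os.path.getsize(path) > keep:
            os.truncate(path, keep)
        segno += 1
    return segno


# State left by a crashed forward truncate: segment 0 short, the rest full.
base = os.path.join(tempfile.mkdtemp(), "16384")
for segno, size in enumerate([3, 8, 8]):
    with open(segment_path(base, segno), "wb") as f:
        f.write(b"x" * size)

truncate_by_probing(base, 3)
print([os.path.getsize(segment_path(base, i)) for i in range(3)])  # [3, 0, 0]
```

Because the loop is driven by what exists on disk, a short or empty segment in
the middle does not hide the ones after it. The cost is an extra existence
check per segment, which is one reason to restrict it to recovery.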

regards, tom lane
