Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: TipTop Labs <office(at)tiptop-labs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Date: 2018-04-04 07:26:25
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote:
> Truncation of the file was moved to the second loop. Truncation occurs
> there if chunks are written into files at offset 0. This is the case
> for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures
> that these chunks are processed first.

Yes. I have spent a large portion of my day looking at that. And
actually this is wrong as well. First, please find attached a patch I
have written while testing your changes. There is nothing much new into
it, except that I added more comments and a test (other tests fail as
well, that does not matter).

First, I have looked at a way to not rely on the tweak with chunkoff by
extending the temporary table used to register the chunks with an extra
column which tracks the type of action which is taken on the file.
Another one I looked at was to pass down the action type taken on the
entry directly to receiveFileChunks(). However both of them have been

So after that I falled back to your patch and began testing it, which is
where I noticed that we can *never* give the insurance to recover a data
folder on which an error has happened in the middle of a pg_rewind. The
reason for that is quite simple: even if the truncation has been moved
down to the moment where the first chunk of a file is received, you may
have already done work on some relation files. Particularly, some of
them may have been truncated down to a given size without a new range of
blocks fetched from the source. So the data folder would be in an
inconsistent state if trying to rewind it again.

If the control file from the source has been received and then a
read-only error is found, then in order to perform a subsequent rewind
you need to start and stop the target cluster cleanly, or update
manually the control file and mark it as cleanly stopped. And this is a
recipe for data corruption as we now start a rewind based on the
guarantee that a target cluster has been able to cleanly stop, with a
shutdown checkpoint issued. You could always try to start and stop the
cluster cleanly, but if your target instance replays WAL on data blocks
which have been truncated by a previous rewind, you would need to
re-create an instance using a new base backup, which is actually harder
to diagnose.

The patch I sent prevously which makes the difference between relation
files and "configuration" files helps a bit, still it makes me uneasy
because it will not be able to handle correctly failures on files which
are part of a relation but need to be fully copied. So I think that we
should drop this approach as well for its complexity.

Another thing that could definitely help is to work on a patch which
checks that *all* files are writable before doing any operations for
files which are present on both the source and the target, and make the
rewind nicely fail with the source still intact. That's quite a bit of
refactoring ahead for little benefit I think in the long run.

What we could simply do is to document the limitation. Hence, if both
the target and the source have read-only files, then the user needs to
be sure to remove them from the target before doing the rewind, and do
the re-linking after the operation. If an error is found while
processing the rewind, then a new base backup needs to be taken.

The refactoring I am mentioning two paragraphs above could definitely be
done to ease user's life, but I would treat that as a new feature and
not a bug. Another approach, way more simple that we could do is to
scan the data folder of the target before doing anything and see if some
files are not writable, and then report this list back to the user. If
we do that even for files which are on the target but not the source
then that would be a loss for some cases, however I would like to
imagine that when using pg_rewind the configuration file set is
consistent across nodes in a majority of cases, so a user could remove
those non-writable files manually. This has also the advantage to be
non-intrusive, so a back-patch is definitely possible.


Attachment Content-Type Size
rewind-readonly-v3.patch text/x-diff 4.9 KB

In response to


Browse pgsql-bugs by date

  From Date Subject
Next Message molofeev 2018-04-04 08:20:55 Re: BUG #15142: ERROR: MultiXactId nnnnn has not been created yet -- apparent wraparound in v9.5
Previous Message Sergei Kornilov 2018-04-04 06:45:21 Re: 答复: autovacuum can not remove dead tuples

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2018-04-04 07:32:04 Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Previous Message Amit Langote 2018-04-04 07:20:10 Re: [HACKERS] Runtime Partition Pruning