Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: TipTop Labs <office(at)tiptop-labs(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Date: 2018-04-05 01:38:18
Message-ID: 20180405013818.GC1732@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
> When I first started looking at this thread, I wondered if maybe somebody
> had had in mind to create an active defense against starting a postmaster
> in an inconsistent target cluster, by dint of intentionally truncating
> pg_control before the transfer starts and not making it valid again till
> the very end. It's now clear from looking at the code that that's not
> what's going on :-(. But I wonder how hard it would be to make it so,
> and whether that'd be worth doing if it's not too hard.

One safeguard I can see is to add an extra loop just before processing
the file map to check if a file planned for copy can be sanely opened or
not. That's simple to do with open_target_file(), followed by
close_target_file(). If there is an EACCES error, then we complain at
this early stage, and the data folder can be reused after cleaning up
the data folder from those files. That's not actually hard to do at
quick glance, but I may be missing something so let's be careful.

Another approach that we could consider, HEAD-only though, is to extend
pg_stat_file so as an entry's st_mode is available and we could use that
to not copy the file's data from the start. That would be actually
quite clean, particularly for large read-only files.

> Actually, probably a safer way to attack that would be to remove or
> rename the topmost PG_VERSION file, and then put it back afterwards.
> That'd be far easier to recover from manually, if need be, than
> clobbering pg_control.
>
> In any case, that seems separate from the question of what to do with
> read-only files in the data directory. Should we push forward with
> committing Michael's previous patch, and leave that issue for later?

That one is not good either as long as we don't make the difference in
the data folder between three types of files:
1) Relation files.
2) System files which are critical at diverse degrees for the system.
3) Custom configuration files.

pg_rewind is able to consider 1) as a separate category as it usually
needs to fetch a range set of blocks, and puts 2) and 3) in the same
bucket. It could be possible to split 2) and 3) but the maintenance
cost is high as for each new system file added in Postgres we would need
to maintain an extra filtering logic in pg_rewind, something which is
most likely going to rot, leading to potential corruption problems.

My point is that I have seen on some VMware test beds a kernel going
rogue because of ESX servers, causing a PostgreSQL-related partition to
go in read-only mode. So if you use my previous patch, and that the
partition where the target server's data folder is located turns
all-of-a-sudden to read-only, there is a risk of silencing real
failures. Even if the fetched data is ordered, paths like pg_xact and
pg_twophase are processed after all relation files, so failing to write
them correctly would silently break a set of transactions :(

So please let me suggest a set of two patches:
1) 0001 is a documentation patch which should be back-patched. This
tells two things:
1-1) If pg_rewind fails in the middle of processing, then the best thing
to do is to re-create the data folder using a new fresh backup.
1-2) Be careful of files that the process running pg_rewind is not able
to write to. If those are located on both the target and the source,
then pg_rewind will copy it from the source to the target, leading to an
immediate failure. After removing those unwritable files, be sure to
clean up anything which has been copied from the source and should be
read-only, then rebuild the links.
2) 0002 is a patch for HEAD to order the set of chunks fetched, because
this makes the data writes sequentials which is good for performance and
limits the number of open()/close() on the files of the target's data
folder.

I think that this is the best answer we can give now as there are a
couple of ways to improving the handling of any checks, however there
are side effects to any of them. This also feels a bit like a new
feature, while the documentation in pg_rewind sucks now for such
details. I have also spent some time detailing the commit message of
the first patch about all that, that would be nice to keep in the git
history.

If the extra checks I am mentioning at the top of this email are worth
adding, then it is possible to drop 1-2) partially, rewriting it to
mention that pg_rewind tells the user at early stages about the failure
and that a data folder can be again reused. This does not change the
fact that a pg_rewind breaking mid-flight should result in a new base
backup, so documentation is mandatory anyway. And there are a couple of
approaches that we could consider here. At least I saw two of them.
Other folks may see better approaches than I did, who knows..

Thanks,
--
Michael

Attachment Content-Type Size
0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch text/x-diff 4.0 KB
0002-Enforce-ordering-of-data-chunks-fetched-in-pg_rewind.patch text/x-diff 1.6 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David Rowley 2018-04-05 02:32:43 Re: BUG #15143: Window Functions – Paranthese not allowed before OVER term
Previous Message David G. Johnston 2018-04-05 00:23:21 Re: BUG #15143: Window Functions – Paranthese not allowed before OVER term

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-04-05 01:44:39 Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
Previous Message Michael Paquier 2018-04-05 00:06:21 Re: Rewrite of pg_dump TAP tests