Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From: TipTop Labs <office(at)tiptop-labs(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Date: 2018-04-05 05:04:40
Message-ID: E1979209-A39E-4229-BBE0-59BC43FB4691@tiptop-labs.com
Lists: pgsql-bugs pgsql-hackers


> On Apr 5, 2018, at 3:38 AM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Apr 04, 2018 at 02:50:12PM -0400, Tom Lane wrote:
>> When I first started looking at this thread, I wondered if maybe somebody
>> had had in mind to create an active defense against starting a postmaster
>> in an inconsistent target cluster, by dint of intentionally truncating
>> pg_control before the transfer starts and not making it valid again till
>> the very end. It's now clear from looking at the code that that's not
>> what's going on :-(. But I wonder how hard it would be to make it so,
>> and whether that'd be worth doing if it's not too hard.
>
> One safeguard I can see is to add an extra loop just before processing
> the file map, to check whether each file planned for copy can be sanely
> opened. That is simple to do with open_target_file() followed by
> close_target_file(). If there is an EACCES error, we complain at this
> early stage, and the data folder can be reused once those files have
> been cleaned up. That's not actually hard to do at a quick glance, but
> I may be missing something, so let's be careful.
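For illustration, such a pre-flight writability probe could look roughly like this. This is a generic sketch, not pg_rewind's actual open_target_file()/close_target_file() internals; the function name is mine:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Return 1 if path can be opened for writing (creating it if absent),
 * 0 if permission is denied (EACCES), and -1 on any other error.
 * Running a pass like this over the whole file map before any data is
 * transferred would let the tool fail early, with the target data
 * folder still intact, instead of failing mid-copy.
 */
static int
check_target_writable(const char *path)
{
	int			fd = open(path, O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
		return (errno == EACCES) ? 0 : -1;
	close(fd);
	return 1;
}
```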
>
> Another approach that we could consider, HEAD-only though, is to extend
> pg_stat_file so that each entry's st_mode is available, and use that to
> avoid copying the file's data in the first place. That would actually
> be quite clean, particularly for large read-only files.
>
>> Actually, probably a safer way to attack that would be to remove or
>> rename the topmost PG_VERSION file, and then put it back afterwards.
>> That'd be far easier to recover from manually, if need be, than
>> clobbering pg_control.
>>
>> In any case, that seems separate from the question of what to do with
>> read-only files in the data directory. Should we push forward with
>> committing Michael's previous patch, and leave that issue for later?
>
> That one is not good either as long as we don't distinguish between
> three types of files in the data folder:
> 1) Relation files.
> 2) System files which are critical at diverse degrees for the system.
> 3) Custom configuration files.
>
> pg_rewind is able to treat 1) as a separate category, as it usually
> needs to fetch a set of block ranges, but it puts 2) and 3) in the same
> bucket. It would be possible to split 2) and 3), but the maintenance
> cost is high: for each new system file added in Postgres we would need
> to maintain extra filtering logic in pg_rewind, something which is most
> likely to rot, leading to potential corruption problems.
>
> My point is that I have seen, on some VMware test beds, a kernel going
> rogue because of ESX servers, causing a PostgreSQL-related partition to
> switch to read-only mode. So with my previous patch, if the partition
> holding the target server's data folder suddenly turns read-only, there
> is a risk of silencing real failures. Even if the fetched data is
> ordered, paths like pg_xact and pg_twophase are processed after all
> relation files, so failing to write them correctly would silently break
> a set of transactions :(
>
> So please let me suggest a set of two patches:
> 1) 0001 is a documentation patch which should be back-patched. It
> tells two things:
> 1-1) If pg_rewind fails in the middle of processing, then the best
> thing to do is to re-create the data folder from a fresh backup.
> 1-2) Be careful of files that the process running pg_rewind is not able
> to write to. If such a file exists on both the target and the source,
> pg_rewind will copy it from the source to the target, leading to an
> immediate failure. After removing those unwritable files, be sure to
> clean up anything which has been copied from the source and should be
> read-only, then rebuild the links.
> 2) 0002 is a patch for HEAD to order the set of chunks fetched, because
> this makes the data writes sequential, which is good for performance and
> limits the number of open()/close() calls on the files of the target's
> data folder.
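The ordering 0002 describes amounts to sorting the fetched chunks by (path, offset), so writes within a file become sequential and the target file can stay open across consecutive chunks. A rough sketch of such a comparator; the struct and names are illustrative, not pg_rewind's actual ones:

```c
#include <stdlib.h>
#include <string.h>

/* A fetched data chunk, identified by target path and byte offset. */
typedef struct
{
	const char *path;
	long		offset;
} fetch_chunk;

/*
 * qsort() comparator: group chunks by path, then order by offset,
 * so that all writes to one file happen back to back and in order.
 */
static int
compare_chunks(const void *a, const void *b)
{
	const fetch_chunk *ca = a;
	const fetch_chunk *cb = b;
	int			cmp = strcmp(ca->path, cb->path);

	if (cmp != 0)
		return cmp;
	if (ca->offset != cb->offset)
		return (ca->offset < cb->offset) ? -1 : 1;
	return 0;
}
```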

Thanks for crediting me in patch 0002.

One final word about my original patch, since the commit message for 0001 refers to its limitations: I acknowledge that it cannot cover situations where the order of processing is such that the second loop detects non-writable files only after it has already done work on some others. It was meant to fix the specific problem I ran into without breaking the regression tests that ship with the source code ("make check"). The main reason it is not more elaborate is that I did not want to submit anything affecting more than a few lines of code without prior input from the community.

So from my perspective I welcome both 0001 and 0002.

> I think that this is the best answer we can give for now: there are a
> couple of ways to improve the handling of any checks, but all of them
> have side effects. This also feels a bit like a new feature, while the
> pg_rewind documentation currently lacks such details. I have also
> spent some time detailing the commit message of the first patch about
> all that, which would be nice to keep in the git history.
>
> If the extra checks I am mentioning at the top of this email are worth
> adding, then it is possible to drop 1-2) partially, rewriting it to
> mention that pg_rewind tells the user about the failure at an early
> stage and that the data folder can be reused. This does not change the
> fact that a pg_rewind run breaking mid-flight should result in a new
> base backup, so the documentation is mandatory anyway. There are a
> couple of approaches that we could consider here; I saw at least two.
> Other folks may see better approaches than I did, who knows..
> Thanks,
> --
> Michael
> <0001-Add-note-in-pg_rewind-documentation-about-read-only-.patch><0002-Enforce-ordering-of-data-chunks-fetched-in-pg_rewind.patch>

--
Christian
