Re: BUG #14999: pg_rewind corrupts control file global/pg_control

From: TipTop Labs <office(at)tiptop-labs(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Date: 2018-01-05 03:36:22
Message-ID: 0D104D1E-6103-49F4-8AB8-B5CD92949E18@tiptop-labs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

> On Jan 5, 2018, at 2:26 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>
> On Fri, Jan 5, 2018 at 5:06 AM, PG Bug reporting form
> <noreply(at)postgresql(dot)org> wrote:
>> I have encountered a bug in PostgreSQL 10.1: when the target directory for
>> pg_rewind contains a read-only file (e.g. server.key), pg_rewind exits with
>> "could not open target file" (legitimate) and corrupts the control file
>> global/pg_control to size 0 (bug). From now on, pg_rewind always exits with
>> "unexpected control file size 0, expected 8192" and a restore from
>> pg_basebackup is needed.
>
> Likely that's reproducible down to 9.5 where pg_rewind has been
> introduced. I agree that we should do better with failure handling
> here. Corrupting the control file is not cool.

I can already confirm that this also occurs with PostgreSQL 9.6.

>> A patch for branch REL_10_STABLE of repository
>> https://github.com/postgres/postgres, a README, and Dockerfiles for
>> demonstrating both bug and patch are available from
>> https://github.com/tiptop-labs/postgres-patches .
>
> You may not know when github.com is gone, which would cause the data
> you are attaching here to go away. What looks interesting is
> pg_rewind.patch, which is what you are proposing as a fix, right? I am
> attaching it here for PG archives.

Yes, it's pg_rewind.patch. There is a bit of rationale in README, feel free to also attach here if/as needed.

>
> - open_target_file(filename, false);
> + /* Trunc target file for action FILE_ACTION_COPY. */
> + open_target_file(filename, chunkoff == 0);
>
> write_target_range(chunk, chunkoff, chunksize);
> Hm. Let me think more about that as there are quite a few
> distributions that link to SSL files that cannot be written, and
> pg_rewind copies all configuration files in full.

Could you please identify a specific distribution with such properties; I'd like to have a look there too. Thanks.

> - "FROM fetchchunks\n";
> + "FROM fetchchunks\n"
> + "ORDER BY path, begin\n";
> If this is aimed at improving the performance of pg_rewind by making
> chunk writes more sequential, it should be a different patch. I would
> see value in that. If this is aimed to change the order of the files
> to avoid the write corruption, this is wrong.

It may be wrong then; from README: Truncation of the file was moved to the second loop. Truncation occurs there if chunks are written into files at offset 0. This is the case for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures that these chunks are processed first.

> I would imagine that you could see something similar with the offline
> mode, haven't tested yet though.
> --
> Michael
> <pg_rewind.patch>

--
Christian

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2018-01-05 03:41:55 Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Previous Message Michael Paquier 2018-01-05 01:26:37 Re: BUG #14999: pg_rewind corrupts control file global/pg_control

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-01-05 03:41:55 Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Previous Message Pavan Deolasee 2018-01-05 03:35:02 Re: Faster inserts with mostly-monotonically increasing values