Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary
Date: 2018-03-09 07:11:50
Lists: pgsql-hackers

On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
> Our customer hit another bug of pg_rewind with PG 9.5. The attached
> patch fixes this.

Sorry for my late reply. Unfortunately it took me some time to come
back to this patch.

> ========================================
> After a long run of successful pg_rewind, the synchronized standby
> could never catch up with the primary, emitting the following message
> repeatedly:
> LOG: XX000: could not read from log segment 000000060000028A00000031,
> offset 16384: No error


> ========================================
> If the primary removes WAL files that pg_rewind is going to get,
> pg_rewind leaves 0-byte WAL files in the target directory here:
> [libpq_fetch.c]
> /* Truncate the old file out of the way, if any */
> open_target_file(entry->path, true);
> fetch_file_range(entry->path, 0, entry->newsize);
> break;
> pg_rewind completes successfully, creates recovery.conf, and then
> starts the standby in the target cluster. walreceiver receives WAL
> records and writes them to the 0-byte WAL files. Finally, the xlog
> reader complains that it cannot read a WAL page.

This part qualifies as a data corruption bug as some records are lost by
the backend.

> ========================================
> pg_rewind deletes the file when it finds that the primary has deleted
> it.

The concept looks right to me. I think that this does not apply only to
WAL segments though. You could have a temporary WAL segment that has
been included in the chunk table, which is even more volatile, or a
multixact file, a snapshot file, etc. In short anything which is not a
relation file and could disappear between the moment the file map is
built and the data is fetched from the source server.

@@ -174,7 +173,7 @@ remove_target_file(const char *path)

snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
- if (unlink(dstpath) != 0)
+ if (unlink(dstpath) != 0 && !ignore_error)
I see. So the only reason why this flag exists is that if a file is
large enough to be split into multiple chunks, the first unlink will
work but the successive ones will not. One chunk can be at most
1 million bytes, which is why this matters for WAL segments. Instead
of ignoring *all* errors, let's ignore only ENOENT and rename
ignore_error to missing_ok as well.

You need to update the comment in receiveFileChunks where an entry gets
deleted with basically what I mentioned above, say:
"If a file has been deleted on the source, remove it on the target as
well. Note that multiple unlink() calls may happen on the same file if
multiple data chunks are associated with it, hence ignore
unconditionally anything missing. If this file is not a relation data
file, then it has already been truncated when creating the file chunk
list at the previous execution of the filemap."

Adding a comment on top of remove_target_file to explain what
missing_ok does would also be nice, to keep track of what the code
should do.

The portion about whether data from pg_wal should be transferred or not
is really an improvement that could come in a future version, and what
you are proposing here is more general, so I think that we should go
ahead with it.
